George K. Mikros and Ján Mačutek (Eds.) Sequences in Language and Text
Quantitative Linguistics
Editors: Reinhard Köhler, Gabriel Altmann, Peter Grzybek. Advisory Editor: Relja Vulanović
Volume 69
Sequences in Language and Text Edited by George K. Mikros Ján Mačutek
DE GRUYTER MOUTON
ISBN 978-3-11-036273-2
e-ISBN (PDF) 978-3-11-036287-9
e-ISBN (EPUB) 978-3-11-039477-1
ISSN 0179-3616

Library of Congress Cataloging-in-Publication Data
A CIP catalog record for this book has been applied for at the Library of Congress.

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

© 2015 Walter de Gruyter GmbH, Berlin/Boston
Printing and binding: CPI books GmbH, Leck
♾ Printed on acid-free paper
Printed in Germany
www.degruyter.com
Foreword

Gustav Herdan says in his book Language as Choice and Chance that "we may deal with language in the mass, or with language in the line". Within the framework of quantitative linguistics, the language-in-the-mass approach dominates; the other direction of research, however, has enjoyed increased attention in the last few years. This book documents some recent results in this area.

It contains an introduction and a collection of thirteen papers by altogether twenty-two authors. The contributions, ordered alphabetically according to the first author's surname, cover quite a broad spectrum of topics. They can, at least roughly, be divided into two groups (admittedly, the distinction between the groups is a bit fuzzy). The first of them consists of theoretically oriented papers. One can distinguish three subgroups here – the development of text characteristics, either within a text or in time (Andreev and Borisov, Čech, Köhler and Tuzzi, Rovenchak, Zörnig), linguistic motifs (Köhler, Mačutek and Mikros, Milička), and, in the last contribution from this group, Benešová and Čech demonstrate that, with respect to the Menzerath-Altmann law, a randomized text behaves differently from a "normal" one, which confirms the role of the sequential structure of a text. Four papers belonging to the other group focus (more or less) on applications, or are at least inspired by real-world problems. Bavaud et al. apply the apparatus of autocorrelation to different text properties. Pawłowski and Eder investigate the development of rhythmic patterns in a particular Old Czech text. A prediction of sales trends based on the analysis of short texts from Twitter is presented by Rentoumi et al. Finally, Rama and Borin discuss different measures of string similarity and their appropriateness for automatic text classification.

We hope that this book will provide an impetus for further development of linguistic research in both the language-in-the-mass and the language-in-the-line directions. It goes without saying that the two approaches complement rather than exclude each other.

We would like to express our thanks to Gabriel Altmann, who gave the initial impulse for this book, and to Reinhard Köhler for his continuous help and encouragement during the process of editing. Ján Mačutek would also like to acknowledge VEGA grant 2/0038/12, which supported him during the preparation of this volume.

George K. Mikros, Ján Mačutek
Contents

Foreword V

Gabriel Altmann
Introduction 1

Sergey N. Andreev and Vadim V. Borisov
Linguistic Analysis Based on Fuzzy Similarity Models 7

François Bavaud, Christelle Cocco, Aris Xanthos
Textual navigation and autocorrelation 35

Martina Benešová and Radek Čech
Menzerath-Altmann law versus random model 57

Radek Čech
Text length and the lambda frequency structure of a text 71

Reinhard Köhler
Linguistic Motifs 89

Reinhard Köhler and Arjuna Tuzzi
Linguistic Modelling of Sequential Phenomena: The role of laws 109

Ján Mačutek and George K. Mikros
Menzerath-Altmann Law for Word Length Motifs 125

Jiří Milička
Is the Distribution of L-Motifs Inherited from the Word Length Distribution? 133

Adam Pawłowski and Maciej Eder
Sequential Structures in "Dalimil's Chronicle": Quantitative analysis of style variation 147

Taraka Rama and Lars Borin
Comparative Evaluation of String Similarity Measures for Automatic Language Classification 171

Vassiliki Rentoumi, Anastasia Krithara, Nikos Tzanos
Predicting Sales Trends: Can sentiment analysis on social media help? 201

Andrij Rovenchak
Where Alice Meets Little Prince: Another approach to study language relationships 217

Peter Zörnig
A Probabilistic Model for the Arc Length in Quantitative Linguistics 231

Subject Index 247
Authors' Index 253
Authors' Addresses 257
Gabriel Altmann
Introduction
Sequences occur in texts, written or spoken, and not in language, which is considered a system. However, the historical development of some phenomenon in language can be considered a sequence, too. In the first case we have a linear sequence of units represented as classes, or as properties represented by numbers resulting from measurements on the ordinal, interval or ratio scales. In the second case we have a linear sequence of years or centuries and some change of states. A textual sequence differs from a mathematical sequence; the latter can usually be captured by a recurrence function. A textual sequence is rather a repetition pattern displaying manifold regularities such as distances, lumping, strengthening or weakening towards the end of a sentence, chapter or text, oscillations, cohesion, etc. This holds both for units and their conceptually constructed and measured properties, as well as for combinations of these properties, which are abstractions of a higher degree.

The sequential study of texts may begin with the scrutinizing of repetitions. Altmann (1988: 4f.) showed several types of repetitions, some of which can be studied sequentially:
(a) Runs representing uninterrupted sequences of identical elements, especially formal elements such as word length, metric patterns, or sentence types.
(b) Aggregative repetitions arising from Skinner's formal reinforcement, which can be expressed by the distribution of distances or by a decreasing function of the number of identities.
(c) The relaxing of strict identity may yield aggregative repetitions of similar entities.
(d) Cyclic repetitions occurring especially in irregular rhythm, in prose rhythm, in sentence structure, etc.
Various other forms of sequences are known from text analyses. They will be studied thoroughly in the next chapters.

If we study the sequences of properties of text entities, an enormous field of research opens up before us. An entity has as many properties as we are able to discriminate at the given state of science. The number of properties increases with time, as can be seen in 150 years of linguistic research. The properties are not "possessed" by the entities; they do not belong to some fictive essence of the entities; they are the results of definitions and measurements ascribed to the entities by us. In practice, we can freely devise new properties, but not all of them need to turn out to be useful. In practice, a property is useful (reasonable)
if its behaviour can be captured by an a priori hypothesis. (Usually, we proceed by trial and error, isolate conspicuous phenomena and search for a regularity inductively.) A property must display some kind of regularity, even if the regularity is a long-range one. Further, it is reasonable if it is not isolated but significantly correlated with at least some other property. This requirement makes it incorporable into a theory. A property should be an element of a control cycle. Of course, a property does not have the same sequential realization in all texts. The difference may consist in parameter values or in form modifications, resulting from the differences between text sorts, languages, etc. But all these are only boundary conditions if we are able to set up at least an elementary theory from which we derive our models.

In order to illustrate the proliferation of properties, let us consider those of the concept of "word". Whatever its definition – there are several dozen – one can find a quantification and measure the following properties: (a) length measured in terms of syllable, morpheme or phoneme numbers, or in terms of duration; (b) frequency in text; (c) polysemy in the dictionary; (d) polytexty, representing the number of different contexts in which it occurs; (e) morphological status: simple, derived, compound, reduplicated; (f) number of parts-of-speech classes to which it belongs (with or without an additional affix); (g) productivity measured in terms of derivates, compounds or both which can be formed with the word; (h) age given in number of years or centuries; (i) valency; (j) the number of its grammatical categories (conjugation, declension, gender, tenses, moods, …) or the number of different affixes it can obtain; (k) emotionality vs. notionality, e.g. mother vs. clip; (l) meaning abstractness vs. concreteness; (m) generality vs. specificity; (n) degree of dogmatism; (o) number of associations in an association dictionary (= connotative potency); (p) number of synonyms (in a dictionary); (q) number of possible functions in a sentence; (r) diatopic variation (= in how many sites of a dialectological atlas it exists); (s) number of dialectal competitors; (t) discourse properties; (u) state of standardization; (v) originality (genuine, calque, borrowing); (w) the language of origin (with calques and borrowings); (x) phrasal potentiality (there are dictionaries of this property); (y) degree of verb activity; (z) subjective ratings proposed by Carroll (1968). This list can be extended. An analogous list can be set up for every linguistic entity; hence we do not know how many sequences of different kinds can be created in a text.

At this elementary scientific level there is no hindrance to using intuition, Peircean abduction, inductive generalizations, qualitative comparisons, classifications, etc. All these are simply the first steps on the way to the construction of a theory. But before we are able to work theoretically, observed sequences can be used (a) to characterize texts by means of some numerical indicators; (b) to
state the mean of the property and compare texts, setting up text classes; (c) to find tendencies depending (usually) on other properties; (d) to search for idiosyncrasies; (e) to set up an empirical control cycle joining several properties; and (f) to use the sequences in practice, e.g. in psychology and psychiatry.

Even though, according to Bunge's dictum, "no laws, no theory" and "no theory, no explanation", we may try to find at least some "causes" or "supporting circumstances" or "motives" or "mechanisms" leading to the rise of sequences. They are: (a) Restrictions of the inventories of units. The greater the restrictions (= the smaller the inventory), the more frequently a unit must be repeated. A certain sound has a stronger repetitiveness than a certain sentence because the inventory of sentences is infinite. (b) The grammar of the given language, which does not allow many repetitions of the same word or word class in the immediate neighbourhood. (c) Thematic restriction. This restriction forces the repetition of some words, terms etc., but hinders the use of some constructions. (d) Stylistic and aesthetic grounds, which may evoke some regularities like rhyme and rhythm, but avoid the repetition of words. (e) Perseveration, reinforcing repetitions, supporting self-stimulation, etc. This phenomenon is well known in psychology and psychiatry. (f) The flow of information: in didactic works it is ideal if words are repeated in order to concentrate the text around the main theme; in press texts the information flow is more rapid. It is not always possible to find the ground for the rise of a given sequence. Speaking is a free activity; writing is full of pauses for thinking, corrections, additions or omissions, and is not always a spontaneous creation. Hence the study of texts is a very complex task, and the discovery of sequences is merely one of the steps towards theory formation.

Sequences are secondary units, and their establishment is a step in concept formation. As soon as they are defined, one strives for the next definitions, viz. the quantification of their properties and behaviour. By means of analogy, abduction, intuition, or inductive generalization one achieves a state in which it is possible to set up hypotheses. They may concern any aspect of sequences, including form, repetition, distances, perseveration, dependences, history, role, etc. Of course, hypotheses are a necessary but not a sufficient component of a theory. They may (and should) be set up at every stage of investigation, but the work attains its goal if we are able to set up a theory from which the given hypotheses may be derived deductively. Usually, this theory has a mathematical form and one must use the means of mathematics. The longest way is that of testing the given hypotheses. One single text does not suffice, and one single language is merely a good beginning. Often, a corroboration in an Indo-European language seduces us into considering the hypothesis as sufficiently corroborated – but this is merely the first step. Our recommendation is: do not begin with English.
Ask a linguist who knows a different language (usually, linguists do) for data and test your hypothesis at the same time on a different text sort. If the hypothesis can be sufficiently corroborated, try to find some links to other properties. Set up a control cycle and expand it stepwise. The best example is Köhler's linguistic synergetics containing units, properties, dependencies, forces and hypotheses. The image of such a system gets more complex with every added entity, but this is the only way to construct theories in linguistics. They furnish us with laws, which are derived and corroborated hypotheses, and thereby at the same time with explanations, because phenomena can be subsumed under a set of laws. There are no overall theories of language; there are only restricted ones. They expand in the course of time, just as in physics, but none of them embraces the language as a whole, just as in physics.

In the sequel we shall list some types of sequences in text.

(A) Symbolic sequences. If one orders the text entities into nominal classes – which can also be dichotomous – one obtains a sequence of symbols. Such classes are e.g. parts of speech, reduction to noun (N) and rest (R) in order to study the nominal style, or to adjectives (A), verbs (V) and rest (R) in order to study the ornamentality vs. activity of the text, or to accentuated and non-accentuated syllables in order to study rhythm, etc. Symbolic sequences can be studied as runs, as a sequence of distances between equal symbols, as a devil's staircase, as thematic chains, etc.

(B) Numerical sequences may have different forms: there can be oscillation in the values, there are distances between neighbours, one can compute the arc length of the whole sequence, one can consider maxima and minima in the parts of the sequence and compute Hurst's coefficient; one can consider the sequence a fractal and compute its different properties; numerical sequences can be subdivided into monotonic sub-sequences which have their own properties (e.g. length), probability distributions, etc. Numerical sequences represent a time series which may display autocorrelation or "seasonal oscillation"; they may represent a Fourier series, damped oscillation, climax, etc.

(C) Musical sequences are characteristic of styles and developmental stages, as has already been shown by Fucks (1968). Besides motifs defined formally and studied by Boroda (1973, 1982, 1988), there is the possibility of studying the complete full score sequentially, obtaining multidimensional sequences. In music we have, as a matter of fact, symbolic sequences, but the symbols have ordinal values, thus different types of computations can be performed. Spoken text is a multidimensional sequence: there are accents, intonation, tones, which are not marked in written texts. Even a street can be considered a simple sequence if we restrict our attention to one single property. Observations are always simplified in order to make our conceptual operations possible
and lucid. Nothing can be captured in its wholeness because we do not even know what it is.

We conjecture that textual sequences abide by laws, but there is a great number of boundary conditions which will hinder theory construction. Most probably we shall never discover all boundary conditions, but we can help ourselves by introducing uninterpreted parameters which may, but need not, change with the change of the independent variable. In many laws derivable from the "unified theory" (cf. Wimmer, Altmann 2005) there is usually a constant representing the state of the language. At the beginning of research we set up elementary hypotheses and improve and specify them step by step. Sometimes we simply conjecture that there may be some trend, or some relation between two variables, and try to find an empirical formula. In order to remove the "may be" formulation, we stand at the beginning of a long way which begins with quantification and continues with measurement, conjecturing, testing, systematization, and explanation. Even a single hypothesis may keep many teams of researchers occupied, because a well-corroborated hypothesis means "corroborated in all languages". And since there is no linguist having all languages at his disposal, every linguistic hypothesis is a challenge for many generations of researchers.

Consider for example the famous Zipf's law, represented by a power function used in many scientific disciplines (http://www.nslij-genetics.org/wli/zipf). Originally there was only a visual conjecture made by Zipf without any theoretical foundation. Later on, one began to search for modifications of the function, found a number of exceptions and fundamental deviations from the empirical data depending on the prevailing "type" realized in the language, changed the theoretical foundations, etc. Hence, no hypothesis and none of its mathematical expressions (models) are final. Nevertheless, the given state of the art helps us to get oriented, and testing the hypothesis on new data helps us to accept or to modify it. Sometimes a deviation merely means taking boundary conditions into account. As an example, consider the distribution of word length in languages: it has been tested in about 50 languages using ca. 3000 texts (cf. http://www.gwdg.de/~kbest/litlist.htm), but in Slavic languages we obtain different results depending on whether we consider the non-syllabic prepositions as prefixes/clitics or as separate words. The same holds for Japanese postpositions, etc. Thus already the first qualitative interpretation of facts may lead to different data and consequently to different models.

In spite of the labile and ever-changing (historically, geographically, idiolectally, sociolectally, etc.) linguistic facts from which we construct our data, we cannot stay forever at the level of rule descriptions but must try to establish laws and systems of laws, that is, we try to establish theories. But even theories may
concern different aspects of reality. There is no overall theory. While Köhler (1986, 2005) in his synergetic linguistics concentrated on the links among punctual properties, here we look at the running text, try to capture its properties and find links among these sequential properties. The task is enormous and cannot be accomplished by an individual researcher. Even a team of researchers will need years in order to reach the boundary of the first stage of a theory. Nevertheless, one must begin.
References

Altmann, Gabriel. 1988. Wiederholungen in Texten. Bochum: Brockmeyer.
Boroda, Mojsej G. 1973. K voprosu o metroritmičeski elementarnoj edinice v muzyke. Bulletin of the Academy of Sciences of the Georgian SSR 71(3). 745–748.
Boroda, Mojsej G. 1982. Die melodische Elementareinheit. In Jurij K. Orlov, Mojsej G. Boroda & Isabella Š. Nadarejšvili (eds.), Sprache, Text, Kunst. Quantitative Analysen, 205–222. Bochum: Brockmeyer.
Boroda, Mojsej G. 1988. Towards a problem of basic structural units of a musical text. Musikometrika 1. 11–69.
Carroll, John B. 1968. Vectors of prose style. In Thomas A. Sebeok (ed.), Style in language, 283–292. Cambridge, Mass.: MIT Press.
Fucks, Wilhelm. 1968. Nach allen Regeln der Kunst. Stuttgart: Dt. Verlags-Anstalt.
Köhler, Reinhard. 1986. Zur linguistischen Synergetik. Struktur und Dynamik der Lexik. Bochum: Brockmeyer.
Köhler, Reinhard. 2005. Synergetic linguistics. In Reinhard Köhler, Gabriel Altmann & Rajmund G. Piotrowski (eds.), Quantitative Linguistics. An International Handbook, 760–774. Berlin & New York: de Gruyter.
Wimmer, Gejza & Gabriel Altmann. 2005. Unified derivation of some linguistic laws. In Reinhard Köhler, Gabriel Altmann & Rajmund G. Piotrowski (eds.), Quantitative Linguistics. An International Handbook, 791–807. Berlin & New York: de Gruyter.
Sergey N. Andreev and Vadim V. Borisov
Linguistic Analysis Based on Fuzzy Similarity Models

1 Introduction

The modern state of research in linguistics is characterized by the development and active use of various mathematical methods and models, as well as their combinations, to solve such complex problems as text attribution, automatic gender classification, and the classification of texts or individual styles (Juola 2006, Hoover 2007, Mikros 2009, Rudman 2002). Among the important and defining features of linguistic research is the necessity to take into account uncertainty, the heuristic representation and subjectivity of estimation of the analyzed information, and the heterogeneity and different measurement scales of the characteristics. All this leads to the necessity of using data mining methods, primarily methods of fuzzy analysis and modeling.

In the empirical sciences the following possibilities of building fuzzy models exist. The first approach consists in the adaptation of fuzzy models which were built to solve problems in other spheres (technical diagnostics, decision making support models, complex systems analysis). The advantages of this approach include the use of the accumulated experience and scientific potential. At present this approach prevails in the empirical sciences. The second approach presupposes the introduction of fuzziness into already existing models in order to take into account various types of uncertainty. Despite the fact that this approach is rather effective, difficulties arise in the introduction of fuzziness into the models. The third approach implies the creation of fuzzy models aimed at the solution of the problems of the given branch of empirical science under conditions of uncertainty. In our opinion this approach is preferable, though at the same time it is the most difficult, due to the necessity to consider both the specific aspects of the problems and the characteristics of the developed fuzzy models (Borisov 2014).

This chapter deals with some issues of building fuzzy models within the third approach, using as an example fuzzy models for the estimation of the similarity of linguistic objects. The use of such models forms the basis for solving several problems of linguistic analysis, including those considered in this chapter.
Solutions of the following problems of linguistic analysis using the suggested fuzzy similarity models are considered.

I) Comparison (establishing the degree of similarity) of the original and its translations. The essence of the approach consists in establishing estimated values of the compared linguistic objects, e.g. parts (chapters) of the original and of its translations. In this case the similarity model built for the original text is used. The obtained results of the estimation make it possible not only to rank the compared linguistic objects, but also to establish the degree of their similarity.

II) Comparison of the structure of similarity models for the original text and its translations. The suggested fuzzy similarity models can be built not only for the original but also for its translations. The results of the comparison of the structure of these models are of interest, as they contain information about the interrelations and consistency of the estimated characteristics.

III) Analysis of the dynamics of individual style development. In this case a comparison (establishing the degree of similarity) of consecutive fragments of the same text can be used.
2 Approaches to building similarity models for linguistic analysis

As has been mentioned above, many solutions of various linguistic problems are based on different similarity models, which are aimed at, firstly, the direct estimation of the objects and, secondly, the classification of objects on the basis of different estimation schemes. These tasks are interconnected, though not always interchangeable. A similarity model aimed at solving problems of the first class may be used even for separate linguistic objects (without comparison with other objects). In this case the estimated objects may be ranked. On the other hand, methods of ranking are often not applicable to the estimation of separate objects.

More generally, the problem of multivariate estimation is formulated as follows. There are a number of estimation characteristics pi, i = 1, …, n. There is also a set of estimated linguistic objects A = {a1, …, aj, …, am}. For every aj ∈ A it is necessary to find the values of these characteristics p1(aj), …, pn(aj). In the fuzzy approach the values of these characteristics are expressed as numbers in the range [0, 1] and characterize the degree to which the corresponding criterion is met.
As a rule, similarity models possess a complex structure and are represented as a hierarchy of interconnected characteristics. Consider as an example a two-level system of characteristics. Let the characteristics pi, i = 1, …, n be the partial characteristics of the lower level of the hierarchy, and let P denote the generalized characteristic. Then the value of this generalized characteristic for each of the estimated objects may be obtained by aggregating the values of the partial characteristics (Borisov, Bychkov, Dementyev, Solovyev, Fedulov 2002):

$$\forall a_j \in A: \quad P(a_j) = h\big(p_1(a_j), \ldots, p_n(a_j)\big),$$

where h is a mapping satisfying the following axioms:

А1) $h(0, \ldots, 0) = 0$, $h(1, \ldots, 1) = 1$ – borderline conditions.

А2) For any pairs $\big(p_i(a_j), p'_i(a_j)\big) \in [0, 1]^2$: if $\forall i \; p_i(a_j) \geq p'_i(a_j)$, then $h\big(p_1(a_j), \ldots, p_n(a_j)\big) \geq h\big(p'_1(a_j), \ldots, p'_n(a_j)\big)$ – non-diminishing axiom.

А3) h is a symmetric function of its arguments, i.e. its value is the same under any permutation of its arguments.

А4) h is a continuous function (Dubois, Prade 1990).

Thus the task of estimating linguistic objects may conventionally be divided into two stages: firstly, obtaining the estimation of the partial characteristics of the linguistic objects; secondly, obtaining a generalized estimation of the linguistic objects. The first stage, as a rule, is performed in a traditional way. The second stage, obtaining the generalized estimation, is characterized by a number of specific peculiarities. The following main approaches to obtaining the generalized estimation, which actually means building estimation models of different linguistic objects, exist:
– the generalized characteristic is formed under conditions of equality of the partial characteristics;
– the generalized characteristic is formed on the basis of recursive aggregation of the partial characteristics;
– the generalized characteristic is formed under conditions of inequality of the partial characteristics;
– the generalized characteristic is formed on the basis of fuzzy quantifiers used for the aggregation of the partial characteristics.

Let us consider these approaches in detail.
The generalized characteristic is formed under conditions of equality of partial characteristics
In this case the following variants of estimation are possible:
i. the value of the generalized characteristic is determined by the lowest value of all the partial characteristics;
ii. the value of the generalized characteristic is determined by the highest value of all the partial characteristics;
iii. compromise strategies;
iv. hybrid strategies (Dubois, Prade 1988).

For variant (i), in addition to axioms А1)–А4), the following axiom must be fulfilled:

А5) $\forall p_i(a_j): \; h\big(p_1(a_j), \ldots, p_n(a_j)\big) \leq \min\big(p_1(a_j), \ldots, p_n(a_j)\big),$

i.e. the value of the generalized characteristic must not exceed the minimal value of any of the partial characteristics.

For variant (ii), in addition to axioms А1)–А4), the following axiom must be fulfilled:

А6) $\forall p_i(a_j): \; h\big(p_1(a_j), \ldots, p_n(a_j)\big) \geq \max\big(p_1(a_j), \ldots, p_n(a_j)\big),$

i.e. the value of the generalized characteristic must not be less than the maximal value of any of the partial characteristics.

For the compromise strategies of variant (iii), in addition to axioms А1)–А4), the following axiom must be fulfilled:

А7) $\forall p_i(a_j): \; \min\big(p_1(a_j), \ldots, p_n(a_j)\big) < h\big(p_1(a_j), \ldots, p_n(a_j)\big) < \max\big(p_1(a_j), \ldots, p_n(a_j)\big),$

i.e. the value of the generalized characteristic lies between the values of the partial characteristics.

For the hybrid strategy (variant (iv)) the value of the generalized characteristic may be obtained by aggregating the values of the partial characteristics on the basis of a symmetric sum such as the median $\mathrm{med}\big(p_1(a_j), p_2(a_j); a\big)$, $a \in [0, 1]$.

Another variant of the hybrid strategy consists in using associative symmetric sums (excluding the median):

if $\max\big(p_1(a_j), \ldots, p_n(a_j)\big) < \tfrac{1}{2}$, then $h\big(p_1(a_j), \ldots, p_n(a_j)\big) \leq \min\big(p_1(a_j), \ldots, p_n(a_j)\big)$;

if $\min\big(p_1(a_j), \ldots, p_n(a_j)\big) > \tfrac{1}{2}$, then $h\big(p_1(a_j), \ldots, p_n(a_j)\big) \geq \max\big(p_1(a_j), \ldots, p_n(a_j)\big)$.

This approach makes it possible, under the condition of equality of the partial characteristics, to identify adequate operations for their aggregation.
Establishing (identifying) adequate aggregation operations which embrace the whole set of hybrid strategies between the "opposite" variants (i) and (ii) provides a parameterized family of operations of the following type:

$$h\big(p_1(a_j), \ldots, p_n(a_j)\big) = \Big[I\big(p_1(a_j), \ldots, p_n(a_j)\big)\Big]^{\gamma} \cdot \Big[U\big(p_1(a_j), \ldots, p_n(a_j)\big)\Big]^{1-\gamma}, \quad \gamma \in [0, 1],$$

where I and U are some operations of intersection and union, respectively, and γ is an index characterizing the degree of compromise between variants (i) and (ii) (Zimmermann, Zysno 1980).
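The following minimal sketch illustrates such a compensatory operator. It assumes, purely for the sake of the example, the product as the intersection I and the probabilistic sum as the union U; other t-norm/t-conorm pairs could be substituted.

```python
import numpy as np

def gamma_operator(p, gamma):
    """Compensatory aggregation in the spirit of Zimmermann & Zysno (1980).

    p      -- sequence of partial evaluations in [0, 1]
    gamma  -- degree of compromise; with this form, gamma = 1 reproduces the
              purely conjunctive variant (i), gamma = 0 the disjunctive variant (ii)
    """
    p = np.asarray(p, dtype=float)
    intersection = np.prod(p)          # I(p1, ..., pn), taken here as the product
    union = 1.0 - np.prod(1.0 - p)     # U(p1, ..., pn), probabilistic sum
    return intersection**gamma * union**(1.0 - gamma)

print(gamma_operator([0.3, 0.8, 0.6], gamma=0.5))
```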
The generalized characteristic is formed on the basis of recursive aggregation of partial characteristics

Often it is difficult to carry out the estimation of all partial characteristics at once, due to the dimensionality of the estimation problem. In such cases the operation of obtaining a generalized estimation can be realized in a recursive way for a number of equivalent partial characteristics n > 2, provided the commutativity axiom A3 is fulfilled (Dubois, Prade 1988):

$$h^{(n)}\big(p_1(a_j), \ldots, p_n(a_j)\big) = h\Big(h^{(n-1)}\big(p_1(a_j), \ldots, p_{n-1}(a_j)\big), \; p_n(a_j)\Big).$$
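In code, this recursion is simply a left fold over the partial characteristics; a minimal sketch with a two-place operation h:

```python
from functools import reduce

def aggregate_recursively(h, values):
    """Recursive aggregation h^(n)(p1, ..., pn) = h(h^(n-1)(p1, ..., p_{n-1}), pn),
    realized as a left fold with a two-place operation h (assumed symmetric,
    as required by axiom A3)."""
    return reduce(h, values)

# example with min as the two-place operation
print(aggregate_recursively(min, [0.4, 0.7, 0.2, 0.9]))  # -> 0.2
```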
The generalized characteristic is formed under the condition of non-equality of partial characteristics

Within this approach the following variants of estimation are possible:
i. introduction of borderline values for the partial characteristics;
ii. weighting the partial characteristics;
iii. setting an asymmetric generalized characteristic.

In case variant (i) is applied, borderline values for the partial characteristics are set, the fulfillment of which determines the acceptability of the generalized estimation of the object under study. If variant (ii) is chosen, every partial characteristic is assigned a weight, with subsequent aggregation of the weighted partial characteristics using aggregation operations. The most widespread is the weighted sum of the partial characteristics:

$$P(a_j) = \sum_{i=1}^{n} w_i \, p_i(a_j), \qquad \sum_{i=1}^{n} w_i = 1.$$
However, the possibility of assigning the weights wi is problematic if these weights belong to characteristics of a different nature. In this case it is feasible to use relative values of these characteristics in the latter equation.
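A small sketch of this weighted-sum aggregation; rescaling heterogeneous raw values to relative values (here simply by dividing by hypothetical maxima) is one possible reading of the normalization suggested above:

```python
import numpy as np

def weighted_sum(values, weights):
    """Weighted-sum aggregation P(aj) = sum_i w_i * p_i(aj) with sum_i w_i = 1."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()      # enforce the constraint sum(w) = 1
    return float(np.dot(weights, values))

# hypothetical raw counts rescaled against their maxima before weighting
raw = np.array([12.0, 7.0, 3.0])
maxima = np.array([20.0, 10.0, 5.0])
print(weighted_sum(raw / maxima, weights=[2, 1, 1]))
```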
The procedure of weighting the partial characteristics can also be applied to other types of aggregation operations. Variant (iii), setting an asymmetric generalized characteristic, is characterized by a complicated structure and by the necessity to aggregate the partial characteristics at the lowest level of the hierarchy using, e.g., an "AND-OR" tree.

The generalized characteristic is formed on the basis of fuzzy quantifiers during aggregation of partial characteristics

These fuzzy quantifiers specify the estimation of achieving acceptable results, such as "majority", "few", "not less than a half". Fuzzy quantifiers may also provide the estimation of the extent to which the borderline values of the partial characteristics (or of most of the partial characteristics) which define the acceptability of the result are reached. In this case, besides a fuzzy quantifier, weights of these partial characteristics are assigned (Dubois, Prade 1990).

Actually, the importance of choosing the best approach to the formation of the generalized characteristic becomes more obvious when some generalized estimation, in its turn, is combined with other generalized estimations (Keeney, Raiffa 1981). Despite the large number of existing approaches to building similarity models, most of them do not take into account the consistency of the partial heterogeneous characteristics in the aggregation operation. As a result, they do not permit building adequate similarity models with a complex structure for solving linguistic tasks.
3 Method and fuzzy similarity models for solving linguistic problems

Consider the suggested method and fuzzy similarity models for solving the above-mentioned linguistic problems:
– comparison (establishing the degree of similarity) of the original and its translations;
– comparison of the structure of similarity models for the original text and its translations;
– analysis of the dynamics of individual style development (establishing the degree of similarity) of fragments of a text.
We shall show the solution of these problems on the data of the poem by Samuel Coleridge The Rime of the Ancient Mariner and its two translations by Nikolay Gumilev and Wilhelm Levik.

The main demands on similarity models for solving tasks of linguistic analysis include the following:
– the possibility to form a generalized characteristic on the basis of changing sets of partial characteristics;
– the possibility to aggregate heterogeneous characteristics (both quantitative and qualitative) which differ in measurement scales and range of variation;
– taking into consideration the different significance of the partial characteristics in the generalized characteristic;
– taking into consideration the consistency of the partial characteristics;
– the possibility to adjust (adapt) the model in a flexible way to changes in the number of characteristics (addition, exclusion) and to changes in the characteristics themselves (in consistency and significance).

The results of the analysis of the existing approaches to building similarity models for linguistic analysis enable us to conclude that building fuzzy similarity models is the most feasible method. The suggested method and model allow defining the operation h(p1, …, pn) for obtaining the generalized estimation P on the basis of operations over the partial characteristics h_{k,l}(p_k, p_l), k, l ∈ {1, …, n}, k ≠ l (for two-place operations), and of the weights of these partial characteristics w_i (i = 1, …, n, Σ_{i=1}^{n} w_i = 1).

For the identification of the operations h_{k,l}(p_k, p_l), k, l ∈ {1, …, n}, k ≠ l, over the partial characteristics p_k and p_l, the pairwise degrees of their consistency c_{k,l} (k, l = 1, …, n) must be established. Depending on the specifics of the estimation task in question, consistency may be treated as correlation, as interdependence of the partial characteristics, or as the possibility of obtaining exact values of the compared characteristics. As different methods may be used to establish the degrees of consistency c_{k,l} (k, l = 1, …, n) of the partial characteristics, it is feasible to specify a set C̃ whose elements define criterion levels of consistency of these characteristics, in ascending order: C̃ = {NC – «No consistency», LC – «Low consistency», MC – «Medium consistency», HC – «High consistency», FC – «Full consistency»}. For example, NC – «No consistency» can be treated as the minimum of the correlation coefficient values between partial characteristics, and FC – «Full consistency» as the maximum of the correlation coefficient values between partial characteristics. The other values LC, MC, HC correspond to intermediate values of the correlation coefficient between partial characteristics.
Then the obtained pairwise consistency degrees for the partial characteristics c_{k,l} (k, l = 1, …, n), regardless of the method with which they were obtained, are compared with the criterion levels of consistency from the set C̃:

$$c_{k,l} \;\Leftrightarrow\; \tilde{c}_i \in \tilde{C} = \{NC, LC, MC, HC, FC\}.$$
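A minimal sketch of such a comparison for the case where consistency is measured by correlation; the numeric cut-offs below are hypothetical, since the chapter only fixes the extreme levels NC and FC:

```python
def consistency_level(r, thresholds=(0.2, 0.4, 0.6, 0.8)):
    """Map a pairwise consistency degree (here: an absolute Pearson correlation)
    onto the criterion levels NC < LC < MC < HC < FC.

    The thresholds are illustrative assumptions, not values from the chapter.
    """
    levels = ("NC", "LC", "MC", "HC", "FC")
    r = abs(r)
    for level, t in zip(levels, thresholds):
        if r < t:
            return level
    return levels[-1]

print(consistency_level(0.65))  # -> 'HC' under the illustrative thresholds
```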
Obviously, as a result of such a comparison, the whole set of partial characteristics will be divided into subsets, each of them corresponding to one criterion consistency level.

In their turn, all criterion levels of consistency of partial characteristics from the set C̃ = {NC, LC, MC, HC, FC} are matched with operations for the aggregation of these characteristics which satisfy the above-mentioned axioms: normalization, non-diminishing, continuity, bisymmetry (associativity). The results of the study substantiated a set of operations for the aggregation of partial characteristics which both satisfies the above-mentioned axioms and corresponds to the given criterion levels of characteristics consistency (Table 1, Fig. 1).

Table 1: Comparison of aggregation operations with criterion levels of characteristics consistency

№   Operation of aggregation of pk and pl   Criterion level of consistency   Description of criterion level
1   min(pk, pl)                             NC                               No consistency
2   med(pk, pl, 0.25)                       LC                               Low consistency
3   med(pk, pl, 0.5)                        MC                               Medium consistency
4   med(pk, pl, 0.75)                       HC                               High consistency
5   max(pk, pl)                             FC                               Full consistency
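A short sketch of these operations, assuming that med(p, q, a) denotes the median of the three values p, q and a (a reading that is consistent with the numerical results reported in Table 6 below):

```python
import statistics

def med(p, q, a):
    """Median-type compromise operation: the median of {p, q, a}."""
    return statistics.median([p, q, a])

# aggregation operation associated with each criterion consistency level (Table 1)
AGGREGATION_BY_LEVEL = {
    "NC": min,
    "LC": lambda p, q: med(p, q, 0.25),
    "MC": lambda p, q: med(p, q, 0.5),
    "HC": lambda p, q: med(p, q, 0.75),
    "FC": max,
}

print(AGGREGATION_BY_LEVEL["HC"](0.061, 0.203))  # med(0.061, 0.203, 0.75) = 0.203
```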
Till now we have spoken of a similarity model on the supposition that it has a two-level hierarchical structure, consisting of the generalized characteristic and the partial characteristics. However, the system of characteristics may be organized in a more complex structure, viz. the partial characteristics may be separated into groups which in their turn are divided into subgroups.1 Thus, as a result of grouping the characteristics, the structure of the similarity model may include three or even more levels.
1 The approach, suggested in this chapter, makes it possible to build similarity models both with and without a priori grouping of the characteristics. In both cases all the provisions of the method and similarity models remain valid.
Fig. 1: Operations of characteristics aggregation
For solving different problems of the linguistic analysis of verse, the following groupings of characteristics may be used:
− according to linguistic levels:
  − syntactic characteristics;
  − characteristics of poetical syntax;
  − morphological characteristics;
  − rhythmic characteristics;
  − characteristics of rhyme;
  − characteristics of stanza (strophic);
− grouping based on the localization in the verse line:
  − localized:
    − in the first metrically strong position (1st ictus) of the line:
      − part of speech characteristics;
      − rhythmic characteristics;
    − in the final metrically strong position (last ictus) of the line:
      − part of speech characteristics;
      − rhythmic characteristics;
  − unlocalized:
    − syntactic characteristics;
    − characteristics of poetical syntax;
    − part of speech characteristics;
    − strophic characteristics.

In the case of grouping, the estimation is found for each group (subgroup) of characteristics, taking into account the consistency of these characteristics within the group (subgroup). The result of the estimation of the characteristics in each group (subgroup) can have a relevance of its own. But, besides this, the results of the estimation in all groups of characteristics can be aggregated to achieve a generalized estimation (taking into account the degree of consistency between these groups). The procedure of obtaining such a generalized estimation is analogous to that of obtaining the estimation for one group.

Consider as an example the building of a similarity model for the group of part of speech characteristics of the poem The Rime of the Ancient Mariner by S.T. Coleridge. The group of these part-of-speech characteristics can be subdivided into three subgroups:

(i) subgroup of parts of speech, localized at the end of the line and rhymed:
– p1 – number of rhymed nouns;
– p2 – number of rhymed verbs;
– p3 – number of rhymed adjectives;
– p4 – number of rhymed adverbs;
– p5 – number of rhymed pronouns;
(ii) subgroup of parts of speech, localized at the end of the line and not rhymed:
– p6 – number of unrhymed nouns;
– p7 – number of unrhymed verbs;
– p8 – number of unrhymed adjectives;
– p9 – number of unrhymed adverbs;
– p10 – number of unrhymed pronouns;

(iii) subgroup of parts of speech which are not localized at some position in the line:
– p11 – number of nouns;
– p12 – number of verbs;
– p13 – number of adjectives;
– p14 – number of adverbs.

To establish the degree of consistency of the characteristics in the above-mentioned subgroups we shall use correlation analysis (the Pearson correlation coefficient). Table 2 contains the results of comparing the values of the correlation coefficient of the part of speech localized rhymed characteristics from subgroup (i) with the criterion levels of consistency from the set C̃ = {NC, LC, MC, HC, FC}, represented as a consistency matrix.

Table 2: Consistency matrix of part of speech localized rhymed characteristics in the poem The Rime of the Ancient Mariner by S.T. Coleridge.

      p1    p2    p3    p4    p5
p1    –     HC    LC    MC    LC
p2    HC    –     LC    MC    NC
p3    LC    LC    –     HC    MC
p4    MC    MC    HC    –     MC
p5    LC    NC    MC    MC    –
In the same way consistency matrices for localized unrhymed (Table 3) and unlocalized (Table 4) part of speech characteristics are formed.
Table 3: Consistency matrix of part of speech localized unrhymed characteristics in the poem The Rime of the Ancient Mariner by S.T. Coleridge.

      p6    p7    p8    p9    p10
p6    –     MC    LC    LC    HC
p7    MC    –     LC    LC    MC
p8    LC    LC    –     MC    MC
p9    LC    LC    MC    –     LC
p10   HC    MC    MC    LC    –
Table 4: Consistency matrix of part of speech unlocalized characteristics in the poem The Rime of the Ancient Mariner by S.T. Coleridge.

      p11   p12   p13   p14
p11   –     MC    MC    LC
p12   MC    –     LC    MC
p13   MC    LC    –     LC
p14   LC    MC    LC    –
It should be noted that the consistency matrices in Tables 2–4 may be treated as fuzzy relations defined on the corresponding sets of characteristics (Lee 2004).

The requirement of using different significances of the partial characteristics in every group (or subgroup) is fulfilled by the introduction of a weight vector:

$$W = \{w_1, w_2, \ldots, w_n\}, \quad \forall i: w_i \in [0, 1], \quad \sum_{i=1}^{n} w_i = 1.$$

The weights {w1, w2, …, wn} may be:
– defined during the experiment;
– assigned by an expert;
– obtained from the data of pairwise comparisons of the characteristics' significance, also carried out by an expert.
The latter approach to establishing the weights of the characteristics is the most convenient, because it is focused on the comparison of only one pair of objects at a time, with subsequent processing of the results of the pairwise comparisons. This radically simplifies the task in the case of a large number of characteristics to be compared.

Consider obtaining the weight vector on the basis of the method of pairwise comparison of characteristics' significance (Saaty 1980). At first, on the basis of pairwise comparisons of the significance of the characteristics from a certain group (subgroup), a positive, reciprocally symmetric matrix (Table 5) is formed, in which the element vi/vj denotes the degree of significance of the i-th characteristic as compared to the j-th characteristic.

Table 5: Matrix of pairwise comparisons of the significance of the characteristics from a group.

Characteristic number   1       2       …   i       …   n
1                       1       v1/v2   …   v1/vi   …   v1/vn
2                       v2/v1   1       …   v2/vi   …   v2/vn
…                       …       …       …   …       …   …
i                       vi/v1   vi/v2   …   1       …   vi/vn
…                       …       …       …   …       …   …
n                       vn/v1   vn/v2   …   vn/vi   …   1
To define the values of vi and vj, the following estimations of degrees of significance are used in the table: 1 – equally significant; 3 – a little more significant; 5 – sufficiently more significant; 7 – much more significant; 9 – absolutely more significant; 2, 4, 6, 8 – intermediate degrees of significance. Then, on the basis of these pairwise comparisons, the weights of the characteristics in the group (subgroup) are calculated:

$$w_i = \frac{\left(\prod_{l=1}^{n} \dfrac{v_i}{v_l}\right)^{1/n}}{\sum_{k=1}^{n} \left(\prod_{l=1}^{n} \dfrac{v_k}{v_l}\right)^{1/n}}.$$
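A brief sketch of this calculation (the geometric-mean method over the rows of the pairwise-comparison matrix); the 3 × 3 matrix below is a hypothetical example on Saaty's 1–9 scale, not data from the chapter:

```python
import numpy as np

def saaty_weights(M):
    """Weights from a reciprocal pairwise-comparison matrix M (M[i][j] = vi/vj):
    the n-th root of each row product, normalized to sum to 1."""
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    row_geo_means = np.prod(M, axis=1) ** (1.0 / n)
    return row_geo_means / row_geo_means.sum()

M = [[1,   3,   5],
     [1/3, 1,   2],
     [1/5, 1/2, 1]]
print(saaty_weights(M).round(2))  # -> [0.65 0.23 0.12]
```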
For the analyzed example the following weights of the characteristics were obtained using the above-mentioned method:
– for the subgroup of part of speech localized rhymed features: {w1 = 0.12, w2 = 0.29, w3 = 0.25, w4 = 0.15, w5 = 0.19};
– for the subgroup of part of speech localized unrhymed features: {w6 = 0.11, w7 = 0.30, w8 = 0.23, w9 = 0.14, w10 = 0.22};
– for the subgroup of part of speech unlocalized features: {w11 = 0.25, w12 = 0.3, w13 = 0.3, w14 = 0.15}.
The results of the estimation of consistency and significance of the characteristics within each group (subgroup) may be structurally represented as a fuzzy undirected graph G̃ with fuzzy nodes and fuzzy edges:

$$\tilde{G} = \big(\tilde{P}, \tilde{R}\big),$$

where $\tilde{P} = \left\{\left(\dfrac{p_i}{w_i}\right)\right\}$, $i \in \{1, \ldots, n\}$, $p_i \in P$ is the fuzzy set of vertices, each node $p_i$ being weighted by the corresponding value $w_i \in [0, 1]$; and $\tilde{R} = \left\{\left(\dfrac{(p_k, p_l)}{\tilde{c}_i}\right)\right\}$, $k, l = 1, \ldots, n$, $\tilde{c}_i \in \tilde{C} = \{NC, LC, MC, HC, FC\}$ is the fuzzy set of edges, each edge $(p_k, p_l)$ corresponding to a certain criterion consistency level $\tilde{c}_i$.

Figure 2 shows the fuzzy undirected graph which corresponds to the results of the estimation of consistency and significance in the subgroup of part of speech localized rhymed characteristics in The Rime of the Ancient Mariner by Coleridge.

Fig. 2: Fuzzy undirected graph which corresponds to the results of the estimation of consistency and significance in the subgroup of part of speech localized rhymed characteristics {p1, p2, p3, p4, p5} in The Rime of the Ancient Mariner by Coleridge.
In the same way, fuzzy graphs corresponding to the results of the estimation of consistency and significance in the other groups (subgroups) can be constructed.

Consider the method of building a similarity model on the material of the above-mentioned subgroup of localized rhymed part of speech characteristics {p1, p2, p3, p4, p5}. The suggested method is based on an alternate search in the fuzzy graph for complete subgraphs G̃′ = (P̃′, R̃′) whose arcs correspond to one of the criterion levels of consistency c̃_i ∈ C̃ = {NC, LC, MC, HC, FC}. The search order of such
subgraphs is determined by the order in which the levels of characteristics consistency are considered. There are two orders of proceeding when considering the levels of characteristics consistency:
– firstly, in the direction from the lowest to higher degrees of consistency;
– secondly, from the most consistent to less consistent characteristics.

The order of proceeding from the lowest to higher degrees of consistency allows one «not to lose» «good» estimations of badly consistent characteristics, because the aggregation of the characteristics is in this case usually aimed at the type of estimation in which the value of the generalized characteristic is defined by the worst value of the partial characteristics. The order of proceeding from the most consistent to less consistent characteristics makes it possible to take into account «bad» estimations of highly consistent characteristics, whose aggregation is based on the type of estimation in which the value of the generalized characteristic is defined by the highest value of the partial characteristics (Zernov 2007).

Consider the alternate search for subgraphs G̃′ = (P̃′, R̃′) following the procedure from the most consistent to less consistent characteristics in the subgroup {p1, p2, p3, p4, p5}. The procedure starts with finding the highest level of consistency for the characteristics {p1, p2, p3, p4, p5} (Fig. 2). The pairs of characteristics p1 and p2, and also p3 and p4, have HC «High consistency», the highest level in the fuzzy graph in Figure 2. These pairs of nodes form two complete subgraphs of maximum size. In case there are several such subgraphs, these complete subgraphs of maximum size are sorted in the order of decreasing total weight of their nodes (another method of analysis of the subgraphs can also be applied). In Fig. 3a only those edges of the fuzzy graph are left which correspond to the given level of consistency HC. Considering these subgraphs in succession, we merge all the nodes of each subgraph into one on the basis of the aggregation operation corresponding to the given level of consistency. If the aggregation operation is not a multi-place operation h(p1, …, pq), q ∈ {1, …, n}, but a two-place non-associative operation h(pk, pl), k, l ∈ {1, …, n}, k ≠ l, then the order of enumeration of the characteristics pk and pl may be set, e.g., according to decreasing weights.
In our case of a bisymmetric operation it is also possible to start with the nodes of the largest weight. Then, supposing that $w_{i+1} \leq w_i$, $i \in \{1, \ldots, q-1\}$, we get:

$$h^{*}\big(p_1, \ldots, p_q\big) = h\Big(h\big(\ldots \big(h(p_1, p_2), \ldots\big), p_{q-1}\big), \; p_q\Big).$$
For aggregating the characteristics p1 and p2, and also p3 and p4, the operations med(p1, p2; 0.75) and med(p3, p4; 0.75) from Table 1, corresponding to the specified level of consistency HC, are chosen. Figure 3b shows the fuzzy graph obtained by merging the nodes p1 and p2, and p3 and p4, using the above-mentioned operations. The weights of the new nodes of the fuzzy graph are determined by summing the weights of the merged nodes.
Fig. 3: (a) Subgraphs of the fuzzy graph which correspond to the HC "High consistency" level; (b) the fuzzy graph with merged nodes (p1, p2) and (p3, p4), obtained by using the operations med(p1, p2; 0.75) and med(p3, p4; 0.75), respectively.
Then the edges connected with the merged nodes are removed from the set of edges of the fuzzy graph. Afterwards, levels of consistency are assigned to the edges adjacent to the merged nodes. These levels of consistency are defined according to the strategy of estimation which was chosen (see the classification of approaches to building similarity models for linguistic analysis above). In the given example we choose the type of estimation in which the value of the generalized characteristic is determined by
the worst values of the partial characteristics. The level of consistency established in this way is shown in Figure 3b.

Then the procedure is repeated. At the next step, the next highest level of consistency of the characteristics (nodes of the fuzzy graph) out of the set {(p1, p2), (p3, p4), p5} is found. It is obvious that the pair of characteristics (p3, p4) and p5 demonstrates the highest remaining level of consistency, MC – «Medium consistency». To aggregate these characteristics, according to Table 1 the operation med(med(p3, p4; 0.75), p5; 0.5) is used. Figure 4 shows the fuzzy graph obtained after this further merging of nodes. At the final stage of identification, the aggregation of the characteristics (merged nodes of the fuzzy graph) (p1, p2) and ((p3, p4), p5) is carried out according to the level of consistency NC – «No consistency».
Fig. 4: Fuzzy graph after one of the successive node-merging operations
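The node-merging procedure illustrated in Figs. 2–4 can be sketched in code as follows. This is a simplified reading, not the authors' implementation: node weights and partial estimations are kept in dictionaries, edges carry consistency levels, and a re-attached edge is assumed to take the worst (least consistent) of the levels it replaces, in line with the estimation strategy chosen above.

```python
import statistics

def med(p, q, a):
    # median-type compromise operation: the median of {p, q, a}
    return statistics.median([p, q, a])

OPS = {"NC": min,
       "LC": lambda p, q: med(p, q, 0.25),
       "MC": lambda p, q: med(p, q, 0.5),
       "HC": lambda p, q: med(p, q, 0.75),
       "FC": max}
ORDER = ["FC", "HC", "MC", "LC", "NC"]  # from most to least consistent

def build_model(values, weights, edges):
    """Merge the nodes of a fuzzy consistency graph into a single estimation.

    values  -- dict: node name -> partial estimation in [0, 1]
    weights -- dict: node name -> node weight
    edges   -- dict: frozenset({u, v}) -> consistency level ('NC' ... 'FC')
    """
    nodes, weights, edges = dict(values), dict(weights), dict(edges)
    while len(nodes) > 1:
        # most consistent remaining pair; ties broken by larger total node weight
        pair = max(edges, key=lambda e: (-ORDER.index(edges[e]),
                                         sum(weights[n] for n in e)))
        a, b = sorted(pair, key=lambda n: -weights[n])   # larger weight first
        merged = f"({a},{b})"
        nodes[merged] = OPS[edges.pop(pair)](nodes.pop(a), nodes.pop(b))
        weights[merged] = weights.pop(a) + weights.pop(b)
        # re-attach the remaining edges, keeping the less consistent level
        for e in list(edges):
            if a in e or b in e:
                other = next(n for n in e if n not in (a, b))
                lvl = edges.pop(e)
                key = frozenset((merged, other))
                if key in edges and ORDER.index(edges[key]) > ORDER.index(lvl):
                    lvl = edges[key]
                edges[key] = lvl
    return next(iter(nodes.values()))
```

Applied to the consistency levels of Table 2 and the weights given above, this merging order reproduces the nested expression for P_LR derived next.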
Thus the similarity model of the subgroup of part of speech localized rhymed characteristics for the poem The Rime of the Ancient Mariner can be expressed as follows:

$$P_{LR} = \min\Big(\mathrm{med}(p_1, p_2; 0.75),\; \mathrm{med}\big(\mathrm{med}(p_3, p_4; 0.75),\; p_5;\; 0.5\big)\Big).$$
Similarity models for the two other subgroups of characteristics are built in the same way. Figure 5 shows the fuzzy undirected graph which corresponds to the results of the estimation of consistency and significance in the subgroup of part of speech localized unrhymed characteristics {p6, p7, p8, p9, p10} for the poem The Rime of the Ancient Mariner.
Fig. 5: Fuzzy undirected graph which corresponds to the results of the estimation of consistency and significance in the subgroup of part of speech localized unrhymed characteristics {p6, p7, p8, p9, p10} for the poem The Rime of the Ancient Mariner.
Having repeated all the steps of this method, we get the similarity model for the subgroup of part of speech localized unrhymed characteristics as follows:

$$P_{LuR} = \mathrm{med}\Big(\mathrm{med}\big(\mathrm{med}(p_6, p_{10}; 0.75),\; p_7;\; 0.5\big),\; \mathrm{med}(p_8, p_9; 0.5);\; 0.25\Big).$$
Fig. 6: Fuzzy undirected graph which corresponds to the results of the estimation of consistency and significance in the subgroup of part of speech unlocalized characteristics {p11, p12, p13, p14} for the poem The Rime of the Ancient Mariner.
Figure 6 shows the fuzzy undirected graph which corresponds to the results of the estimation of consistency and significance in the subgroup of part of speech unlocalized characteristics {p11, p12, p13, p14} for the poem The Rime of the Ancient Mariner. The similarity model for the subgroup of part of speech unlocalized characteristics may be presented as follows:

$$P_{uL} = \mathrm{med}\Big(\mathrm{med}\big(\mathrm{med}(p_{11}, p_{12}; 0.5),\; p_{13};\; 0.5\big),\; p_{14};\; 0.5\Big).$$
The results of the estimation of linguistic objects using the different subgroups of characteristics may be useful in themselves. In this case the objects will be estimated on the basis of a vector of estimations, every element of which is the estimation obtained with the help of the similarity model for the corresponding subgroup of characteristics. For example, the vector of part of speech (morphological) characteristics of some linguistic object aj, j = 1, …, m includes the estimations obtained with the help of the similarity models for the corresponding subgroups of part of speech characteristics P_LR, P_LuR and P_uL:

$$\big(P_{LR}(a_j),\; P_{LuR}(a_j),\; P_{uL}(a_j)\big);$$

$$P_{LR}(a_j) = \min\Big(\mathrm{med}\big(p_1(a_j), p_2(a_j); 0.75\big),\; \mathrm{med}\big(\mathrm{med}(p_3(a_j), p_4(a_j); 0.75),\; p_5(a_j);\; 0.5\big)\Big);$$

$$P_{LuR}(a_j) = \mathrm{med}\Big(\mathrm{med}\big(\mathrm{med}(p_6(a_j), p_{10}(a_j); 0.75),\; p_7(a_j);\; 0.5\big),\; \mathrm{med}\big(p_8(a_j), p_9(a_j); 0.5\big);\; 0.25\Big);$$

$$P_{uL}(a_j) = \mathrm{med}\Big(\mathrm{med}\big(\mathrm{med}(p_{11}(a_j), p_{12}(a_j); 0.5),\; p_{13}(a_j);\; 0.5\big),\; p_{14}(a_j);\; 0.5\Big),$$
where P_LR(aj) is the estimation of the linguistic object aj using the subgroup of part of speech localized rhymed characteristics, obtained with the help of the similarity model P_LR; P_LuR(aj) is the estimation of the object aj using the subgroup of part of speech localized unrhymed characteristics, obtained with the help of the similarity model P_LuR; and P_uL(aj) is the estimation of the object aj using the subgroup of part of speech unlocalized characteristics, obtained with the help of the model P_uL.2

2 The significance of the characteristics is taken into account directly during the calculation of the estimation of a particular linguistic object using the estimation models.

On the other hand, the estimation results obtained for the above-mentioned subgroups of characteristics may be aggregated into a generalized estimation for the whole group of part of speech characteristics. In order to do this, according to the suggested method it is necessary to build a similarity model in which P_LR, P_LuR and P_uL, in their turn, function as "partial" characteristics, and P_PoS is a generalized characteristic which is formed taking into account the consistency and significance of these "partial" characteristics:

$$P_{PoS} = H^{*}\big(P_{LR}, P_{LuR}, P_{uL}\big),$$

where H* is the notation for the operation (set of operations) of aggregation, identified in accordance with the consistency levels of P_LR, P_LuR and P_uL. Obviously, the estimation of the consistency levels between P_LR, P_LuR and P_uL, as well as the setting of their significance coefficients, is carried out by experts. For the estimation of the consistency between P_LR, P_LuR and P_uL it is possible to use a fuzzy cognitive model (Borisov, Fedulov 2004), and in order to set their significance – the method of pairwise comparison (Saaty 1980).

Let us assume that P_LR, P_LuR and P_uL correspond to the consistency level "High consistency", and that the following significance coefficients are assigned to them: w_LR = 0.35; w_LuR = 0.25; w_uL = 0.40. Then the similarity model for the whole group of part of speech characteristics of the poem by Coleridge The Rime of the Ancient Mariner may be represented in the following form:

$$P_{PoS} = \mathrm{med}\Big(\mathrm{med}\big(P_{LR}, P_{uL}; 0.75\big),\; P_{LuR};\; 0.75\Big).$$
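To make the whole hierarchy concrete, the following sketch evaluates the nested model for one set of subgroup values. Assuming again that med denotes the median of its three arguments, feeding in the Part 1 subgroup estimations for the original from Table 6 (P_LR = 0.061, P_LuR = 0.250, P_uL = 0.203) returns P_PoS = 0.250, the value reported there.

```python
import statistics

def med(p, q, a):
    # median-type compromise operation: the median of {p, q, a}
    return statistics.median([p, q, a])

def p_pos(p_lr, p_lur, p_ul):
    """Generalized part-of-speech estimation
    P_PoS = med(med(P_LR, P_uL; 0.75), P_LuR; 0.75)."""
    return med(med(p_lr, p_ul, 0.75), p_lur, 0.75)

print(p_pos(0.061, 0.250, 0.203))  # -> 0.25 (Part 1 of the original, Table 6)
```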
Consider briefly some approaches to the analysis of linguistic objects using the developed models. Comparison (establishing the degree of similarity) of the original and its translations The essence of the approach consists in establishing estimated values of the compared linguistic objects, e.g. 𝑎𝑎1 , 𝑎𝑎2 , 𝑎𝑎3 – parts (chapters) of the original 2 Significance of characteristics is taken into account directly during the calculation of a particular linguistic object using the estimation models.
Linguistic analysis based on fuzzy similarity models 27
poem The Rime of the Ancient Mariner by Coleridge and of its two translations made by Gumilev and Levik. In this case the model 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 , which was built for the original text, is used. Specific numerical values of characteristics are set: – for the original text by Coleridge – {p1(a1), …, pn(a1)}; – for the translation by Gumilev – {p1(a2), …, pn(a2)}; – for the translation by Levik – {p1(a3), …, pn(a3)}. Then the values of 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 (𝑎𝑎1 ), 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 (𝑎𝑎2 ) and 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 (𝑎𝑎3 ) are calculated using the similarity model built for the group of part of speech characteristics in the original text. 𝑃𝑃𝐿𝐿𝐿𝐿 �𝑎𝑎𝑗𝑗 � = 𝑚𝑚𝑚𝑚𝑚𝑚 � 𝑃𝑃𝐿𝐿𝐿𝐿𝐿𝐿 �𝑎𝑎𝑗𝑗 � = 𝑚𝑚𝑚𝑚𝑚𝑚 � 𝑃𝑃𝑢𝑢𝑢𝑢 �𝑎𝑎𝑗𝑗 � = 𝑚𝑚𝑚𝑚𝑚𝑚 �
�𝑚𝑚𝑚𝑚𝑚𝑚�𝑝𝑝1 �𝑎𝑎𝑗𝑗 �, 𝑝𝑝2 �𝑎𝑎𝑗𝑗 �; 0.75�� ,
�𝑚𝑚𝑚𝑚𝑚𝑚 ��𝑚𝑚𝑚𝑚𝑚𝑚�𝑝𝑝3 �𝑎𝑎𝑗𝑗 �, 𝑝𝑝4 �𝑎𝑎𝑗𝑗 �; 0.75�� , 𝑝𝑝5 �𝑎𝑎𝑗𝑗 �; 0.5��
�
�𝑚𝑚𝑚𝑚𝑚𝑚 ��𝑚𝑚𝑚𝑚𝑚𝑚�𝑝𝑝6 �𝑎𝑎𝑗𝑗 �, 𝑝𝑝10 �𝑎𝑎𝑗𝑗 �; 0.75�� , 𝑝𝑝7 �𝑎𝑎𝑗𝑗 �; 0.5�� , �𝑚𝑚𝑚𝑚𝑚𝑚�𝑝𝑝8 �𝑎𝑎𝑗𝑗 �, 𝑝𝑝9 �𝑎𝑎𝑗𝑗 �; 0.5�� ; 0.25
�
�𝑚𝑚𝑚𝑚𝑚𝑚 ��𝑚𝑚𝑚𝑚𝑚𝑚�𝑝𝑝11 �𝑎𝑎𝑗𝑗 �, 𝑝𝑝12 �𝑎𝑎𝑗𝑗 �; 0.5�� , 𝑝𝑝13 �𝑎𝑎𝑗𝑗 �; 0.5�� , �. 𝑝𝑝14 �𝑎𝑎𝑗𝑗 �; 0.5
𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 (𝑎𝑎𝑖𝑖 ) = 𝑚𝑚𝑚𝑚𝑚𝑚 ��𝑚𝑚𝑚𝑚𝑚𝑚(𝑃𝑃𝐿𝐿𝐿𝐿 (𝑎𝑎𝑖𝑖 ), 𝑃𝑃𝑢𝑢𝑢𝑢 (𝑎𝑎𝑖𝑖 ); 0.75)�, 𝑃𝑃𝐿𝐿𝐿𝐿𝐿𝐿 (𝑎𝑎𝑖𝑖 ); 075�.
Then using the values of translations 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 (𝑎𝑎2 ) and 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 (𝑎𝑎3 ) their similarity with the original, i.e. proximity to the value of 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 (𝑎𝑎1 ) is established. The obtained results of the estimation allow not only to rank the compared linguistic objects, but also to find out the degree of their similarity. Table 6 contains the results of the estimation of 7 parts (chapters) of the texts of the original poem by Coleridge The Rime of the Ancient Mariner and its two translations by Gumilev and Levik, which were obtained using the similarity model for part of speech characteristics of the original.
28 Sergey N. Andreev and Vadim V. Borisov Table 6: Results of the estimation of 7 parts of the texts of the original poem by Coleridge The Rime of the Ancient Mariner and its two translations by Gumilev and Levik, obtained using the similarity model for part of speech characteristics of the original. Original (Coleridge) Translation (Gumilev) Translation (Levik) Parts PLR(a1) PLuR(a1) PuL(a1) PPoS(a1) PLR(a1) PLuR(a1) PuL(a1) PPoS(a1) PLR(a1) PLuR(a1) PuL(a1)PPoS(a1) Part 1 0.061 0.250 0.203 0.250 0.049 0.250 0.296 0.296 0.036 0.250 0.323 0.323 Part 2 0.05 0.250 0.235 0.250 0.017 0.217 0.322 0.322 0.046 0.215 0.329 0.329 Part 3 0.037 0.173 0.232 0.232 0.099 0.222 0.288 0.288 0.060 0.060 0.316 0.316 Part 4 0.103 0.221 0.213 0.221 0.103 0.191 0.267 0.267 0.029 0.176 0.345 0.345 Part 5 0.051 0.245 0.193 0.245 0.059 0.220 0.279 0.279 0.059 0.250 0.314 0.314 Part 6 0.048 0.192 0.218 0.218 0.163 0.250 0.285 0.285 0.076 0.229 0.273 0.273 Part 7 0.080 0.196 0.193 0.193 0.035 0.204 0.273 0.273 0.092 0.227 0.272 0.272
These results show that on the whole the translation by Gumilev is closer to the original in Parts 1–5 in which the idea of crime and punishment prevails. On the other hand, Part 6 and to some extent Part 7 in which the themes of atonement and relief of the Mariner (and the writer himself as it is implied in the text) by storytelling are expressed, are translated by Levik with a higher degree of exactness, thus revealing difference of approaches demonstrated by the two translators. Gumilev is much closer in his translation in unrhymed characteristics and also in unlocalized part of speech characteristics whereas Levic pays more attention to the rhymed words. Rhyme is one of important means of introducing vertical links into verse. Unrhymed words attract less attention, but together with unlocalized characteristics they allow Gumilev to render the distribution of parts of speech in the line. In other words, Levik creates better similarity in vertical organization, Gumilev – in horizontal aspect. The estimation was carried out with the following assumptions: – For the given example the evaluation model is unchanged for all parts (chapters) of the original poem. Obviously, it is possible to build different similarity models for different chapters of the text, and then estimation can be made for each chapter, using its “own” similarity model. – The comparison of these linguistic objects in the given example was made after normalizing the characteristics against the number of lines in the corresponding chapter. It is reasonable also to examine the possibilities of using this model with normalization of the values of characteristics against their maximum values and taking into account their weights. – While building the model in the given example we followed the strategy of examining the characteristics in the order “from the most consistent to the
Linguistic analysis based on fuzzy similarity models 29
least consistent». Using the other strategy of building the model one may widen the possibilities of the classification. Part of speech characteristics being highly important for authorship attribution, classification of styles and estimation of translation exactness (Andreev 2012, Koppel, Argamon, Shimoni 2002, Naumann, Popescu, Altmann 2012, Popescu, Čech, Altmann 2013), may further on be combined with characteristics, reflecting other linguistic layers: rhythm, rhyme, syntax and other features of verse (Köhler, Altmann 2014, 70-73). Such “parameter adjustment” is possible and will allow to focus on new aspects of classification. Similarity models can be built not only for the original but also for its translations. The results of the comparison of the structure of these models present interest as they contain information about interrelations and consistency of the estimated characteristics. Analysis of the dynamics of individual style development (establishing the degree of similarity of fragments of text). In this case the compared linguistic objects are different fragments (chapters) of a text or texts (see Table 6). Figure 7 reflects the dynamics of the values of 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 throughout the original text and its two translations.
Fig. 7: Dynamics of the characteristic 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 over the original text The Rime of the Ancient Mariner by Coleridge and its translations by Gumilev and Levik.
According to the analysis of the dynamics of these characteristics it is possible to make the following conclusions. Firstly, in the original the generalized characteristic 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 is influenced by the characteristic 𝑃𝑃𝐿𝐿𝐿𝐿𝐿𝐿 whereas in both Gumilev’s and Levil’s translations this
30 Sergey N. Andreev and Vadim V. Borisov characteristic 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 is mostly influenced by 𝑃𝑃𝑢𝑢𝑢𝑢 . This can explain higher values of 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 in both translations as compared with the values of this characteristic in the original. Secondly, for the original 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 remains stable practically for the whole of the text. Some deviation – tendency to decreasing – takes place only in Part 6, reaching the minimum value of 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 in Part 7. The same tendency is observed in both translations. Thirdly, there is a closer correlation of the changes of 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 in the text of the poem between the original and the translation by Gumilev than between the original and the translation of Levik. The same type of comparisons were made for the dynamics over the text of, 𝑃𝑃𝐿𝐿𝐿𝐿 , 𝑃𝑃𝐿𝐿𝐿𝐿𝐿𝐿 , 𝑃𝑃𝑢𝑢𝑢𝑢 , for the original and the translations (Fig. 8–10). Variation of the values of two characteristics 𝑃𝑃𝑢𝑢𝑢𝑢 and 𝑃𝑃𝐿𝐿𝐿𝐿𝐿𝐿 in all the three texts is rather moderate. Some deviation from this tendency is observed in Levik’s translation (Fig. 9) in Part 3 where the theme of punishment is introduced. On the other hand, the localized rhymed characteristics 𝑃𝑃𝐿𝐿𝐿𝐿 (Fig. 8) show a wide range of variation of their values in each text.
Fig. 8: Dynamics of the characteristic 𝑃𝑃𝐿𝐿𝐿𝐿 for the original text The Rime of the Ancient Mariner by Coleridge and its translations by Gumilev and Levik.
Linguistic analysis based on fuzzy similarity models 31
Fig. 9: Dynamics of the characteristic 𝑃𝑃𝐿𝐿𝐿𝐿𝐿𝐿 for the original text The Rime of the Ancient Mariner by Coleridge and its translations by Gumilev and Levik.
Fig. 10: Dynamics of the characteristic 𝑃𝑃𝑢𝑢𝑢𝑢 for the original text The Rime of the Ancient Mariner by Coleridge and its translations by Gumilev and Levik.
4 Conclusion In this chapter the relevance of building fuzzy similarity models for linguistic analysis has been discussed. The models are aimed at solving a wide range of tasks under conditions of uncertainty:
32 Sergey N. Andreev and Vadim V. Borisov – – –
estimation of the degree of similarity of the original text (poem) and its translations using different groups of characteristics; estimation of the similarity of parts of the compared poems (chapters); classification of texts and their features according to certain rules.
Similarity models for complicated linguistic object have been analyzed, their classification depending on the type of characteristics aggregation has been suggested. The relevance of building similarity models based on the fuzzy approach and aimed at solving various linguistic problems is outlined. The suggested fuzzy similarity models for linguistic analysis satisfy the following conditions: – generalized characteristic is formed on the basis of changing sets of partial characteristics; – possibility to aggregate heterogeneous characteristics (both quantitative and qualitative) which differ in measurement scales, range of variation and weight; – possibility to build a similarity model both with and without a priori grouping of the characteristics; – possibility to create a complex hierarchical structure of the similarity model, in which partial characteristics are divided into groups and subgroups; – using criterion levels of characteristics consistency for the identification of aggregation operations; – possibility to adjust (adapt) the model in a flexible way to the changes of the number of characteristics (addition, exclusion) and changes of characteristics (in consistency and weight).
Acknowledgment Research is supported by the State Task of the Ministry of Education and Science of the Russian Federation (the basic part, task # 2014/123, project # 2493) and Russian Foundation for Humanities (project # 14-04-00266).
References Andreev, Sergey. 2012. Literal vs. liberal translation – formal estimation. Glottometrics 23. 62– 69.
Linguistic analysis based on fuzzy similarity models 33 Borisov, Vadim V. 2014. Hybridization of intellectual technologies for analytical tasks of decision-making support. Journal of Computer Engineering and Informatics 2(1). 11–19. Borisov, Vadim V., Igor A. Bychkov, Andrey V. Dementyev, Alexander P. Solovyev & Alexander S. Fedulov. 2002. Kompyuternaya podderzhka slozhnykh organizacionno-tekhnicheskikh sistem [Computer support of complex organization-technical systems]. Moskva: Goryachaya linia – Telekom. Borisov, Vadim V. & Alexander S. Fedulov. 2004. Generalized rule-based fuzzy cognitive maps: Structure and dynamics model. In Nikhil R. Pal, Nikola Kasabov, Rajani K. Mudi, Srimanta Pal & Swapan K. Parui (eds.), Neural information processing (Lecture notes in computer science 3316), 918–922. Berlin & Heidelberg: Springer. Dubois, Didier & Henri Prade. 1988. Théorie des possibilités: Application à la représentation des connaissances en informatique. Paris: Masson. Dubois, Didier & Henri Prade. 1990. Teoriya vozmozhnostey. Prilozheniya k predstavleniyu v informatike [Possibility theory. Applications to the representation of knowledge in Informatics]. Moskva: Radio i svyaz. Hoover, David L. 2007. Corpus stylistics, stylometry and the styles of Henry James. Style 41(2). 160–189. Juola, Patrick. 2006. Authorship attribution. Foundations and Trends in Information Retrieval 1(3). 233–334. Keeney, Ralph L. & Howard Raiffa. 1981. Prinyatye resheniy pri mnogikh kriteriyakh: Predpochteniya i zameshcheniya [Decisions with multiple objectives: Preferences and value tradeoffs]. Moskva: Radio i svyaz. Köhler, Reinhard & Gabriel Altmann. 2014. Problems in Quantitative Linguistics 4. Lüdenscheid: RAM-Verlag. Koppel, Moshe, Shlomo Argamon & Anat R. Shimoni. 2002. Automatically categorizing written texts by author gender. Literary and Linguistic Computing 17(4). 401–412. Lee, Kwang H. 2004. First course on fuzzy theory and application. Berlin & Heidelberg: Springer. Mikros, George K. 2009. Content words in authorship attribution: An evaluation of stylometric features in a literary corpus. In Reinhard Köhler (ed.), Issues in Quantitative Linguistics, 61–75. Lüdenscheid: RAM-Verlag. Naumann, Sven, Ioan-Iovetz Popescu & Gabriel Altmann. 2012. Aspects of nominal style. Glottometrics 23. 23–35. Popescu, Ioan-Iovitz, Radek Čech & Gabriel Altmann. 2013. Descriptivity in Slovak lyrics. Glottotheory 4(1). 92–104. Rudman, Joseph. 2002. Non-traditional authorship attribution studies in eighteenth century literature: Stylistics statistics and the computer. Jahrbuch für Computerphilologie 4. 151– 166. Saaty, Thomas L. 1980. The analytic hierarchy process: Planning, priority setting, resource allocation. New York: McGraw - Hill. Zernov, Mikhail M. 2007. Sposob postroyeniya nechetkoy mnogokriteriyalnoy ocenochnoy modeli [Method of building fuzzy multicriteria estimation model]. Neyrokompyutery: Razrabotka i primeneniye 1, 40–49. Zimmermann, Hans-Jürgen & Peter Zysno. 1980. Latent connectives in human decision making. Fuzzy Sets and Systems 4(1), 37–51.
François Bavaud, Christelle Cocco, Aris Xanthos
Textual navigation and autocorrelation 1 Introduction Much work in the field of quantitative text processing and analysis has adopted one of two distinct symbolic representations: in the so-called bag-of-words model, text is conceived as an urn from which units (typically words) are independently drawn according to a specified probability distribution; alternatively, the sequential model describes text as a categorical time series, where each unit type occurs with a probability that depends, to some extent, on the context of occurrence. The present contribution pursues two related goals. First, it aims to generalise the standard sequential model of text by decoupling the order in which units occur from the order in which they are read. The latter can be represented by a Markov transition matrix between positions in the text, which makes it possible to account for a variety of ways of navigating the text, including in particular nonlinear and non-deterministic ones. Markov transitions thus define textual neighbourhoods as well as positional weights – the stationary distribution or prominence of textual positions. Building on the notion of textual neighbourhood, the second goal of this contribution is to introduce a unified framework for textual autocorrelation, namely the tendency for neighbouring positions to be more (or less) similar than randomly chosen positions with respect to some observable property – for instance whether the same unit types tend to occur, or units of similar length, or consisting of similar sub-units, and so on. Inspired from spatial analysis (see e.g. Cressie 1991; Anselin 1995; Bavaud 2013), this approach relates the above mentioned transition matrix (specifying neighbourhoodness) with a second matrix specifying the dissimilarity between textual positions. The remainder of this contribution is organised as follows. Section 2 introduces the foundations of the proposed formalism and illustrates them with toy examples. Section 3 presents several case studies intended to show how the formalism and some of its extensions apply to more realistic research problems involving, among others, morphosyntactic and semantic dissimilarities computed in literary or hypertextual documents. Conclusion briefly summarises the key ideas introduced in this contribution.
36 François Bavaud, Christelle Cocco, Aris Xanthos
2 Formalism 2.1 Textual navigation: positions, transitions and exchange matrix A text consists of n positions i=1,…,n. Each position contains an occurrence of a type (or term) α=1,…,v. Types may refer to characters, words (possibly lemmatised), sentences, or units from any other level of analysis. The occurrence of an instance of type a at position i is denoted o(i)=a.
Fig. 1: Fixations and ocular saccades during reading, illustrating the skimming navigation model. Source: Wikimedia Commons, from a study of speed reading made by Humanistlaboratoriet, Lund University (2005).
A simple yet non-trivial model of textual navigation can be specified by means of a regularn × n Markov transition matrix 𝑇𝑇𝑇𝑇 = (𝑡𝑡𝑡𝑡𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 ), obeying 𝑡𝑡𝑡𝑡𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 ≥ 0 and ∑𝑖𝑖𝑖𝑖 𝑡𝑡𝑡𝑡𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 = 1. Its stationary distribution 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 > 0, obeying ∑𝑖𝑖𝑖𝑖 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 𝑡𝑡𝑡𝑡𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 = 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 and ∑𝑖𝑖𝑖𝑖 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 = 1, characterises the prominence of position 𝑖𝑖𝑖𝑖 . For instance, neglecting boundary effects,1 1 Boundary effects are, if necessary, dealt with the usual techniques (reflecting boundary, periodic conditions, addition of a "rest state", etc.), supporting the formal fiction of an ever-reading agent embodied in stationary navigation.
Textual navigation and autocorrelation 37
Fig. 2: Word cloud (generated by http://www.wordle.net), aimed at illustrating the free or bagof-words navigation model with differing positional weights f.
– – –
𝑡𝑡𝑡𝑡𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 = 1(𝑗𝑗𝑗𝑗 = 𝑖𝑖𝑖𝑖 + 1) corresponds to standard linear reading2 𝑡𝑡𝑡𝑡𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 = 1(𝑖𝑖𝑖𝑖 < 𝑗𝑗𝑗𝑗 < 𝑖𝑖𝑖𝑖 + 𝑟𝑟𝑟𝑟 + 1)/𝑟𝑟𝑟𝑟 describes (directed) skimming with jumps of maximum length 𝑟𝑟𝑟𝑟 𝑟 1, as suggested by Figure 1 𝑡𝑡𝑡𝑡𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 = 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 defines the free or bag-of-words navigation, consisting of independent random jumps towards weighted positions (Markov chain of order 0), as suggested by Figure 2. Define the 𝑛𝑛𝑛𝑛 × 𝑛𝑛𝑛𝑛 exchange matrix 𝐸𝐸𝐸𝐸 = (𝑒𝑒𝑒𝑒𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 ) as 𝑒𝑒𝑒𝑒𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 ∶=
1 (𝑓𝑓𝑓𝑓 𝑡𝑡𝑡𝑡 + 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 𝑡𝑡𝑡𝑡𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 ) 2 𝑖𝑖𝑖𝑖 𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖
(1)
By construction, 𝐸𝐸𝐸𝐸 is symmetric, non-negative, with margins 𝑒𝑒𝑒𝑒𝑖𝑖𝑖𝑖𝑖 = 𝑒𝑒𝑒𝑒•𝑖𝑖𝑖𝑖 = 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 , and total 𝑒𝑒𝑒𝑒•• = 1.3 The exchange matrix constitutes a symmetrical measure of positional interaction, reading flow, or neighbourhoodness between textual positions 𝑖𝑖𝑖𝑖 and 𝑗𝑗𝑗𝑗. In particular, the following three exchange matrices correspond to the three transition matrices defined above (still neglecting boundary effects): – 𝑒𝑒𝑒𝑒𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 = [1(𝑗𝑗𝑗𝑗 = 𝑖𝑖𝑖𝑖 𝑖 1) + 1(𝑗𝑗𝑗𝑗 = 𝑖𝑖𝑖𝑖 + 1)]/2𝑛𝑛𝑛𝑛 for standard linear reading 2 Here and in the sequel, 1(𝐴𝐴𝐴𝐴) denotes the indicator function of event 𝐴𝐴𝐴𝐴, with value 1(𝐴𝐴𝐴𝐴) = 1 if 𝐴𝐴𝐴𝐴 is true, and 1(𝐴𝐴𝐴𝐴) = 0 otherwise. 3 Here and in the sequel, the notation "•" denotes summation over the replaced index, as in 𝑎𝑎𝑎𝑎•𝑖𝑖𝑖𝑖 : = ∑𝑖𝑖𝑖𝑖 𝑎𝑎𝑎𝑎𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 , 𝑎𝑎𝑎𝑎𝑖𝑖𝑖𝑖𝑖 : = ∑𝑖𝑖𝑖𝑖 𝑎𝑎𝑎𝑎𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 and 𝑎𝑎𝑎𝑎•• : = ∑𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 𝑎𝑎𝑎𝑎𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 .
38 François Bavaud, Christelle Cocco, Aris Xanthos – –
1
𝑒𝑒𝑒𝑒𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 = 1(|𝑖𝑖𝑖𝑖 𝑖 𝑖𝑖𝑖𝑖| ≤ 𝑟𝑟𝑟𝑟) 1(𝑖𝑖𝑖𝑖 𝑖 𝑖𝑖𝑖𝑖) for skimming with (undirected) jumps of 2𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟 maximum length 𝑟𝑟𝑟𝑟 𝑒𝑒𝑒𝑒𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 = 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 for the free or bag-of-words exchange matrix.
Toy example 1: consider a text consisting of four positions, either (periodically) linearly read as ...123412341... (A) or freely read (B). The previously introduced quantities are then 0 0 𝐴𝐴𝐴𝐴 𝑇𝑇𝑇𝑇 = � 0 1
1 0 0 0
0 1 0 0
1/4 0 0 1 1 1/4 0 𝐴𝐴𝐴𝐴 𝐴𝐴𝐴𝐴 � 𝑓𝑓𝑓𝑓 = � � 𝐸𝐸𝐸𝐸 = � 1/4 1 8 0 1/4 0 1
1 1 1 𝐵𝐵𝐵𝐵 𝑇𝑇𝑇𝑇 = � 4 1 1
1 1 1 1
1 1 1 1
1/4 1 1 1 1 1/4 1 𝐵𝐵𝐵𝐵 𝐵𝐵𝐵𝐵 � 𝑓𝑓𝑓𝑓 = � � 𝐸𝐸𝐸𝐸 = � 1/4 1 16 1 1/4 1 1
1 0 1 0
0 1 0 1 1 1 1 1
1 0 � 1 0 1 1 1 1
1 1 � 1 1
(2)
(3)
Note that (2) can also be conceived as an example of (periodic) skimming with jumps of maximum length 𝑟𝑟𝑟𝑟 = 1, which is indeed equivalent to (periodic) linear reading. Similarly, free navigation is equivalent to skimming with jumps of maximum size 𝑛𝑛𝑛𝑛, with the single difference that the latter forbids jumping towards the currently occupied position.
2.2 Autocorrelation index Let us now consider a matrix of dissimilarities 𝐷𝐷𝐷𝐷𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 between pairs of positions (𝑖𝑖𝑖𝑖, 𝑗𝑗𝑗𝑗). Here and in the sequel, we restrict ourselves to squared Euclidean dissimilarities, i.e. of the form 𝐷𝐷𝐷𝐷𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 =∥ 𝑥𝑥𝑥𝑥𝑖𝑖𝑖𝑖 − 𝑥𝑥𝑥𝑥𝑖𝑖𝑖𝑖 ∥2 , where ∥. ∥ denotes the Euclidean norm and 𝑥𝑥𝑥𝑥𝑖𝑖𝑖𝑖 , 𝑥𝑥𝑥𝑥𝑖𝑖𝑖𝑖 are 𝑝𝑝𝑝𝑝-dimensional vectors, for some 𝑝𝑝𝑝𝑝 𝑝 1. The average dissimilarity between a pair of randomly chosen positions defines the (global) inertia Δ, while the average dissimilarity between a pair of neighbouring positions defines the local inertia Δloc : Δ ∶=
1 � 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 𝐷𝐷𝐷𝐷𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 2 𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖
Δloc ∶=
1 � 𝑒𝑒𝑒𝑒𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 𝐷𝐷𝐷𝐷𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 2 𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖
(4)
A local inertia much smaller than the global inertia reflects the presence of textual autocorrelation: closer average similarity between neighbouring positions than
Textual navigation and autocorrelation 39
between randomly chosen positions. Conversely, negative autocorrelation characterizes a situation where neighbouring positions are more dissimilar than randomly chosen ones4. Textual autocorrelation can be measured by the autocorrelation index 𝛿𝛿𝛿𝛿 𝛿=
Δ − Δloc Δ
generalizing Moran's 𝐼𝐼𝐼𝐼 index of spatial statistics (Moran 1950). The latter holds in the one-dimensional case 𝐷𝐷𝐷𝐷𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 = (𝑥𝑥𝑥𝑥𝑖𝑖𝑖𝑖 − 𝑥𝑥𝑥𝑥𝑖𝑖𝑖𝑖 )2 , where 𝑥𝑥𝑥𝑥𝑖𝑖𝑖𝑖 , 𝑥𝑥𝑥𝑥𝑖𝑖𝑖𝑖 are scalars, Δ = var(𝑥𝑥𝑥𝑥), and Δloc = varloc (𝑥𝑥𝑥𝑥) (e.g. Lebart 1969). Under the null hypothesis 𝐻𝐻𝐻𝐻0 of absence of textual autocorrelation, the expected value of the autocorrelation index is not zero in general, but instead 𝐸𝐸𝐸𝐸0 (𝛿𝛿𝛿𝛿) =
trace(𝑊𝑊𝑊𝑊) − 1 , 𝑛𝑛𝑛𝑛 − 1
(5)
where 𝑊𝑊𝑊𝑊 = (𝑤𝑤𝑤𝑤𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 ) is the transition matrix of a reversible Markov chain defined as 𝑤𝑤𝑤𝑤𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 : = 𝑒𝑒𝑒𝑒𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 /𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 (so that 𝐸𝐸𝐸𝐸0 reduces to −1/(𝑛𝑛𝑛𝑛 𝑛 1) for off-diagonal exchange matrices). Similarly, under normal approximation, the variance reads Var0 (𝛿𝛿𝛿𝛿) =
2 (trace(𝑊𝑊𝑊𝑊) − 1)2 2) − 1 − �trace(𝑊𝑊𝑊𝑊 � , 𝑛𝑛𝑛𝑛2 − 1 𝑛𝑛𝑛𝑛 𝑛 1
(e.g. Cliff and Ord 1981; Bavaud 2013), thus making the autocorrelation index significant at level 𝛼𝛼𝛼𝛼 if �
𝛿𝛿𝛿𝛿 − 𝐸𝐸𝐸𝐸0 (𝛿𝛿𝛿𝛿) �Var0 (𝛿𝛿𝛿𝛿)
� ≥ 𝑢𝑢𝑢𝑢1−𝛼𝛼𝛼𝛼 , 2
(6)
where 𝑢𝑢𝑢𝑢𝛼𝛼𝛼𝛼 denotes the 𝛼𝛼𝛼𝛼-th quantile of the standard normal distribution. Toy example 1, continued: suppose that the types occurring at the four positions of example 1 are the following trigrams: 𝑜𝑜𝑜𝑜(1) = 𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼, 𝑜𝑜𝑜𝑜(2) = 𝛼𝛼𝛼𝛼𝛿𝛿𝛿𝛿𝛼𝛼𝛼𝛼, 𝑜𝑜𝑜𝑜(3) = 𝛼𝛼𝛼𝛼𝜖𝜖𝜖𝜖𝜖𝜖𝜖𝜖, and 𝑜𝑜𝑜𝑜(4) = 𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼. Define the (squared Euclidean) dissimilarity 𝐷𝐷𝐷𝐷𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 as the number of characters by which trigrams 𝑜𝑜𝑜𝑜(𝑖𝑖𝑖𝑖) and 𝑜𝑜𝑜𝑜(𝑗𝑗𝑗𝑗) differ: 0 2 𝐷𝐷𝐷𝐷 = � 3 0
2 0 2 2
3 2 0 3
0 2 � 3 0
4 Up to the bias associated to the contribution of self-comparisons: see (6).
40 François Bavaud, Christelle Cocco, Aris Xanthos Under linear periodic navigation (2), one gets the values Δ𝐴𝐴𝐴𝐴 = 3/4, Δ𝐴𝐴𝐴𝐴loc = 7/8 and 𝛿𝛿𝛿𝛿 𝐴𝐴𝐴𝐴 = −1/6, higher than 𝐸𝐸𝐸𝐸0 (𝛿𝛿𝛿𝛿 𝐴𝐴𝐴𝐴 ) = −1/3: the dissimilarity between immediate neighbours under linear periodic navigation is on average (after subtracting the bias 𝐸𝐸𝐸𝐸0 (𝛿𝛿𝛿𝛿 𝐴𝐴𝐴𝐴 )) smaller than the dissimilarity between randomly chosen pairs – although not significantly by the above normal test. By contrast, free navigation (3) yields Δ𝐵𝐵𝐵𝐵 = Δ𝐵𝐵𝐵𝐵loc = 3/4 and 𝛿𝛿𝛿𝛿 𝐵𝐵𝐵𝐵 = 0, since local and global inertia here coincide by construction. Note that 𝐸𝐸𝐸𝐸0 (𝛿𝛿𝛿𝛿 𝐵𝐵𝐵𝐵 ) = 0 and Var0 (𝛿𝛿𝛿𝛿 𝐵𝐵𝐵𝐵 ) = 0 in case of free navigation, regardless of the values of 𝐷𝐷𝐷𝐷.
2.3 Type dissimilarities In most cases, the dissimilarity 𝐷𝐷𝐷𝐷𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 between positions 𝑖𝑖𝑖𝑖 and 𝑗𝑗𝑗𝑗 depends only on the types 𝑜𝑜𝑜𝑜(𝑖𝑖𝑖𝑖) = 𝑎𝑎𝑎𝑎 and 𝑜𝑜𝑜𝑜(𝑗𝑗𝑗𝑗) = 𝑏𝑏𝑏𝑏 found at these positions. Thus, the calculation of the autocorrelation index can be based on the 𝑣𝑣𝑣𝑣 × 𝑣𝑣𝑣𝑣 type dissimilarity matrix 𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 rather than on the 𝑛𝑛𝑛𝑛 × 𝑛𝑛𝑛𝑛 position dissimilarity matrix – which makes it possible to simplify both computation (since 𝑣𝑣𝑣𝑣 < 𝑛𝑛𝑛𝑛 in general) and notation (cf. sections 3.2 and 3.3). Here are some examples of squared Euclidean type dissimilarities, i.e. of the form 𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 = ∥ 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 − 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 ∥2 where 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 ∈ ℝ𝑝𝑝𝑝𝑝 are the 𝑝𝑝𝑝𝑝-dimensional coordinates of type 𝑎𝑎𝑎𝑎, recoverable by multidimensional scaling (see section 3.4): –
– –
𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 = (𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 − 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 )2 where 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 characterises type 𝑎𝑎𝑎𝑎, e.g. 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 = "length of 𝑎𝑎𝑎𝑎" or 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 = 1(𝑎𝑎𝑎𝑎 𝑎 𝑎𝑎𝑎𝑎) (presence-absence dissimilarity with respect to property 𝐴𝐴𝐴𝐴) 𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 = 1(𝑎𝑎𝑎𝑎 𝑎 𝑎𝑎𝑎𝑎), the discrete metric 1
1
𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 = � + � 1(a≠b), the weighted discrete metric, where 𝜋𝜋𝜋𝜋𝑎𝑎𝑎𝑎 > 0 is the relπa
πb
ative proportion of type 𝑎𝑎𝑎𝑎, with ∑𝑎𝑎𝑎𝑎 𝜋𝜋𝜋𝜋𝑎𝑎𝑎𝑎 = 1 (Le Roux and Rouanet 2004; Ba-
vaud and Xanthos 2005) –
𝑝𝑝𝑝𝑝
𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 = ∑𝑘𝑘𝑘𝑘𝑘𝑘
𝑟𝑟𝑟𝑟•𝑘𝑘𝑘𝑘 𝑟𝑟𝑟𝑟𝑎𝑎𝑎𝑎𝑘𝑘𝑘𝑘 𝑟𝑟𝑟𝑟••
(
𝑟𝑟𝑟𝑟•• 𝑟𝑟𝑟𝑟𝑎𝑎𝑎𝑎𝑎 𝑟𝑟𝑟𝑟•𝑘𝑘𝑘𝑘
−
𝑟𝑟𝑟𝑟𝑏𝑏𝑏𝑏𝑘𝑘𝑘𝑘 𝑟𝑟𝑟𝑟•• 2 ) , the chi-square dissimilarity, 𝑟𝑟𝑟𝑟𝑏𝑏𝑏𝑏𝑏 𝑟𝑟𝑟𝑟•𝑘𝑘𝑘𝑘
used for compo-
site types made of distinguishable features, where 𝑛𝑛𝑛𝑛𝑎𝑎𝑎𝑎𝑘𝑘𝑘𝑘 is the type-feature matrix counting the occurrences of feature 𝑘𝑘𝑘𝑘 in type 𝑎𝑎𝑎𝑎.
In order to compute the autocorrelation index using a type dissimilarity matrix, a 𝑣𝑣𝑣𝑣 × 𝑣𝑣𝑣𝑣 type exchange matrix can be defined as 𝛼𝛼𝛼𝛼𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 = � � 𝑒𝑒𝑒𝑒𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 1(𝑜𝑜𝑜𝑜(𝑖𝑖𝑖𝑖) = 𝑎𝑎𝑎𝑎)1(𝑜𝑜𝑜𝑜(𝑗𝑗𝑗𝑗) = 𝑏𝑏𝑏𝑏) , 𝑖𝑖𝑖𝑖
𝑖𝑖𝑖𝑖
Textual navigation and autocorrelation 41
whose margins specify the relative distribution of types: 𝜋𝜋𝜋𝜋𝑎𝑎𝑎𝑎 = 𝛼𝛼𝛼𝛼𝑎𝑎𝑎𝑎𝑎 = 𝛼𝛼𝛼𝛼•𝑎𝑎𝑎𝑎 = ∑𝑖𝑖𝑖𝑖 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 1(𝑜𝑜𝑜𝑜(𝑖𝑖𝑖𝑖) = 𝑎𝑎𝑎𝑎). Global and local inertias (4) can then be calculated as 1 Δ = � 𝜋𝜋𝜋𝜋𝑎𝑎𝑎𝑎 𝜋𝜋𝜋𝜋𝑎𝑎𝑎𝑎 𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 2
1 Δloc = � 𝛼𝛼𝛼𝛼𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 2
𝑎𝑎𝑎𝑎,𝑎𝑎𝑎𝑎
𝑎𝑎𝑎𝑎,𝑎𝑎𝑎𝑎
(7)
The Markov transition probability from term 𝑎𝑎𝑎𝑎 to term 𝑏𝑏𝑏𝑏 is now 𝛼𝛼𝛼𝛼𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 /𝜋𝜋𝜋𝜋𝑎𝑎𝑎𝑎 . Following (5), the autocorrelation index 𝛿𝛿𝛿𝛿 = 1 − Δloc /Δ has to be compared with its expected value under independence 𝐸𝐸𝐸𝐸0 (𝛿𝛿𝛿𝛿) =
∑𝑣𝑣𝑣𝑣𝑎𝑎𝑎𝑎=1
𝜖𝜖𝜖𝜖𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 𝜋𝜋𝜋𝜋𝑎𝑎𝑎𝑎
−1
(8)
𝑣𝑣𝑣𝑣 − 1
which generally differs from (5). Indeed, the permutation-invariance implied by the null hypothesis 𝐻𝐻𝐻𝐻0 of absence of textual autocorrelation relies on permutations of positions 𝑖𝑖𝑖𝑖 = 1, … , 𝑛𝑛𝑛𝑛 in (5), while it considers permutations of terms 𝑎𝑎𝑎𝑎 = 1, … , 𝑣𝑣𝑣𝑣 in (8) – two natural although not equivalent assumptions. In the following, the position permutation test will be adopted by default, unless explicitly stated otherwise. Toy example 1, continued: recall that the set of types occurring in the text of example 1 is {𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼, 𝛼𝛼𝛼𝛼𝛿𝛿𝛿𝛿𝛼𝛼𝛼𝛼, 𝛼𝛼𝛼𝛼𝜖𝜖𝜖𝜖𝜖𝜖𝜖𝜖}. The type dissimilarity 𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 corresponding to the position dissimilarity 𝐷𝐷𝐷𝐷𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 previously used is defined as the number of characters by which trigrams 𝑎𝑎𝑎𝑎 and 𝑏𝑏𝑏𝑏 differ: 0 𝐷𝐷𝐷𝐷 = �2 3
2 0 2
3 2� 0
Under linear periodic navigation (2), the type exchange matrix and type proportions are 1 2 (𝛼𝛼𝛼𝛼𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 ) = �1 8 1
1 0 1
1 1� 0
𝜋𝜋𝜋𝜋 =
1 2 �1� , 4 1
yielding inertias (7) Δ𝐴𝐴𝐴𝐴 = 3/4 and Δ𝐴𝐴𝐴𝐴loc = 7/8 with 𝛿𝛿𝛿𝛿 = −1/6, as already obtained.
3 Case studies The next sections present several case studies involving in particular chi-squared dissimilarities between composite types such as play lines, hypertext navigation, and semantic dissimilarities (further illustrations, including Markov iterations 𝑊𝑊𝑊𝑊 𝑊 𝑊𝑊𝑊𝑊 𝑟𝑟𝑟𝑟 , may be found in Bavaud et al. 2012). Unless otherwise stated, we use
42 François Bavaud, Christelle Cocco, Aris Xanthos the "skimming" navigation model defined in section 2.1 (slightly adapted to handle border effects) and let the maximum length of jumps vary as 𝑟𝑟𝑟𝑟 = 1,2,3, …, yielding autocorrelation indices 𝛿𝛿𝛿𝛿 [𝑟𝑟𝑟𝑟] for neighbourhoods of size 𝑟𝑟𝑟𝑟, i.e. including 𝑟𝑟𝑟𝑟 neighbours to the left and 𝑟𝑟𝑟𝑟 neighbours to the right of each position. In particular, 𝛿𝛿𝛿𝛿 [1] constitutes a generalisation of the Durbin-Watson statistic.
3.1 Autocorrelation between lines of a play The play Sganarelle ou le Cocu imaginaire by Molière (1660) contains 𝑛𝑛𝑛𝑛 = 207 lines declaimed by feminine or masculine characters (coded 𝑠𝑠𝑠𝑠𝑖𝑖𝑖𝑖 = 1 or 𝑠𝑠𝑠𝑠𝑖𝑖𝑖𝑖 = 0 respectively). Each line 𝑖𝑖𝑖𝑖 is also characterised by the number of occurrences 𝑛𝑛𝑛𝑛𝑖𝑖𝑖𝑖𝑘𝑘𝑘𝑘 of each part-of-speech (POS) tag 𝑘𝑘𝑘𝑘 = 1, … , 𝑝𝑝𝑝𝑝 as assigned by TreeTagger (Schmid 1994); here 𝑝𝑝𝑝𝑝 = 28. The first few rows and columns of the data are represented on Table 1. Table 1: First rows and columns of the Sganarelle data Position
1 2 3 4 …
– – – –
Gender # interjections 1 0 1 0 …
1 1 1 3 …
# adverbs # verbs (present) … 2 11 0 15 …
1 20 0 15 …
… … … … …
The following distances are considered: length the length dissimilarity 𝐷𝐷𝐷𝐷𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 = (𝑙𝑙𝑙𝑙𝑖𝑖𝑖𝑖 − 𝑙𝑙𝑙𝑙𝑖𝑖𝑖𝑖 )2 , where 𝑙𝑙𝑙𝑙𝑖𝑖𝑖𝑖 ∶= ∑𝑘𝑘𝑘𝑘 𝑛𝑛𝑛𝑛𝑖𝑖𝑖𝑖𝑘𝑘𝑘𝑘 is the total count of POS tags for line 𝑖𝑖𝑖𝑖
gender
the gender dissimilarity 𝐷𝐷𝐷𝐷𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖
the chi-square dissimilarity
𝜒𝜒𝜒𝜒 𝐷𝐷𝐷𝐷𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖
= (𝑠𝑠𝑠𝑠𝑖𝑖𝑖𝑖 − 𝑠𝑠𝑠𝑠𝑖𝑖𝑖𝑖 )2 = 1(𝑠𝑠𝑠𝑠𝑖𝑖𝑖𝑖 ≠ 𝑠𝑠𝑠𝑠𝑖𝑖𝑖𝑖 )
associated to the 207 × 28 contingency table
𝑛𝑛𝑛𝑛𝑖𝑖𝑖𝑖𝑘𝑘𝑘𝑘 (see section 2.3 d for the corresponding type dissimilarity) 𝑅𝑅𝑅𝑅𝜒𝜒𝜒𝜒
the reduced chi-square dissimilarity 𝐷𝐷𝐷𝐷𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 obtained after aggregating all POS
tag counts into two supercategories, namely verbs and non-verbs.
The length autocorrelation index (Figure 3 left) reveals that the length of lines tends to be strongly autocorrelated over neighborhoods of size up to 5: long (short) lines tend to be surrounded at short range by long (short) lines. This might reflect the play structure, which comprises long passages declaiming general considerations on human condition, and more action-oriented passages, made of shorter lines.
Textual navigation and autocorrelation 43
Fig. 3: Autocorrelation index 𝛿𝛿𝛿𝛿 [𝑟𝑟𝑟𝑟] (circles), together with expected value (5) (solid line), and 95% confidence interval (6) (dashed lines). Left: length dissimilarity 𝐷𝐷𝐷𝐷𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑟𝑟𝑟𝑟𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑙ℎ . Right: gender dissimilarity 𝐷𝐷𝐷𝐷 𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑔𝑔𝑔𝑔𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑙 .
Fig. 4: Autocorrelation index 𝛿𝛿𝛿𝛿 [𝑟𝑟𝑟𝑟] (same setting as figure 3) for the chi-square dissimilarity 𝐷𝐷𝐷𝐷 𝜒𝜒𝜒𝜒 (left) and the reduced chi-square dissimilarity 𝐷𝐷𝐷𝐷𝑅𝑅𝑅𝑅𝜒𝜒𝜒𝜒 (right).
The strong negative gender autocorrelation observed on figure 3 (right) for a neighbourhood size of 1 shows that lines declaimed by characters of a given gender have a clear tendency to be immediately followed by lines declaimed by characters of the other gender, and vice-versa. The significant positive autocorrelation for neighbourhoods of size 2 seems to be a logical consequence of this, as well as the globally alternating pattern of the curve. Interestingly, the autocorrelation is always positive for larger neighbourhood sizes, which can be explained by two observations: (i) overall, masculine lines are considerably more frequent
44 François Bavaud, Christelle Cocco, Aris Xanthos than feminine lines (64.7% vs. 35.3%); (ii) the probability of being followed by a line of differing gender is much lower for masculine lines than for feminine ones (44.4% vs. 82.2%). These factors concur to dominate the short-range preference for alternation. The POS tag profile of lines tends to exhibit no autocorrelation, although the alignments observed in Figure 4 (left) are intriguing. The proportion of verbs (Figure 4 right) tends to be positively (but not significantly, presumably due to the small size 𝑛𝑛𝑛𝑛 = 207 of the sample) autocorrelated up to a range of 10, and negatively autocorrelated for a range between 20 and 30 – an observation whose interpretation requires further investigation.
3.2 Free navigation within documents Let the text be partitioned into documents 𝑔𝑔𝑔𝑔 = 1, … , 𝑚𝑚𝑚𝑚; 𝑖𝑖𝑖𝑖 𝑖 𝑖𝑖𝑖𝑖 denotes the membership of position 𝑖𝑖𝑖𝑖 to document 𝑔𝑔𝑔𝑔 and 𝜌𝜌𝜌𝜌𝑙𝑙𝑙𝑙 : = ∑𝑖𝑖𝑖𝑖𝑖𝑙𝑙𝑙𝑙 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 is the relative weight of document 𝑔𝑔𝑔𝑔. Consider now the free textual navigation confined within each document 𝑙𝑙𝑙𝑙[𝑖𝑖𝑖𝑖]
𝑡𝑡𝑡𝑡𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 = 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖
∶=
1(𝑗𝑗𝑗𝑗 ∈ 𝑔𝑔𝑔𝑔[𝑖𝑖𝑖𝑖]) 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 , 𝜌𝜌𝜌𝜌𝑙𝑙𝑙𝑙[𝑖𝑖𝑖𝑖]
(9)
where 𝑔𝑔𝑔𝑔[𝑖𝑖𝑖𝑖] denotes the document to which position 𝑖𝑖𝑖𝑖 belongs. The associated ex𝑙𝑙𝑙𝑙 𝑙𝑙𝑙𝑙 change matrix 𝑒𝑒𝑒𝑒𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 : = ∑𝑚𝑚𝑚𝑚 𝑙𝑙𝑙𝑙=1 𝜌𝜌𝜌𝜌𝑙𝑙𝑙𝑙 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 is reducible, i.e. made out 𝑚𝑚𝑚𝑚 disconnected submatrices. Note that 𝑓𝑓𝑓𝑓 obtains here as the margin of 𝐸𝐸𝐸𝐸, rather as the stationary distribution of 𝑇𝑇𝑇𝑇, which is reducible and hence not regular. In this setup, the local inertia is nothing but the within-groups inertia Δloc = � 𝜌𝜌𝜌𝜌𝑙𝑙𝑙𝑙 Δ𝑙𝑙𝑙𝑙 = : Δ𝑊𝑊𝑊𝑊 𝑙𝑙𝑙𝑙
Δ𝑙𝑙𝑙𝑙 ∶=
1 𝑙𝑙𝑙𝑙 𝑙𝑙𝑙𝑙 � 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 𝐷𝐷𝐷𝐷𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 2 𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖
and hence 𝛿𝛿𝛿𝛿 = Δ𝐵𝐵𝐵𝐵 /Δ ≥ 0, where Δ𝐵𝐵𝐵𝐵 = Δ − Δ𝑊𝑊𝑊𝑊 = ∑𝑙𝑙𝑙𝑙 𝜌𝜌𝜌𝜌𝑙𝑙𝑙𝑙 𝐷𝐷𝐷𝐷𝑙𝑙𝑙𝑙0 is the between-groups inertia, and 𝐷𝐷𝐷𝐷𝑙𝑙𝑙𝑙0 is the dissimilarity between the centroid of the group 𝑔𝑔𝑔𝑔 and the overall centroid 0. Here 𝛿𝛿𝛿𝛿, always non negative, behaves as a kind of generalised 𝐹𝐹𝐹𝐹-ratio. In practical applications, textual positional weights are uniform, and the free navigation within documents involves the familiar term-document matrix
with
𝑛𝑛𝑛𝑛𝑎𝑎𝑎𝑎𝑙𝑙𝑙𝑙 : = 𝑛𝑛𝑛𝑛•• � 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 1(𝑜𝑜𝑜𝑜(𝑖𝑖𝑖𝑖) = 𝑎𝑎𝑎𝑎)1(𝑖𝑖𝑖𝑖 ∈ 𝑔𝑔𝑔𝑔) 𝑖𝑖𝑖𝑖
(10)
Textual navigation and autocorrelation 45
𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 =
1 𝑛𝑛𝑛𝑛••
𝑙𝑙𝑙𝑙
𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 =
1(𝑖𝑖𝑖𝑖 ∈ 𝑔𝑔𝑔𝑔) 𝑛𝑛𝑛𝑛•𝑙𝑙𝑙𝑙
𝜌𝜌𝜌𝜌𝑙𝑙𝑙𝑙 =
𝑛𝑛𝑛𝑛•𝑙𝑙𝑙𝑙 . 𝑛𝑛𝑛𝑛••
(11)
In particular, Δ, Δloc and 𝛿𝛿𝛿𝛿 can be computed from (7), where 𝛼𝛼𝛼𝛼𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 = � 𝑙𝑙𝑙𝑙
𝑛𝑛𝑛𝑛𝑎𝑎𝑎𝑎𝑙𝑙𝑙𝑙 𝑛𝑛𝑛𝑛𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 𝑛𝑛𝑛𝑛•• 𝑛𝑛𝑛𝑛•𝑙𝑙𝑙𝑙
𝜋𝜋𝜋𝜋𝑎𝑎𝑎𝑎 =
𝑛𝑛𝑛𝑛𝑎𝑎𝑎𝑎• 𝑛𝑛𝑛𝑛••
(free within–documents navigation). (12)
The significance of 𝛿𝛿𝛿𝛿 = (Δ − Δloc )/Δ can be tested by (6), where trace(𝑊𝑊𝑊𝑊 2 ) = trace(𝑊𝑊𝑊𝑊) = 𝑚𝑚𝑚𝑚 for the free within-documents navigation under the position permutation test (5). 𝑟𝑟𝑟𝑟 𝑟𝑟𝑟𝑟 When 𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 = ( •• + •• )1(𝑎𝑎𝑎𝑎 𝑎 𝑎𝑎𝑎𝑎) is the weighted discrete metric (section 2.3 𝑟𝑟𝑟𝑟𝑎𝑎𝑎𝑎𝑎
𝑟𝑟𝑟𝑟𝑏𝑏𝑏𝑏𝑏
c), the autocorrelation index turns out to be 𝛿𝛿𝛿𝛿 =
𝜒𝜒𝜒𝜒 2 𝑛𝑛𝑛𝑛•• (𝑣𝑣𝑣𝑣 − 1)
(13)
where 𝑣𝑣𝑣𝑣 is the number of types and 𝜒𝜒𝜒𝜒 2 the term-document chi-square. Toy example 2: consider 𝑣𝑣𝑣𝑣 = 7 types represented by greek letters, whose 𝑛𝑛𝑛𝑛 = 𝑛𝑛𝑛𝑛•• = 20 occurrences possess the same weight 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 = 1/20, and are distributed among 𝑚𝑚𝑚𝑚 = 4 documents as "𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛿𝛿𝛿𝛿", "𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼", "𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼" and "𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝛼𝜖𝜖𝜖𝜖𝜖𝜖𝜖𝜖𝜖𝜖𝜖𝜖" (Figure 5). The term-document matrix, term weights and document weights read 𝜶𝜶𝜶𝜶 ⎛ ⎜ 𝜷𝜷𝜷𝜷 𝜸𝜸𝜸𝜸 �𝑛𝑛𝑛𝑛𝑎𝑎𝑎𝑎𝑙𝑙𝑙𝑙 � = ⎜ ⎜ 𝜹𝜹𝜹𝜹 ⎜ 𝝐𝝐𝝐𝝐 𝜻𝜻𝜻𝜻 ⎝ 𝜼𝜼𝜼𝜼
𝒈𝒈𝒈𝒈 = 𝟏𝟏𝟏𝟏 0 2 1 1 0 0 0
𝒈𝒈𝒈𝒈 = 𝟐𝟐𝟐𝟐 𝒈𝒈𝒈𝒈 = 𝟑𝟑𝟑𝟑 2 2 0 2 1 0 0 0 1 0 0 0 0 0
𝒈𝒈𝒈𝒈 = 𝟒𝟒𝟒𝟒 8 4 ⎞ 4 0 ⎟ 1 ⎛2⎞ 1 ⎜ ⎟ 1 1 0 ⎟ 𝜋𝜋𝜋𝜋 = 1 ⎟ 𝜌𝜌𝜌𝜌 = � � 0 ⎟ 20 ⎜ 5 1 ⎜2⎟ 1 ⎟ 2 2 2 ⎝1⎠ 1 ⎠
(14)
Consider three type dissimilarities, namely the "vowels" presence-absence dissimilarity 𝐷𝐷𝐷𝐷 𝐴𝐴𝐴𝐴 (section 2.3 a, with 𝐴𝐴𝐴𝐴 = {𝛼𝛼𝛼𝛼, 𝛼𝛼𝛼𝛼} and 𝐴𝐴𝐴𝐴𝑐𝑐𝑐𝑐 = {𝛼𝛼𝛼𝛼, 𝛼𝛼𝛼𝛼, 𝛿𝛿𝛿𝛿, 𝜖𝜖𝜖𝜖, 𝜖𝜖𝜖𝜖}): 0 1 ⎛1 𝐷𝐷𝐷𝐷 𝐴𝐴𝐴𝐴 = ⎜ ⎜1 ⎜0 1 ⎝1
1 0 0 0 1 0 0
the discrete metric 𝐷𝐷𝐷𝐷𝐵𝐵𝐵𝐵 (section 2.3 b):
1 0 0 0 1 0 0
1 0 0 0 1 0 0
0 1 1 1 0 1 1
1 0 0 0 1 0 0
1 0 0⎞ ⎟ 0⎟ , 1⎟ 0 0⎠
46 François Bavaud, Christelle Cocco, Aris Xanthos
0 1 ⎛1 ⎜ 𝐷𝐷𝐷𝐷𝐵𝐵𝐵𝐵 = ⎜ 1 ⎜1 1 ⎝1
1 0 1 1 1 1 1
1 1 0 1 1 1 1
1 1 1 0 1 1 1
1 1 1 1 0 1 1
1 1 1 1 1 0 1
1 1 1⎞ ⎟ 1⎟ , 1⎟ 1 0⎠
and the weighted discrete metric 𝐷𝐷𝐷𝐷𝐶𝐶𝐶𝐶 (section 2.3 c): 0 7.5 7.5 0 ⎛ 12.5 15 ⎜ 𝐷𝐷𝐷𝐷𝐶𝐶𝐶𝐶 = ⎜ 22.5 25 ⎜ 12.5 15 12.5 15 ⎝ 22.5 25
12.5 15 0 30 20 20 30
22.5 25 30 0 30 30 40
12.5 15 20 30 0 20 30
12.5 15 20 30 20 0 30
22.5 25 ⎞ 30 ⎟ 40 ⎟ . 30 ⎟ 30 0 ⎠
The corresponding values of global inertias (7), local inertias (12) and textual autocorrelation 𝛿𝛿𝛿𝛿 are given in Table 2 below. Sganarelle, continued: consider the distribution of the 961 nouns and 1'204 verbs of the play Sganarelle among the 𝑚𝑚𝑚𝑚 = 24 scenes of the play, treated here as documents (section 3.1). The autocorrelation index (13) for nouns associated to the weighted discrete metric takes on the value 𝛿𝛿𝛿𝛿 nouns = 0.0238, lower than the expected value (5) 𝐸𝐸𝐸𝐸0 (𝛿𝛿𝛿𝛿 nouns ) = 0.0240. For verbs, one gets 𝛿𝛿𝛿𝛿 verbs = 0.0198 > 𝐸𝐸𝐸𝐸0 (𝛿𝛿𝛿𝛿 verbs ) = 0.0191. Although not statistically significant, the sign of the differences reveals a lexical content within scenes more homogeneous for verbs than for nouns. Finer analysis can be obtained from Correspondence Analysis (see e.g. Greenacre 2007), performing a spectral decomposition of the chi-square in (13).
3.3 Hypertext navigation Consider a set 𝐺𝐺𝐺𝐺 of electronic documents 𝑔𝑔𝑔𝑔 = 1, … , 𝑚𝑚𝑚𝑚 containing hyperlinks attached to a set 𝐴𝐴𝐴𝐴 of active terms and specified by a function 𝛼𝛼𝛼𝛼[𝑎𝑎𝑎𝑎] from 𝐴𝐴𝐴𝐴 to 𝐺𝐺𝐺𝐺, associating each active term 𝑎𝑎𝑎𝑎 to a target document 𝑔𝑔𝑔𝑔 = 𝛼𝛼𝛼𝛼[𝑎𝑎𝑎𝑎]. A simple model of hypertext navigation consists in clicking at each position occupied by an active term, thus jumping to the target document, while staying in the same document when meeting an inactive term; in both cases, the next position 𝑖𝑖𝑖𝑖 is selected as 𝑙𝑙𝑙𝑙 𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 in (11). This dynamics generates a document to document transition matrix Φ = (𝜑𝜑𝜑𝜑𝑙𝑙𝑙𝑙ℎ ), involving the term-document matrix 𝑛𝑛𝑛𝑛𝑎𝑎𝑎𝑎𝑙𝑙𝑙𝑙 (10), as
Textual navigation and autocorrelation 47
𝜑𝜑𝜑𝜑𝑙𝑙𝑙𝑙ℎ : = � 𝑎𝑎𝑎𝑎
𝑛𝑛𝑛𝑛𝑎𝑎𝑎𝑎𝑙𝑙𝑙𝑙 𝜏𝜏𝜏𝜏 𝑛𝑛𝑛𝑛•𝑙𝑙𝑙𝑙 (𝑎𝑎𝑎𝑎𝑙𝑙𝑙𝑙)ℎ
(15)
where 𝜏𝜏𝜏𝜏(𝑎𝑎𝑎𝑎𝑙𝑙𝑙𝑙)ℎ is the probability to jump from term 𝑎𝑎𝑎𝑎 in document 𝑔𝑔𝑔𝑔 to document ℎ and obeys 𝜏𝜏𝜏𝜏(𝑎𝑎𝑎𝑎𝑙𝑙𝑙𝑙)• = 1. In the present setup, 𝜏𝜏𝜏𝜏(𝑎𝑎𝑎𝑎𝑙𝑙𝑙𝑙)ℎ = 1(ℎ = 𝛼𝛼𝛼𝛼[𝑎𝑎𝑎𝑎]) for 𝑎𝑎𝑎𝑎 𝑎 𝑎𝑎𝑎𝑎 and 𝜏𝜏𝜏𝜏(𝑎𝑎𝑎𝑎𝑙𝑙𝑙𝑙)ℎ = 1(ℎ = 𝑔𝑔𝑔𝑔) for 𝑎𝑎𝑎𝑎 𝑎̸ 𝐴𝐴𝐴𝐴. Alternative specifications taking into account clicking probabilities, or contextual effects (of term 𝑎𝑎𝑎𝑎 relatively to its background 𝑔𝑔𝑔𝑔) could also be cast within this formalism. The document-to-document transition matrix obeys 𝜑𝜑𝜑𝜑𝑙𝑙𝑙𝑙ℎ ≥ 0 and 𝜑𝜑𝜑𝜑𝑙𝑙𝑙𝑙• = 1, and the broad Markovian family of hypertext navigations (15) generalizes specific proposals such as the free within-documents setup, or the Markov chain associated to the PageRank algorithm (Page 2001). By standard Markovian theory (e.g. Grinstead and Snell 1998), each document belongs to a single "communication-based" equivalence class, which is either transient, i.e. consisting of documents eventually unattainable by lack of incoming hyperlinks, or recurrent, i.e. consisting of documents visited again and again once the chain has entered into the class. The chain is regular iff it is aperiodic and consists of a single recurrent class, in which case its evolution converges to the stationary distribution 𝑠𝑠𝑠𝑠 of Φ obeying ∑𝑙𝑙𝑙𝑙 𝑠𝑠𝑠𝑠𝑙𝑙𝑙𝑙 𝜑𝜑𝜑𝜑𝑙𝑙𝑙𝑙ℎ = 𝑠𝑠𝑠𝑠ℎ , which differs in general from the document weights 𝜌𝜌𝜌𝜌𝑙𝑙𝑙𝑙 = 𝑛𝑛𝑛𝑛•𝑙𝑙𝑙𝑙 /𝑛𝑛𝑛𝑛•• . In the regular case, textual autocorrelation for type dissimilarities (section 2.3) can be computed by means of (7), where (compare with (12))
and
𝛼𝛼𝛼𝛼𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 = 𝜖𝜖𝜖𝜖(𝑎𝑎𝑎𝑎𝑎)(𝑎𝑎𝑎𝑎•)
𝜖𝜖𝜖𝜖(𝑎𝑎𝑎𝑎𝑙𝑙𝑙𝑙)(𝑎𝑎𝑎𝑎ℎ) : =
𝜋𝜋𝜋𝜋𝑎𝑎𝑎𝑎 = 𝜖𝜖𝜖𝜖(𝑎𝑎𝑎𝑎𝑎)(••) = � 𝑙𝑙𝑙𝑙
𝑛𝑛𝑛𝑛𝑎𝑎𝑎𝑎𝑙𝑙𝑙𝑙 𝑛𝑛𝑛𝑛𝑎𝑎𝑎𝑎𝑎 𝑠𝑠𝑠𝑠 ≠ 𝑛𝑛𝑛𝑛•𝑙𝑙𝑙𝑙 𝑙𝑙𝑙𝑙 𝑛𝑛𝑛𝑛••
1 𝑛𝑛𝑛𝑛𝑎𝑎𝑎𝑎𝑙𝑙𝑙𝑙 𝑛𝑛𝑛𝑛𝑎𝑎𝑎𝑎ℎ [𝜏𝜏𝜏𝜏 𝑠𝑠𝑠𝑠 + 𝜏𝜏𝜏𝜏(𝑎𝑎𝑎𝑎ℎ)𝑙𝑙𝑙𝑙 𝑠𝑠𝑠𝑠ℎ ] 2 𝑛𝑛𝑛𝑛•𝑙𝑙𝑙𝑙 𝑛𝑛𝑛𝑛•ℎ (𝑎𝑎𝑎𝑎𝑙𝑙𝑙𝑙)ℎ 𝑙𝑙𝑙𝑙 (hypertextual navigation).
Toy example 2, continued: let the active terms be 𝐴𝐴𝐴𝐴 = {𝛼𝛼𝛼𝛼, 𝛼𝛼𝛼𝛼, 𝛼𝛼𝛼𝛼, 𝛿𝛿𝛿𝛿}, with hyperlinks 𝛼𝛼𝛼𝛼[𝛼𝛼𝛼𝛼] = 1, 𝛼𝛼𝛼𝛼[𝛼𝛼𝛼𝛼] = 2, 𝛼𝛼𝛼𝛼[𝛼𝛼𝛼𝛼] = 3 and 𝛼𝛼𝛼𝛼[𝛿𝛿𝛿𝛿] = 4 (Figure 5). The transition matrix (15) turns out to be regular. From (14), the document-document transition probability, its stationary distribution and the document weights are 0 1/2 Φ=� 1/2 1/2
1/2 1/4 1/2 0
1/4 1/4 0 0
1/4 0 � 0 1/2
1/3 1/3 𝑠𝑠𝑠𝑠 = � � 1/6 1/6
1/5 1/5 𝜌𝜌𝜌𝜌 = � � . 1/5 2/5
In a certain sense, hyperlink navigation magnifies the importance of each document 𝑔𝑔𝑔𝑔 by a factor 𝑠𝑠𝑠𝑠𝑙𝑙𝑙𝑙 /𝜌𝜌𝜌𝜌𝑙𝑙𝑙𝑙 , respectively equal to 5/3, 5/3, 5/6 and 5/12 for the 𝑚𝑚𝑚𝑚 =
48 François Bavaud, Christelle Cocco, Aris Xanthos 4 documents of toy example 2. Similary, the term magnification factor 𝑛𝑛𝑛𝑛•• 𝜋𝜋𝜋𝜋𝑎𝑎𝑎𝑎 /𝑛𝑛𝑛𝑛𝑎𝑎𝑎𝑎𝑎 is 1.04 for 𝛼𝛼𝛼𝛼, 0.83 for 𝛼𝛼𝛼𝛼, 1.67 for 𝛼𝛼𝛼𝛼, 1.67 for 𝛿𝛿𝛿𝛿, 1.04 for 𝛼𝛼𝛼𝛼, 0.42 for 𝜖𝜖𝜖𝜖 and 0.42 for 𝜖𝜖𝜖𝜖.
Fig. 5: Hypertextual navigation (toy example 2) between 𝑚𝑚𝑚𝑚 = 4 documents containing |𝐴𝐴𝐴𝐴| = 4 active terms 𝛼𝛼𝛼𝛼, 𝛼𝛼𝛼𝛼, 𝛼𝛼𝛼𝛼, and 𝛿𝛿𝛿𝛿. Table 2: (Toy example 2) terms autocorrelation is positive under free within document navigation, but negative under hypertextual navigation. Here 𝐸𝐸𝐸𝐸0 (𝛿𝛿𝛿𝛿) refers to the term permutation test (8). free within-document navigation 𝑫𝑫𝑫𝑫𝑨𝑨𝑨𝑨 𝑫𝑫𝑫𝑫𝑩𝑩𝑩𝑩 𝑫𝑫𝑫𝑫𝑪𝑪𝑪𝑪
𝚫𝚫𝚫𝚫 0.25 0.38 6.00
𝚫𝚫𝚫𝚫𝐥𝐥𝐥𝐥𝐥𝐥𝐥𝐥𝐥𝐥𝐥𝐥 0.18 0.31 4.94
𝜹𝜹𝜹𝜹 0.28 0.20 0.18
𝑬𝑬𝑬𝑬𝟎𝟎𝟎𝟎 (𝜹𝜹𝜹𝜹) 0.16 0.16 0.16
hypertextual navigation 𝚫𝚫𝚫𝚫 0.25 0.39 6.15
𝚫𝚫𝚫𝚫𝐥𝐥𝐥𝐥𝐥𝐥𝐥𝐥𝐥𝐥𝐥𝐥 0.36 0.48 6.90
𝜹𝜹𝜹𝜹 -0.47 -0.24 -0.12
𝑬𝑬𝑬𝑬𝟎𝟎𝟎𝟎 (𝜹𝜹𝜹𝜹) -0.10 -0.10 -0.10
Table 2 summarizes the resulting textual autocorrelation for the three term dissimilarities already investigated: systematically, hyperlink navigation strongly increases the terms heterogeneity as measured by the local inertia, since each of the active terms 𝛼𝛼𝛼𝛼, 𝛼𝛼𝛼𝛼, 𝛼𝛼𝛼𝛼 and 𝛿𝛿𝛿𝛿 point towards documents not containing them. Textual autocorrelation in "WikiTractatus": strict application of the above formalism to real data requires full knowledge of a finite, isolated network of 𝑚𝑚𝑚𝑚 hyperlinked documents.
Textual navigation and autocorrelation 49
Fig. 6: Entry page of "WikiTractatus", constituting the document "les mots" (left). Free-within navigation weights 𝜌𝜌𝜌𝜌𝑙𝑙𝑙𝑙 versus hypertextual weights 𝑠𝑠𝑠𝑠𝑙𝑙𝑙𝑙 , logarithmic scale (right).
The site "WikiTractatus" is a "network of aphorisms" (in French) created by André Ourednik (2010), containing 𝑛𝑛𝑛𝑛 = 27′ 172 words (tokens) distributed among 𝑣𝑣𝑣𝑣 = 5′ 487 terms and 𝑚𝑚𝑚𝑚 = 380 documents (Figure 6). Each document is identified by a distinct active term, as in a dictionary notice, and hyperlinks connect each active term in the corpus to the corresponding document. Also, each document contains hyperlinks pointing towards distinct documents, with the exception of the document "clavier" which contains no active terms. As a consequence, "clavier" acts as an absorbing state of the Markov chain (15), and all remaining documents are transient – as attested by the study of Φ𝑟𝑟𝑟𝑟 , converging for 𝑟𝑟𝑟𝑟 large towards a null matrix except for a unit column vector associated to the document "clavier". Suppressing document "clavier" together with its incoming hyperlinks makes the Markov chain regular. In contrast to Table 2, terms are positively autocorrelated under hypertextual navigation on "WikiTractatus": with the discrete metric 𝐷𝐷𝐷𝐷 and the term permutation test (8), one finds Δ = 0.495, Δloc = 0.484, 𝛿𝛿𝛿𝛿 = 0.023 and 𝐸𝐸𝐸𝐸0 (𝛿𝛿𝛿𝛿) = 0.014. In the same setup, the free within-documents navigation yields very close results, namely Δ = 0.496, Δloc = 0.484, 𝛿𝛿𝛿𝛿 = 0.024 and 𝐸𝐸𝐸𝐸0 (𝛿𝛿𝛿𝛿) = 0.015. Here both types of navigation have identical effects on textual autocorrelation per se, but quite different effects on the document (and type) relative weights (Figure 6 right).
50 François Bavaud, Christelle Cocco, Aris Xanthos
3.4 Semantic autocorrelation Semantic similarities have been systematically investigated in the last two decades, using in particular reference word taxonomies expressing "ontological" relationships (e.g. Resnik 1999). In WordNet (Miller et al. 1990), words, and in particular nouns and verbs, are grouped into synsets, i.e. cognitive synonyms, and each synset represents a different concept. Hyponymy expresses inclusion between concepts: the relation "concept 𝑐𝑐𝑐𝑐1 is an instance of concept 𝑐𝑐𝑐𝑐2 " is denoted 𝑐𝑐𝑐𝑐1 ≤ 𝑐𝑐𝑐𝑐2 , and 𝑐𝑐𝑐𝑐1 ∨ 𝑐𝑐𝑐𝑐2 represents the least general concept subsuming both 𝑐𝑐𝑐𝑐1 and 𝑐𝑐𝑐𝑐2 . For instance, in the toy ontology of Figure 7, cat ≤ animal and cat ∨ dog = animal.
Fig. 7: Toy noun ontology made up of 7 concepts: numbers in bold are probabilities (16), numbers in italic are similarities (17), and the underlined number is the dissimilarity between bicycle and car according to (18).
Based on a reference corpus (hereafter the Brown corpus, Kučera and Francis 1967), the probability 𝑝𝑝𝑝𝑝(𝑐𝑐𝑐𝑐) of concept 𝑐𝑐𝑐𝑐 can be estimated as the proportion of word tokens whose sense 𝐶𝐶𝐶𝐶(𝑤𝑤𝑤𝑤) is an instance of concept 𝑐𝑐𝑐𝑐. Thus, representing the number of occurrences of word 𝑤𝑤𝑤𝑤 by 𝑛𝑛𝑛𝑛(𝑤𝑤𝑤𝑤), 𝑝𝑝𝑝𝑝(𝑐𝑐𝑐𝑐): =
∑𝑤𝑤𝑤𝑤 𝑛𝑛𝑛𝑛 (𝑤𝑤𝑤𝑤) 1(𝐶𝐶𝐶𝐶(𝑤𝑤𝑤𝑤) ≤ 𝑐𝑐𝑐𝑐) ∑𝑤𝑤𝑤𝑤 𝑛𝑛𝑛𝑛 (𝑤𝑤𝑤𝑤)
(16)
Textual navigation and autocorrelation 51
Following Resnik (1999), a measure of similarity between concepts can then be defined as: 𝑠𝑠𝑠𝑠(𝑐𝑐𝑐𝑐1 , 𝑐𝑐𝑐𝑐2 ): = − log 𝑝𝑝𝑝𝑝 (𝑐𝑐𝑐𝑐1 ∨ 𝑐𝑐𝑐𝑐2 ) ≥ 0
(17)
𝐷𝐷𝐷𝐷(𝑐𝑐𝑐𝑐1 , 𝑐𝑐𝑐𝑐2 ): = 𝑠𝑠𝑠𝑠(𝑐𝑐𝑐𝑐1 , 𝑐𝑐𝑐𝑐1 ) + 𝑠𝑠𝑠𝑠(𝑐𝑐𝑐𝑐2 , 𝑐𝑐𝑐𝑐2 ) − 2𝑠𝑠𝑠𝑠(𝑐𝑐𝑐𝑐1 , 𝑐𝑐𝑐𝑐2 )
(18)
from which a squared Euclidean dissimilarity between concepts can be derived as (Bavaud et al. 2012):
For instance, based on the probabilities given in Figure 7, 𝐷𝐷𝐷𝐷(bicycle, car) = 𝑠𝑠𝑠𝑠(bicycle, bicycle) + 𝑠𝑠𝑠𝑠(car, car) − 2𝑠𝑠𝑠𝑠(bicycle, car) = − log ( 0.2) − log ( 0.4) + 2 log ( 0.7) = 1.81
According to TreeTagger (Schmid 1994), the short story The Masque of the Red Death by Edgar Allan Poe (1842) contains 497 positions occupied by nouns and 379 positions occupied by verbs. Similarities between nouns and between et al. verbs can be obtained using the WordNet::Similarity interface (Pedersen et 2004) – systematically using, in this case study, the most frequent sense of ambiguous concepts. Autocorrelation indices (for neighbourhoods of size 𝑟𝑟𝑟𝑟) calculated using the corresponding dissimilarities exhibit no noticeable pattern (Figure 8).
Fig. 8: Autocorrelation index δ[r] (same setting as Figure 3) in "The Masque of the Red Death" for the semantic dissimilarity (18) for nouns (left) and for verbs (right).
This being said, the 𝑝𝑝𝑝𝑝-dimensional coordinates 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 entering in any squared Euclidean distance 𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 =∥ 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 − 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 ∥2 can be recovered by (weighted) multidimensional scaling (MDS) (e.g. Torgeson 1958; Mardia et al. 1979), yielding orthogonal
52 François Bavaud, Christelle Cocco, Aris Xanthos factorial coordinates 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎𝛼𝛼𝛼𝛼 (for 𝛼𝛼𝛼𝛼 = 1, … , 𝑣𝑣𝑣𝑣 𝑣 1) whose low-dimensional projections express a maximum proportion of (global) inertia.
Fig. 9: Screeplot for the MDS on semantic dissimilarities for nouns (left). Factorial coordinates xaα and proportion of explained inertia for α = 1,2 (right).
Fig. 10: Autocorrelation index δ[r] for nouns in the first semantic dimension (left) and in the second semantic dimension (right).
The first semantic coordinate for nouns in The Masque of the Red Death (Figure 9) clearly contrasts abstract entities such as horror, pestilence, disease, hour, mean, night, vision, or precaution, on the left, with physical entities such as window, roof,
Textual navigation and autocorrelation 53
wall, body, victim, glass, or visage, on the right, respectively defined in WordNet as "a general concept formed by extracting common features from specific examples" and "an entity that has physical existence". Figure 10 (left) shows this first coordinate to be strongly autocorrelated, echoing long-range semantic persistence, in contrast to the second coordinate (Figure 10 right), whose interpretation is more difficult.
Fig. 11: Screeplot for the MDS on semantic dissimilarities for verbs (left). Factorial coordinates xaα and proportion of explained inertia for α = 1,2 (right).
Fig. 12: Autocorrelation index δ[r] for verbs in the first semantic dimension (left) and in the second semantic dimension (right)
54 François Bavaud, Christelle Cocco, Aris Xanthos For verbs (Figure 11), the first semantic coordinate differentiates stative verbs, such as be, seem, or sound, from all other verbs, while the second semantic coordinate differentiates the verb have from all other verbs. Figure 12 reveals that the first coordinate is strongly autocorrelated, while the second coordinate is negatively autocorrelated for neighbourhood ranges up to 2. Although the latter result is not significant for 𝛼𝛼𝛼𝛼 = 0.05 according to (6), it is likely due to the use of have as an auxiliary verb in past perfect and other compound verb tenses.
4 Conclusions In this contribution, we have introduced a unified formalism for textual autocorrelation, i.e. the tendency for neighbouring textual positions to be more (or less) similar than randomly chosen positions. This approach to sequence and text analysis is based on two primitives: (i) neighbourhoodness between textual positions, as determined by a Markov model of navigation, and formally represented by the exchange matrix 𝐸𝐸𝐸𝐸; and (ii) (dis-)similarity between positions, as encoded in the (typically squared Euclidean) dissimilarity matrix 𝐷𝐷𝐷𝐷. By varying 𝐸𝐸𝐸𝐸 and or 𝐷𝐷𝐷𝐷, the proposed formalism recovers and revisits wellknown statistical objects and concepts, such as the 𝐹𝐹𝐹𝐹-ratio, the chi-square and Correspondence Analysis. It also gives a unified account of various representations commonly used for textual data analysis, in particular the sequential and bag-of-words models, as well as the term-document matrix. It can also be extended to provide a model of hypertext navigation, where hyperlinks act as magnifying (or reducing) glasses, modifying the relative weights of documents, and altering (or not) textual autocorrelation. This approach is applicable to any form of sequence and text analysis that can be expressed in terms of dissimilarity between positions (or between types). The presented case studies have aimed at illustrating this versatility by addressing lexical, morphosyntactic, and semantic properties of texts. As shown in the latter case, squared Euclidean dissimilarities can be visualised and decomposed into factorial components by multidimensional scaling; the textual autocorrelation of each component can in turn be analysed and interpreted – yielding in particular a new means of dealing with semantically related problems.
Textual navigation and autocorrelation 55
References Anselin, Luc. 1995. Local indicators of spatial association. Geographical Analysis 27(2). 93– 115. Bavaud, François. 2013. Testing spatial autocorrelation in weighted networks: The modes permutation test. Journal of Geographical Systems 15(3). 233–247. Bavaud, François, Christelle Cocco & Aris Xanthos. 2012 Textual autocorrelation: Formalism and illustrations. In Anne Dister, Dominique Longrée & Gérald Purnelle (eds.), 11èmes journées internationales d'analyse statistique des données textuelles, 109–120. Liège : Université de Liège. Bavaud, François & Aris Xanthos. 2005. Markov associativities. Journal of Quantitative Linguistics 12(2-3). 123–137. Cliff, Andrew D. & John K. Ord. 1981. Spatial processes: Models & applications. London: Pion. Cressie, Noel A.C. 1991. Statistics for spatial data. New York: Wiley. Greenacre, Michael. 2007. Correspondence analysis in practice, 2nd edn. London: Chapman and Hall/CRC Press. Grinstead, Charles M. & J. Laurie Snell. 1998. Introduction to probability. American Mathematical Society. Kučera, Henry & W. Nelson Francis. 1967. Computational analysis of present-day American English. Providence: Brown University Press. Lebart, Ludovic. 1969. Analyse statistique de la contigüité. Publication de l'Institut de Statistiques de l'Université de Paris 18. 81–112. Le Roux, Brigitte & Henry Rouanet. 2004. Geometric data analysis. Kluwer: Dordrecht. Mardia, Kanti V., John T. Kent & John M. Bibby. 1979. Multivariate analysis. New York: Academic Press. Miller, George A., Richard Beckwith, Christiane Fellbaum, Derek Gross & Katherine Miller. 1990. WordNet: An on-line lexical database. International Journal of Lexicography 3(4). 235–244. Moran, Patrick & Alfred Pierce. 1950. Notes on continuous stochastic phenomena. Biometrika 37(1-2). 17–23. Ourednik, André. 2010. Wikitractatus. http://wikitractatus.ourednik.info/ (accessed December 2012). Page, Lawrence. 2001. Method for node ranking in a linked database. U.S. Patent No 6,285,999. Pedersen, Ted, Siddharth Patwardhan & Jason Michelizzi. 2004. WordNet:Similarity – Measuring the relatedness of concepts. In Susan Dumais, Daniel Marcu & Salim Roukos (eds.), Proceedings of HLT-NAACL 2004: Demonstration Papers, 38–41. Boston: Association for Computational Linguistics. Resnik, Philip. 1999. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research 11. 95–130. Schmid, Helmut. 1994. Probabilistic part-of-speech tagging using decision trees. Proceedings of international conference on new methods in language processing. Manchester, UK. Torgeson, Warren S. 1958. Theory and methods of scaling. New York: Wiley.
Martina Benešová and Radek Čech
Menzerath-Altmann law versus random model 1 Introduction The Menzerath-Altmann law belongs among the in linguistics most used and the best corroborated empirically elaborated linguistic laws. Firstly, it was enunciated by Menzerath (1928), later transformed into its mathematical formula by Altmann (1980), and recently it has been proved by the labour of Hřebíček (1995), Andres (2010) and Andres et al. (2012) to be closely related with the fractal quality which can be seen in texts. On the basis of already executed experiments when we tried to examine the fractality of the text we observed (Andres – Benešová 2011, Benešová 2011) that the gained results differ significantly depending on the manner in which the units of the sequence segmentation had been chosen and set; for example, the word was grasped graphically as a sequence of graphemes in between two blanks in a sentence, or it was regarded as a sound unit, semantic unit and so on. Importantly, although all used manners of segmentation are substantiated linguistically, the differences of results (e.g., the value of b parameter in formula (1)) are so striking that an analysis of the relationship between the segmentation and the character of the model representing the Menzerath-Altmann law is strongly needed, in our opinion. As the first step, we decided to scrutinize this relationship by using a random model of data building based on an original text sample. Surprisingly enough, despite the fact that testing models representing real systems of any kind by random models is quite usual in empirical science, the Menzerath-Altmann law has been tested by the random model only once in linguistics (Hřebíček, 2007), to our knowledge1. However, this test lacks appropriate amount of experiments to be supported. In sum, we pursue two aims. Firstly, the validity (or non-validity) of the Menzerath-Altmann law in the text segmented randomly will be tested; secondly, because three random models are used (cf. Section 3) the character of results of all models with regard to the real text will be analyzed.
1 In biology, Baixeries et al. (2012) tested the Menzerath-Altmann law by a random model.
58 Martina Benešová and Radek Čech
2 The Menzerath-Altmann law The Menzerath-Altmann law (MAL) enunciates that “the longer the language construct is, the shorter its constituents are”. The construct is considered to be a unit of language (for the purpose of our experiment we chose the sentence) which is constituted by its constituents (here the clauses); the constituent has to be measurable in terms of the units on the immediately lower linguistic level which constitute it (here the words). Specifically, the MAL applied in our experiment predicts that the longer the sentence is, the shorter its clauses are; the length of the clause is computed in the number of words. It should be emphasized that the MAL model is stochastic. The MAL can be mathematically formulated in a truncated formula of a power model as follows: 𝑦𝑦𝑦𝑦 = 𝐴𝐴𝐴𝐴 ∙ 𝑥𝑥𝑥𝑥 −𝑏𝑏𝑏𝑏
where x is the length of the construct measured in the number of its constituents (in our experiments it is the length of sentences measured in the number of their clauses; x∈N), y is the average length of the constituents measured in the units on the immediately lower language level, i.e. in the constituent’s constituents (in our experiment it is the average length of clauses measured in the number of words they are constituted of; y∈Q), A,b are positive real parameters. Graphically, parameter A determines how far from the x-axis the graph representing the particular MAL realization is positioned. However, parameter b is responsible firstly for the steepness of the curve, and secondly, more essentially, for the rising or falling tendency of the curve; when b>0 or b0 supplied with the respective coefficients of determination and confidence intervals. Model
random sample
the coefficient of determination R2
1
3
0.4340
2
2
0.5196
2
3
0.1500
2
4
0.4771
2
5
0.0050
3
4
0.1268
confidence interval 〈−0.0521; 0.0758〉 〈−0.0048; 0.0961〉 〈−0.1093; 0.1355〉 〈−0.0043; 0.0468〉 〈−0.0514; 0.0536〉 〈−0.0216; 0.0430〉
6 Conclusion The results of the experiment reveal that the data under examination generated by random models does not fulfil the MAL. Consequently, the results can be viewed as another argument supporting the assumption considering that the MAL expresses one of important mechanisms controlling human language behavior. Secondly, we wanted to explore which method of random modelling can construct the best data in terms of the MAL. We came to the finding that the biggest number of mathematical models showing the closest properties to the original text sample in terms of the MAL is designed by means of M2; this result is not surprising so much because M2 shares more characteristics of the original text than the other two random models.
66 Martina Benešová and Radek Čech
Acknowledgments Martina Benešová’s and Radek Čech's contribution is supported by the project CZ.1.07/2.3.00/30.0004 POSTUP and the project Linguistic and lexicostatistic analysis in cooperation of linguistics, mathematics, biology and psychology, grant no. CZ.1.07/2.3.00/20.0161, which is financed by the European Social Fund and the National Budget of the Czech Republic, respectively.
References

Altmann, Gabriel. 1980. Prolegomena to Menzerath’s law. In Rüdiger Grotjahn (ed.), Glottometrika 2, 1–10. Bochum: Brockmeyer.
Andres, Jan. 2010. On a conjecture about the fractal structure of language. Journal of Quantitative Linguistics 17(2). 101–122.
Andres, Jan & Martina Benešová. 2011. Fractal analysis of Poe’s Raven. Glottometrics 21. 73–100.
Andres, Jan, Martina Benešová, Lubomír Kubáček & Jana Vrbková. 2012. Methodological note on the fractal analysis of texts. Journal of Quantitative Linguistics 19(1). 1–31.
Baixeries, Jaume, Antoni Hernández-Fernández & Ramon Ferrer-i-Cancho. 2012. Random models of Menzerath-Altmann law in genomes. BioSystems 107(3). 167–173.
Barabási, Albert-László & Réka Albert. 1999. Emergence of scaling in random networks. Science 286(5439). 509–512.
Benešová, Martina. 2011. Kvantitativní analýza textu se zvláštním zřetelem k analýze fraktální [Quantitative analysis of text with special respect to fractal analysis]. Olomouc: Palacký University dissertation.
Cohen, Avner, Rosario N. Mantegna & Shlomo Havlin. 1997. Numerical analysis of word frequencies in artificial and natural language texts. Fractals 5(1). 95–104.
Cramer, Irene M. 2005. The parameters of the Altmann-Menzerath law. Journal of Quantitative Linguistics 12(1). 41–52.
Ferrer i Cancho, Ramon. 2010. Network theory. In Patrick Colm Hogan (ed.), The Cambridge encyclopaedia of the language sciences, 555–557. Cambridge: Cambridge University Press.
Ferrer i Cancho, Ramon & Ricard V. Solé. 2002. Zipf’s law and random texts. Advances in Complex Systems 5(1). 1–6.
Ferrer i Cancho, Ramon & Brita Elvevåg. 2010. Random texts do not exhibit the real Zipf’s law-like rank distribution. PLoS ONE 5(3). e9411.
Hřebíček, Luděk. 1995. Text levels. Language constructs, constituents and the Menzerath-Altmann law. Trier: WVT.
Hřebíček, Luděk. 2007. Text in semantics. The principle of compositeness. Praha: Oriental Institute of the Academy of Sciences of the Czech Republic.
Kelih, Emmerich. 2010. Parameter interpretation of the Menzerath law: Evidence from Serbian. In Peter Grzybek, Emmerich Kelih & Ján Mačutek (eds.), Text and language. Structures, functions, interrelations, quantitative perspectives, 71–79. Wien: Praesens.
Li, Wentian. 1992. Random texts exhibit Zipf’s-law-like word frequency distribution. IEEE Transactions on Information Theory 38(6). 1842–1845.
Mandelbrot, Benoit. 1953. An informational theory of the statistical structure of language. In Willis Jackson (ed.), Communication theory, 486–504. London: Butterworths.
Menzerath, Paul. 1928. Über einige phonetische Probleme. In Actes du premier congrès international de linguistes, 104–105. Leiden: Sijthoff.
Miller, George A. 1957. Some effects of intermittent silence. The American Journal of Psychology 70(2). 311–314.
Miller, George A. & Noam Chomsky. 1963. Finitary models of language users. In R. Duncan Luce, Robert R. Bush & Eugene Galanter (eds.), Handbook of mathematical psychology, 419–491. New York: Wiley.
Mitzenmacher, Michael. 2003. A brief history of generative models for power law and lognormal distributions. Internet Mathematics 1(2). 226–251.
Appendix

Table 4: Model 1 – random samples 1–5: x the length of sentences (in clauses) – z their frequency – y the mean length of clauses (in words).

x   z     y (sample 1)   y (sample 2)   y (sample 3)   y (sample 4)   y (sample 5)
1   408    9.9975         9.8701         9.8529         9.1152         9.0613
2   260    9.5000         9.6115         9.5135         9.1885         9.6077
3   176    9.3977         9.3598         9.7973         9.5663         9.5000
4    94   10.2154        10.2872         9.4601         9.5319         9.7261
5    57    9.5754         9.5193         8.8632         9.7649         9.0526
6    20   10.0417        10.2750         9.6500        10.0833         9.3417
7    14   10.0408        10.0918         9.9592         9.1531         7.5918
Table 5: Model 2 – random samples 1–5: x the length of sentences (in clauses) – z their frequency – y the mean length of clauses (in words).

x   z     y (sample 1)   y (sample 2)   y (sample 3)   y (sample 4)   y (sample 5)
1   408    9.3676        10.1838         9.4853         9.7206         9.2696
2   260    9.7327         9.7462         9.4635         9.5808         9.8904
3   176    9.6705         9.2405        10.1477         9.4678         9.5473
4    94    9.9548         9.6649         9.2872         9.6702         9.7128
5    57    8.8035         8.9298         9.5228         9.6035         9.4281
6    20    9.1000         9.6250         8.1917         9.3833         9.0333
7    14   10.2959         9.2449        10.2449         9.1429         9.7245
Table 6: Model 3 – random samples 1–5: x the length of sentences (in clauses) – z their frequency – y the mean length of clauses (in words).

x   z     y (sample 1)   y (sample 2)   y (sample 3)   y (sample 4)   y (sample 5)
1   408    9.5490         9.7010         9.5588         9.6471         9.4314
2   260    9.4923         9.4038         9.5731         9.7308         9.8212
3   176    9.5303         9.4091         9.6174         9.5341         9.3277
4    94    9.4229         9.7473         9.4309         9.2739         9.5239
5    57    9.8000         9.6175         9.5404         9.6175         9.5754
6    20    9.9167        10.0083         9.6750         9.7667         9.6583
7    14    9.7143         9.3776         9.7959         9.3061        10.1327
Radek Čech
Text length and the lambda frequency structure of a text

1 Frequency structure of a text

The frequency structure of a text is usually considered to be one of the important aspects for the evaluation of language usage, and at first sight it does not seem to be a very problematic concept. The type-token ratio or several other methods for the measurement of vocabulary “richness” are usually used for analyses of this kind (e.g. Weizman 1971; Tešitelová 1972; Ratkowsky et al. 1980; Hess et al. 1986, 1989; Richards 1987; Tuldava 1995; Wimmer, Altmann 1999; Müller 2002; Wimmer 2005; Popescu 2007; Popescu et al. 2009; Martynenko 2010; Covington, McFall 2010). However, all approaches face the problem of the undesirable impact of text length. Obviously, if one’s aim is to compare two or more texts with regard to their frequency structure (e.g. for a comparison of authors or genres), one has to somehow eliminate the role of text length. The simplest method is to use only a part of the text (e.g. the first 100 or 1000 words) for the comparison. However, this method neglects the “integrity” of the text; for example, if the author has the intention of writing a longer text, (s)he can consciously use many new words at the end of the text, which can radically change the frequency structure. Consequently, cutting the text cannot be considered a suitable method for the measurement of frequency structure, and only entire texts should be analyzed.

Another way to eliminate the impact of text length consists in some kind of normalization. One of the most recent attempts at this approach was proposed by Popescu, Mačutek, Altmann (2010), and in more detail by Popescu, Čech, Altmann (2011). The authors introduced a method based on the properties of the arc length, defined as

L = \sum_{r=1}^{V-1} \sqrt{(f_r - f_{r+1})^2 + 1}     (1)

where f_r are the ordered absolute word frequencies (r = 1, 2, ..., V − 1) and V is the highest rank (= vocabulary size), i.e. L is the sum of the Euclidean distances between ranked frequencies. The frequency structure is measured by the so-called lambda indicator
\Lambda = \frac{L \cdot \log_{10} N}{N}     (2)
The authors state that this indicator successfully eliminates the impact of text length, meaning that it can be used for a comparison of genre, authorship, authorial development, language typology and so on; an analysis of 1185 texts in 35 languages appears to prove the independence between lambda and text length (cf. Popescu, Čech, Altmann 2011, p. 10 ff.). However, if particular languages are analyzed separately, a dependence of lambda on text length emerges, as presented in this study. Moreover, this study reveals that the relationship between lambda and text length is not straightforward, unlike the relationship between text length and type-token ratio. Further, the study presents a method for the empirical determination of the interval in which lambda should be independent of text length; within this interval lambda could be used for its original purpose, i.e. comparison of genre, authorship, authorial development etc. (cf. Popescu et al. 2011).
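For illustration, the following minimal sketch (not the implementation used in this study) computes the arc length L of formula (1) and the indicator Λ of formula (2) for a text given as a plain list of word tokens; the toy token list is invented.

```python
from collections import Counter
from math import sqrt, log10

def lambda_indicator(tokens):
    """Compute the arc length L (formula 1) and the lambda indicator
    (formula 2) from a list of word tokens."""
    freqs = sorted(Counter(tokens).values(), reverse=True)  # f_1 >= f_2 >= ...
    N = len(tokens)                                          # text length
    L = sum(sqrt((freqs[r] - freqs[r + 1]) ** 2 + 1)
            for r in range(len(freqs) - 1))
    return L, L * log10(N) / N

# Toy example (an invented sentence, not taken from the analyzed novels):
tokens = "the dog saw the cat and the cat saw the dog".split()
L, lam = lambda_indicator(tokens)
print(L, lam)
```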
2 The analysis of an individual novel

The analysis of the relationship between lambda and text length within an individual novel has the following advantages: first, a maximum of boundary conditions (e.g. authorship, genre, year) which can influence the relationship between the observed characteristics is eliminated; second, the novel can be viewed to a certain extent as a homogeneous entity with regard to its theme and style. Consequently, one can expect that the novel represents one of the best types of material for the observation of the relationship between frequency characteristics and text length. Since the majority of novels are divided into chapters, the length of the chapter and lambda can be taken as parameters. However, for an analysis of this kind one has to use a very long novel with many chapters of different lengths. This analysis uses the Czech novel The Fateful Adventures of the Good Soldier Švejk During the World War, written by Jaroslav Hašek, and Oliver Twist, written by Charles Dickens. The novels are very long (N(Hašek) = 199,120; N(Dickens) = 159,017) and contain 27 (for Hašek) and 53 (for Dickens) chapters of different lengths; the length of the chapters lies in the interval N(Hašek) ∈ , N(Dickens) ∈ . Figure 1 and Figure 2, and Table 1 and Table 2, present the relationships between the lambda of individual chapters and their length. Obviously, lambda
decreases depending on text length. Moreover, if one observes the relationship between lambda and cumulative N (cumulative values of N are added by chapter), the tendency of decreasing lambda with regard to the text length is even more evident, as is shown in Figure 3 and Figure 4 and Table 3 and Table 4.
Fig. 1: The relationship between lambda of individual chapters and their length in the novel The Fateful Adventures of the Good Soldier Švejk During the World War written by Jaroslav Hašek.
Table 1: The length and lambda of individual chapters in the novel The Fateful Adventures of the Good Soldier Švejk During the World War written by Jaroslav Hašek.

Chapter   N       Lambda    Chapter   N       Lambda    Chapter   N       Lambda
1          2810   1.6264    10         6191   1.6246    19         7092   1.7107
2          2229   1.7761    11         2530   1.8599    20        12183   1.4956
3          1725   1.8092    12         2007   1.7584    21        16426   1.5733
4          1289   1.7880    13         3914   1.7578    22        15195   1.5524
5          2103   1.8084    14        12045   1.6473    23        14986   1.6202
6          2881   1.7484    15         3342   1.7310    24        14852   1.5521
7          1458   1.8587    16         5500   1.6894    25         7504   1.7106
8          4610   1.7812    17        18550   1.5772    26         2734   1.8112
9          6095   1.6935    18        17009   1.5685    27        11860   1.6319
Fig. 2: The relationship between lambda of individual chapters and their length in the novel Oliver Twist written by Charles Dickens.
Table 2: The length and lambda of individual chapters in the novel Oliver Twist written by Charles Dickens.

Chapter   N      Lambda    Chapter   N      Lambda    Chapter   N      Lambda
1         1124   1.5275    19        3423   1.2748    37        3617   1.3137
2         3976   1.2629    20        3000   1.2772    38        3601   1.2882
3         3105   1.3246    21        2197   1.4481    39        5276   1.2553
4         2589   1.3688    22        2493   1.3979    40        2577   1.1221
5         4073   1.3375    23        2740   1.3156    41        3595   1.2040
6         1747   1.4540    24        1998   1.4026    42        3720   1.1990
7         2341   1.3413    25        2259   1.3994    43        3784   1.2151
8         3282   1.3365    26        4526   1.3238    44        2383   1.2170
9         2348   1.2879    27        2518   1.3819    45        1228   1.3474
10        1818   1.4883    28        3438   1.3369    46        3588   1.2435
11        2627   1.3294    29        1394   1.4498    47        2530   1.2529
12        3433   1.3256    30        2360   1.3538    48        3488   1.4025
13        2841   1.4036    31        3989   1.1882    49        3579   1.2004
14        4002   1.1101    32        3377   1.2650    50        4275   1.3574
15        2378   1.3504    33        3295   1.2828    51        4858   1.1045
16        3538   1.3210    34        3729   1.2423    52        3322   1.2684
17        3232   1.2858    35        2823   1.2438    53        1562   1.5089
18        3013   1.4018    36         998   1.4436
Fig. 3: The relationship between lambda of individual chapters and cumulative length in the novel The Fateful Adventures of the Good Soldier Švejk During the World War written by Jaroslav Hašek.
Table 3: The cumulative length and lambda of individual chapters in the novel The Fateful Adventures of the Good Soldier Švejk During the World War written by Jaroslav Hašek.

Chapter   N (cum)   Lambda    Chapter   N (cum)   Lambda    Chapter   N (cum)   Lambda
1           2810    1.6264    1-10       31391    1.4267    1-19      103380    1.2208
1-2         5039    1.5838    1-11       33921    1.4251    1-20      115563    1.1878
1-3         6764    1.5727    1-12       35928    1.4140    1-21      131989    1.1609
1-4         8053    1.5668    1-13       39842    1.3966    1-22      147184    1.1350
1-5        10156    1.5408    1-14       51887    1.3557    1-23      162170    1.1178
1-6        13037    1.5109    1-15       55229    1.3378    1-24      177022    1.0975
1-7        14495    1.5170    1-16       60729    1.3174    1-25      184526    1.0887
1-8        19105    1.5063    1-17       79279    1.2731    1-26      187260    1.0863
1-9        25200    1.4744    1-18       96288    1.2344    1-27      199120    1.0718
Fig. 4: The relationship between lambda of individual chapters and cumulative length in the novel Oliver Twist written by Charles Dickens.
Table 4: The cumulative length and lambda of individual chapters in the novel Oliver Twist written by Charles Dickens.

Chapter   N (cum.)   Lambda    Chapter   N (cum.)   Lambda    Chapter   N (cum.)   Lambda
1            1134    1.5321    1-19        54900    0.8236    1-37       105651    0.7148
1-2          5111    1.2489    1-20        57900    0.8104    1-38       109252    0.7099
1-3          8216    1.1712    1-21        60097    0.8090    1-39       109252    0.7099
1-4         10804    1.1287    1-22        62590    0.8014    1-40       117105    0.6960
1-5         14877    1.0882    1-23        65330    0.7943    1-41       120700    0.6880
1-6         16624    1.0703    1-24        67328    0.7909    1-42       124420    0.6816
1-7         18965    1.0327    1-25        69587    0.7851    1-43       128204    0.6765
1-8         22247    1.0064    1-26        74113    0.7784    1-44       130587    0.6725
1-9         24595    0.9805    1-27        76631    0.7732    1-45       131815    0.6706
1-10        26413    0.9758    1-28        80069    0.7663    1-46       135403    0.6675
1-11        29040    0.9619    1-29        81463    0.7639    1-47       137933    0.6633
1-12        32473    0.9378    1-30        83823    0.7581    1-48       141421    0.6628
1-13        35314    0.9258    1-31        87812    0.7490    1-49       145000    0.6581
1-14        39316    0.8888    1-32        91189    0.7436    1-50       149275    0.6576
1-15        41694    0.8740    1-33        94484    0.7370    1-51       154133    0.6504
1-16        45232    0.8566    1-34        98213    0.7285    1-52       157455    0.6472
1-17        48464    0.8432    1-35       101036    0.7216    1-53       159017    0.6465
1-18        51477    0.8352    1-36       102034    0.7193
3 Analysis of individual languages (Czech and English)

The findings presented in Section 2 seriously undermine the assumption that lambda is independent of text length, and they open up questions about the relationship between these two text parameters (i.e. N and Λ) in general. For a more detailed analysis, texts of different lengths in the same language were used and the relationship between N and Λ was observed. For Czech, I analyzed 610 texts whose length lies in the interval N ∈ , and for English, 218 texts whose length lies in the interval N ∈ . The results are presented in Figure 5 and Figure 6. In both cases lambda first increases along with N, and then decreases in the form of a concave bow.

The question is whether there is some interval of N in which lambda is independent of text length. Theoretically, one can assume that in a very short text the author cannot control the frequency structure because of the lack of “space”. In other words, the author cannot consciously manage the proportion of particular words in short texts because the length of the text simply does not make it possible. With increasing N the possibilities to control the frequency structure increase (for instance, one can try to avoid repetition of words). On the other hand, there should probably be some maximum of N beyond which the capability of the author to influence the frequency structure diminishes (which could be caused by some limits of human mental ability) and the frequency structure is ruled by a self-regulating mechanism; Popescu et al. (2012, pp. 126–127) describe this mechanism as follows:

The longer the text, the more the writer loses his subconscious control over some proportions and keeps only the conscious control over contents, grammar, his aim, etc. But as soon as parts of control disappear, the text develops its own dynamics and begins to abide by some laws which are not known to the writer but work steadily in the background. The process is analogous to that in physics: if we walk, we consider our activity as something normal; but if we stumble, i.e. lose the control, gravitation manifests its presence and we fall. That means, gravitation does not work ad hoc in order to worry us maliciously, but it is always present, even if we do not realize it consciously. In writing, laws are present, too, and they work at a level which is only partially accessible. One can overcome their working, but one cannot eliminate them. On the other hand, if the writer slowly loses his control of frequency structuring, a new order begins to arise by self-organization or by some not perceivable background mechanism.
Furthermore, a long text (such as a novel) is written with many breaks, which also leads to the author’s loss of control of frequency characteristics. In other words, if one interrupts the process of writing (for any reason, e.g. sleeping, eating, traveling), one’s mental state is probably changed; even if the author repeatedly reads the text, this change influences the character of frequency structuring. In short, I would like to emphasize the difference between homogeneous text types (e.g. a personal letter, e-mail, poem, or short story), which are written relatively continuously, on the one hand, and long texts written under different circumstances, on the other.
Fig. 5: The relationship between lambda and text length in Czech (610 texts).
Fig. 6: The relationship between lambda and text length in English (218 texts).
In sum, it can be assumed that in the case of texts that are too short or too long, the author cannot control the frequency structure. So, the task is to find the
interval in which lambda is independent of N (if it exists at all). It can be supposed that the differences among the lambda values of individual texts in this interval should be caused by pragmatic factors, such as authorship, genre and so on (cf. Popescu et al. 2011). The next section presents a method for deriving the interval empirically.
4 The interval of the relative independence of lambda on text length

For the empirical determination of the interval in which lambda should be independent of text length, the following method was used. First, the data were pooled into intervals and for each interval the mean lambda was computed (see Tables 5 and 6).

Table 5: The mean lambdas of particular intervals of N in Czech; 615 texts were used for the analysis.

Interval   n     mean(Λ)   s²(Λ)      s²(Λ)/n
           113   1.5337    0.023246   0.000206
           140   1.6950    0.018871   0.000135
            39   1.7999    0.012658   0.000325
            28   1.7970    0.011724   0.000419
            27   1.8602    0.009719   0.000360
            69   1.8543    0.012765   0.000185
            47   1.8351    0.015120   0.000322
            35   1.8180    0.024705   0.000706
            26   1.8210    0.027314   0.001051
            20   1.7184    0.013041   0.000652
            18   1.7127    0.026641   0.001480
            13   1.7142    0.028925   0.002225
            17   1.5250    0.020573   0.001210
            20   1.2233    0.014912   0.000746
Table 6: The mean lambdas of particular intervals of N in English; 218 texts were used for the analysis.

Interval   n    mean(Λ)   s²(Λ)      s²(Λ)/n
           40   1.2107    0.036552   0.000914
           15   1.4911    0.021244   0.001416
           12   1.4889    0.028356   0.002363
           14   1.4659    0.011208   0.000801
            9   1.4437    0.017859   0.001984
           12   1.4174    0.026552   0.002213
           18   1.3281    0.019375   0.001076
           14   1.3088    0.007136   0.000510
           29   1.3058    0.008702   0.000300
           14   1.2821    0.014297   0.001021
           16   1.2238    0.020234   0.001265
           11   1.0022    0.009905   0.000900
            8   0.8297    0.019885   0.002486
            6   0.5432    0.006244   0.001041
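The pooling step can be sketched as follows; the interval boundaries in the example call are invented, since they are not reproduced in Tables 5 and 6.

```python
import numpy as np

def pool_by_length(N_values, lambdas, edges):
    """Pool (N, lambda) pairs into text-length intervals and report, per
    interval, the number of texts n, mean(lambda), s^2(lambda) and
    s^2(lambda)/n, mirroring the layout of Tables 5 and 6."""
    N_values, lambdas = np.asarray(N_values), np.asarray(lambdas)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        lam = lambdas[(N_values >= lo) & (N_values < hi)]
        if lam.size > 1:
            var = lam.var(ddof=1)
            rows.append((f"{lo}-{hi}", lam.size, lam.mean(), var, var / lam.size))
    return rows

# Illustrative interval boundaries (hypothetical, not the ones used in the study):
edges = [0, 100, 200, 400, 800, 1600, 3200, 1_000_000]
# rows = pool_by_length(N_values, lambdas, edges)
```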
Then, the differences between the lambdas of subsequent intervals were tested by the asymptotic u-test

u = \frac{\Lambda_1 - \Lambda_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}     (3)
Specifically, for the comparison of the first and second interval from Table 5 we obtain

u = \frac{1.6950 - 1.5337}{\sqrt{0.000206 + 0.000135}} = 8.73
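As a check, this computation can be reproduced directly from the values in Table 5 (a sketch, not the original script):

```python
from math import sqrt

def u_test(mean1, var_over_n1, mean2, var_over_n2):
    """Asymptotic u-statistic for the difference of two interval means,
    formula (3); var_over_n is s^2(lambda)/n as listed in Tables 5 and 6."""
    return abs(mean2 - mean1) / sqrt(var_over_n1 + var_over_n2)

# First two intervals of the Czech data (Table 5):
u = u_test(1.5337, 0.000206, 1.6950, 0.000135)
print(round(u, 2))   # 8.73, as in the worked example above
```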
Because multiple u-tests on subsequent intervals are performed (which inflates the probability of a Type I error), the Bonferroni correction is used for an appropriate adjustment (cf. Miller 1981). Specifically, the critical value for a rejection of the null hypothesis is determined from

p_i = \frac{\alpha/2}{n}     (4)

where α is the significance level and n is the number of performed tests. For the significance level α = 0.05 and n = 13 (we perform 13 tests, cf. Tables 7 and 8) we get u_corrected = 2.89 as the critical value. Thus we can state that there is a significant difference between the first and the second interval (cf. Table 5) at the significance level α = 0.05. All results are presented in Tables 7 and 8 and graphically in Figures 7 and 8, where subsequent intervals with non-significant differences are linked.

Table 7: The differences between subsequent intervals in Czech. The values of significant differences (at the significance level α = 0.05, u_corrected = 2.89) are boldfaced.
Interval   mean(Λ)   u      Interval   mean(Λ)
           1.5337    8.73              1.6950
           1.6950    4.89              1.7999
           1.7999    0.11              1.7970
           1.7970    2.26              1.8602
           1.8602    0.25              1.8543
           1.8543    0.85              1.8351
           1.8351    0.53              1.8180
           1.8180    0.07              1.8210
           1.8210    2.49              1.7184
           1.7184    0.12              1.7127
           1.7127    0.02              1.7142
           1.7142    3.23              1.5250
           1.5250    6.82              1.2233
Table 8: The differences between subsequent intervals in English. The values of significant differences (at the significance level α = 0.05, u_corrected = 2.89) are boldfaced.

Interval   mean(Λ)   u      Interval   mean(Λ)
           1.2107    5.81              1.4911
           1.4911    0.04              1.4889
           1.4889    0.41              1.4659
           1.4659    0.42              1.4437
           1.4437    0.41              1.4174
           1.4174    1.56              1.3281
           1.3281    0.48              1.3088
           1.3088    0.11              1.3058
           1.3058    0.65              1.2821
           1.2821    1.22              1.2238
           1.2238    4.76              1.0022
           1.0022    2.96              0.8297
           0.8297    4.82              0.5432
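The critical value 2.89 used in Tables 7 and 8 follows from formula (4); a small sketch of its computation with scipy's normal quantile function:

```python
from scipy.stats import norm

alpha, n_tests = 0.05, 13
# Bonferroni-adjusted one-tail probability, formula (4): (alpha/2) / n
p_i = (alpha / 2) / n_tests
u_critical = norm.ppf(1 - p_i)
print(round(u_critical, 2))   # approximately 2.89
```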
Fig. 7: The differences between subsequent intervals in Czech. Subsequent intervals with nonsignificant differences are connected by the lines. The line does not mean that all the intervals inside the line are not different; the line expresses non-significant differences only between pairs of subsequent intervals.
As is seen in Figures 7 and 8, there is a relatively wide interval (the black points) in which no significant differences between particular subsequent intervals appear; for Czech N ∈ and for English N ∈ . As for the lower endpoint of these intervals, it can be assumed that it is related to the synthetic/analytic character of the two languages. Specifically, the more analytic the language, the lower the endpoint of this interval, and vice versa. As for the upper endpoint, it can be assumed that it should be relatively independent of the type of the language, because this endpoint should be a consequence of human mental ability (cf. Section 3). The difference between this endpoint in Czech (N = 6500) and English (N = 7000) is rather due to the different character of the samples (cf. Tables 5 and 6). Obviously, only further research based on an analysis of more texts and more languages could bring a more reliable result.
Fig. 8: The differences between subsequent intervals in English. Subsequent intervals with non-significant differences are connected by the lines. The line does not mean that all the intervals inside the line are not different; the line expresses non-significant differences only between pairs of subsequent intervals.
5 Maximum of lambda (theoretical) and empirical findings

Another way to observe the properties of the relationship between the text length and lambda consists in a comparison of the maximum theoretical value of lambda and the empirical findings. Because lambda is based on the property of the arc length L (cf. Section 1), it is first necessary to derive the theoretical maximum of the arc length, which is given as

L_{max} = V - 1 + f(1) - 1 = N - 1     (5)

where V is the number of word types (i.e. the maximum rank) and f(1) is the maximum frequency. The theoretical maximum of lambda is

\Lambda_{max} = \frac{L_{max}}{N}\log(N) = \frac{N-1}{N}\log(N) = \left(1 - \frac{1}{N}\right)\log(N)     (6)

practically N ≫ 1, hence

\Lambda_{max} = \log(N)     (7)
The comparison of the maximum theoretical value of lambda and the empirical findings is presented in Figures 9 and 10. It is clear that shorter texts are closer to the maximum than longer ones. Moreover, with increasing text length the difference between Λmax and Λ grows. Obviously, this result reveals that with increasing text length the language user's ability to control the frequency structure of the text decreases, and consequently it is governed by some self-regulating mechanism (cf. Popescu et al. 2012).
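A small sketch of the comparison behind Figures 9 and 10, computing the exact and the approximate Λmax of formulas (6) and (7); the text lengths below are merely illustrative, and the base-10 logarithm of formula (2) is assumed:

```python
from math import log10

def lambda_max_exact(N):
    # Formula (6): (1 - 1/N) * log(N), with log10 as in formula (2).
    return (1 - 1 / N) * log10(N)

def lambda_max_approx(N):
    # Formula (7): for N >> 1 the factor (1 - 1/N) is negligible.
    return log10(N)

for N in (100, 1_000, 10_000, 200_000):
    print(N, round(lambda_max_exact(N), 4), round(lambda_max_approx(N), 4))
```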
Fig. 9: Comparison of the maximum theoretical value of lambda (line) and the empirical findings (dots) in Czech.
Fig. 10: Comparison of the maximum theoretical value of lambda (line) and the empirical findings (dots) in English.
6 Conclusion

The study revealed the evident dependence of the lambda indicator on text length. Consequently, this finding undermines the main methodological advantage of the lambda measurement and casts doubt upon many of the results presented by Popescu et al. (2011). However, the specific relationship between lambda and text length, which is expressed graphically by a concave bow (cf. Section 3), allows us to determine the interval of text length in which the lambda measurement is meaningful and can be used for its original purposes. Moreover, this determination is connected to theoretical reasoning which may be enhanced by psycholinguistic or cognitive explanations. To summarize, the lambda indicator can on the one hand be ranked among all previous attempts which have tried unsuccessfully to eliminate the impact of text length; on the other hand, however, its specificity means that the method offers a potential use in comparisons of texts. Of course, only further analyses can reveal the potential meaningfulness or meaninglessness of this method.
Acknowledgments This study was supported by the project Linguistic and lexicostatistic analysis in cooperation of linguistics, mathematics, biology and psychology, grant no. CZ.1.07/2.3.00/20.0161, which is financed by the European Social Fund and the National Budget of the Czech Republic.
References

Covington, Michael A. & Joe D. McFall. 2010. Cutting the Gordian Knot: The Moving Average Type-Token Ratio (MATTR). Journal of Quantitative Linguistics 17(2). 94–100.
Hess, Carla W., Karen M. Sefton & Richard G. Landry. 1986. Sample size and type-token ratios for oral language of preschool children. Journal of Speech and Hearing Research 29. 129–134.
Hess, Carla W., Karen M. Sefton & Richard G. Landry. 1989. The reliability of type-token ratios for the oral language of school age children. Journal of Speech and Hearing Research 32(3). 536–540.
Martynenko, Gregory. 2010. Measuring lexical richness and its harmony. In Peter Grzybek, Emmerich Kelih & Ján Mačutek (eds.), Text and language. Structures, functions, interrelations, 125–132. Wien: Praesens Verlag.
Miller, Rupert G. 1981. Simultaneous statistical inference. Berlin, Heidelberg: Springer.
Müller, Dieter. 2002. Computing the type-token relation from the a priori distribution of types. Journal of Quantitative Linguistics 9(3). 193–214.
Popescu, Ioan-Iovitz. 2007. Text ranking by the weight of highly frequent words. In Peter Grzybek & Reinhard Köhler (eds.), Exact methods in the study of language and text, 555–565. Berlin, New York: de Gruyter.
Popescu, Ioan-Iovitz, Gabriel Altmann, Peter Grzybek, Bijapur Dayaloo Jayaram, Reinhard Köhler, Viktor Krupa, Ján Mačutek, Regina Pustet, Ludmila Uhlířová & Matumnal N. Vidya. 2009. Word frequency studies. Berlin, New York: Mouton de Gruyter.
Popescu, Ioan-Iovitz, Ján Mačutek & Gabriel Altmann. 2010. Word forms, style and typology. Glottotheory 3. 89–96.
Popescu, Ioan-Iovitz, Radek Čech & Gabriel Altmann. 2011. The lambda-structure of texts. Lüdenscheid: RAM-Verlag.
Popescu, Ioan-Iovitz, Radek Čech & Gabriel Altmann. 2012. Some geometric properties of Slovak poetry. Journal of Quantitative Linguistics 19(2). 121–131.
Ratkowsky, David A., Maurice H. Halstead & Linda Hantrais. 1980. Measuring vocabulary richness in literary works. A new proposal and a re-assessment of some earlier measures. Glottometrika 2. 125–147.
Richards, Brian. 1987. Type/token ratios: what do they really tell us? Journal of Child Language 14(2). 201–209.
Tešitelová, Marie. 1972. On the so-called vocabulary richness. Prague Studies in Mathematical Linguistics 3. 103–120.
Tuldava, Juhan. 1995. On the relation between text length and vocabulary size. In Juhan Tuldava (ed.), Methods in quantitative linguistics, 131–150. Trier: WVT.
Weizman, Michael. 1971. How useful is the logarithmic type-token ratio? Journal of Linguistics 7(2). 237–243.
Wimmer, Gejza. 2005. Type-token relation. In Reinhard Köhler, Gabriel Altmann & Rajmund G. Piotrowski (eds.), Handbook of quantitative linguistics, 361–368. Berlin: de Gruyter.
Wimmer, Gejza & Gabriel Altmann. 1999. On vocabulary richness. Journal of Quantitative Linguistics 6(1). 1–9.
Reinhard Köhler
Linguistic Motifs

1 Introduction

Quantitative linguistics has been concerned with units, properties, and their relations mostly in a way in which the syntagmatic resp. sequential behaviour of the objects under study was ignored. The mathematical means and models employed reflect, in their majority, mass phenomena treated as samples taken from some populations – even if texts and corpora do not possess the statistical properties which are needed for many of the common methods. Nevertheless, with some caution, good and valid results can be obtained using probability distributions, functions, differential and difference equations, etc. The present volume gives an overview of alternative methods which can be applied if the sequential structure of linguistic expressions, in general texts, is in the focus of an investigation.

Here, a recently presented new unit will be introduced in order to provide a method which can give information about the sequential organisation of a text with respect to any linguistic unit and to any of its properties – without relying on a specific linguistic approach or grammar. Moreover, the method brings with it several advantages, which will be described below. The construction of this unit, the motif (originally called segment or sequence, cf. Köhler 2006, 2008a,b; Köhler/Naumann 2008, 2009, 2010), was inspired by the so-called F-motiv for musical “texts” (Boroda 1982). Boroda was in search of a unit which could replace the word as used in linguistics for frequency studies in musical pieces. Units common in musicology were not usable for his purpose, and so he defined the "F-Motiv" with respect to the duration of the notes of a musical piece.
2 The new unit 'motif'

Since a much more general approach is needed, the linguistic motif is defined as the longest continuous sequence of equal or increasing values representing a quantitative property of a linguistic unit.
Thus,
An L-motif is a continuous series of equal or increasing length values (e.g. of morphs, words or sentences).
An F-motif is a continuous series of equal or increasing frequency values (e.g. of morphs, words or syntactic construction types).
A P-motif is a continuous series of equal or increasing polysemy values (e.g. of morphs or words).
A T-motif is a continuous series of equal or increasing polytextuality values (e.g. of morphs, words or syntactic construction types).
An example of an L-motif segmentation is the following. The sentence

“Word length studies are almost exclusively devoted to the problem of distributions.”
is, according to the above-given definition, represented by a sequence of 5 L-motifs: (1-1-2) (1-2-4) (3) (1-1-2) (1-4) if the definition is applied to word length measured in the number of syllables. Similarly, motifs can be defined for any linguistic unit (phone, phrase [type], clause [type], etc.) and for any linguistic property (poly-functionality, complexity, familiarity etc.).

Variants of investigations based on motifs can be generated by changing the direction in which these units are segmented, i.e. beginning from the first unit in the text/discourse and proceeding forward or beginning from the last item and applying the definition in the opposite direction, and by replacing “increasing” by “decreasing” values of the given property in the definition of the motif1. We do not expect statistically significant differences in the results. In contrast, different operationalisations of properties will affect the results in many cases, e.g. if word length is measured in the number of letters or in the average duration in ms in speech.

1 It may, e.g., be appropriate to go from right to left when a language with a syntactic left-branching preference is analyzed.

Some of the advantageous properties of the new units are the following:

1. Following the definition, any text or discourse can be segmented in an unambiguous way. In many cases, the segmentation can be done automatically using simple computer programs. Frequency motifs can be determined in two ways: (1) the frequency of the units under study is counted in the studied text or corpus itself, or (2) it is taken from a frequency dictionary. The same holds for polytextuality motifs, only that a single text does not suffice, of course, whereas polysemy must be looked up in a dictionary. Length motifs can be determined automatically to the extent to which the writing system reflects the units in which length is to be counted. Alphabetic scripts provide good conditions for character counts, whereas syllabic scripts abet syllable counts. Special circumstances such as with Chinese are also helpful if syllables are to be counted. Syntactic complexity, depth of embedding and other more complicated properties can also be used to form motifs, but determining them automatically presupposes previously annotated text material. Even if segmentation into motifs cannot be done automatically, the important advantage remains that the result does not depend on any interpretation but is objective and unambiguous.

2. Segmentation into motifs is always exhaustive, i.e. no rest will remain. The successor of a numerical value in a sequence is always (1) larger than or equal to the given value – or (2) smaller. In the first case, the successor belongs to the current motif; in the second case, it starts a new motif. The last value in a text does not cause any problem; for the first one, we have to add the only additional rule: it starts the first motif.

3. Motifs have an appropriate granularity. They can always be operationalised in a way that segmentation takes place in the same order of magnitude as the phenomena under analysis.

4. Motifs are scalable with respect to granularity. One and the same definition can be applied iteratively: it is possible to form motifs on the basis of length or frequency values etc. of motifs. A closer look reveals that there are two different modes in a scaled analysis. The first level of motifs is based on a property of genuine linguistic units, e.g. word length, which is counted in terms of, e.g., syllables, phones, characters, morphs, or even given as duration in milliseconds. On the second level, on which the length of length motifs is determined, only numbers exist. Thus, the concept of length is a different one than that on the first level. The length motifs of length motifs of word length or higher levels, however, do not add new aspects. It goes without saying that corresponding differences exist if other properties or units are used to form motifs. The scaling mechanism can be used to generate infinitely many new kinds of motifs. Thus, frequency motifs of word length motifs can be formed as well as length motifs of frequency motifs of morph polysemy motifs. This may sound confusing and arbitrary; nevertheless, a number of such cross-category motifs were used and proved to be useful for text classification purposes (cf. Köhler, Naumann 2010).
In the cited work, a notation was introduced to avoid long-winded expressions. For length motifs of frequency motifs of word lengths, the symbol LFL was used, etc. In a more general context, a symbolic representation of the basic unit should be added; e.g. the symbol LFP(m) could be appropriate for length motifs of frequency motifs of morph polysemy.

5. Motifs display a rank-frequency distribution of the Zipf-Mandelbrot type, i.e. they behave in this respect in a way similar to other, more intuitive units of linguistic analysis. The empirical parameters of this law, e.g. applied to the distribution of frequency motifs of word frequencies, were also used for text classification (l.c.).
3 Application

Let us consider a text, e.g. one of the end-of-year speeches of Italian presidents, in which the words have been replaced by their lengths measured in syllables. The beginning of the corresponding sequence looks as follows:

2 2 1 1 2 1 3 2 3 2 2 1 1 1 4 1 5 1 1 2 1 2 2 1 2 1 4 3 2 1 1 1 4 1 3 4 2 1 1 4 2 2 2 2 4 2 1 4 ...

Forming L-motifs according to the definition given above yields (we do not need the parentheses):

2-2 1-1-2 1-3 2-3 2-2 1-1-1-4 1-5 1-1-2 1-2-2 1-2 1-4 3 2 1-1-1-4 1-3-4 2 1-1-4 ...
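This segmentation is mechanical and easy to automate. The following minimal sketch (not the author's own tool) splits a sequence of numerical property values into L-motifs exactly as defined above and reproduces the segmentation of the quoted word-length sequence:

```python
def segment_motifs(values):
    """Split a sequence of numerical property values (e.g. word lengths)
    into motifs: maximal runs of equal or increasing values. A new motif
    starts whenever a value is smaller than its predecessor."""
    motifs, current = [], []
    for v in values:
        if current and v < current[-1]:
            motifs.append(current)
            current = []
        current.append(v)
    if current:
        motifs.append(current)
    return motifs

# The beginning of the Italian word-length sequence quoted above:
seq = [2, 2, 1, 1, 2, 1, 3, 2, 3, 2, 2, 1, 1, 1, 4, 1, 5, 1, 1, 2]
print(['-'.join(map(str, m)) for m in segment_motifs(seq)])
# ['2-2', '1-1-2', '1-3', '2-3', '2-2', '1-1-1-4', '1-5', '1-1-2']
```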
The rank-frequency distribution of the motifs in the complete text is given in Table 1.

Table 1: Rank-frequency distribution of the L-motifs (syllables) in an Italian text.

Rank   Motif             Frequency   Rank   Motif                 Frequency
1      1-3               297         71     2-3-4                 2
2      1-2               218         72     1-1-3-3-5             2
3      1-4               151         73     3-3-4                 2
4      2                 141         74     2-6                   2
5      1-1-3              86         75     1-1-2-2-3-4           2
6      1-2-3              74         76     2-2-2-4               2
7      1-1-2              62         77     1-3-4-4               2
8      1-2-2              53         78     2-4-4                 2
9      2-3                52         79     1-1-1-2-2-2           2
10     1-1-4              50         80     1-1-2-2-5             2
11     3                  49         81     3-4                   2
12     2-2                47         82     1-2-4-4               2
13     1-5                35         83     1-1-1-1-2-2           2
14     1-2-4              31         84     2-2-2-2-4             1
15     1-3-4              26         85     1-1-2-2-2-2-3         1
16     1-3-3              25         86     2-2-2-2-3-3           1
17     2-4                24         87     2-2-2-2-3             1
18     1-1-1-3            24         88     1-2-2-2-2-2-2-4       1
19     1-1-5              21         89     1-1-1-1-1-1-2-2       1
20     1-2-5              21         90     1-1-1-1-1-2           1
21     1-2-2-3            17         91     1-1-1-1-1-5           1
22     2-2-2              17         92     1-1-2-2-3-3           1
23     1-2-2-2            17         93     2-2-3-3               1
24     1-1-3-3            16         94     4-4-4                 1
25     1-1-2-3            16         95     2-2-2-2-2             1
26     1-1-2-2            15         96     1-1-2-2-2-3           1
27     2-5                15         97     1-1-2-2-2-2           1
28     1-6                14         98     4-4                   1
29     2-2-3              13         99     1-2-2-2-2-2-2-3       1
30     1-2-2-4            12         100    1-1-3-6               1
31     1-1-1-2            12         101    1-1-2-2-5-5           1
32     1-1-1-4            10         102    1-1-1-1-2-2-2         1
33     4                  10         103    1-2-3-3-4             1
34     1-3-5               9         104    1-2-2-6               1
35     2-3-3               9         105    2-2-2-5               1
36     1-1-2-4             8         106    1-1-1-1-1-3           1
37     1-1-3-4             8         107    1-1-1-1-1-3-3-3       1
38     1-1-2-2-2           8         108    1-3-3-3-4             1
39     1-4-4               7         109    1-4-7                 1
40     1-1-2-2-3           7         110    5                     1
41     3-3                 7         111    1-1-2-3-4             1
42     1-2-2-2-2           6         112    1-1-1-2-2             1
43     1-2-3-3             6         113    1-1-1-2-2-2-3         1
44     1-2-2-2-3           6         114    1-1-1-1-6             1
45     1-1-1-1-4           5         115    1-2-2-3-4             1
46     2-2-2-2             5         116    2-2-2-3-3             1
47     1-2-6               5         117    1-4-4-5               1
48     1-1-4-4             5         118    1-1-1-3-4             1
49     1-1-1-1-3           5         119    1-4-5                 1
50     2-3-3-3             4         120    1-5-5                 1
51     1-3-3-4             4         121    1-1-2-7               1
52     1-3-3-3             4         122    2-2-2-6               1
53     1-2-2-2-4           4         123    1-2-2-2-3-4           1
54     1-1-2-5             4         124    2-4-5                 1
55     1-1-1-5             4         125    2-2-2-3-4             1
56     1-2-3-4             3         126    1-2-3-7               1
57     1-1-2-6             3         127    1-1-4-4-4             1
58     1-2-2-5             3         128    1-1-4-4-4-4           1
59     1-1-2-3-3           3         129    1-1-1-1-3-4           1
60     1-1-1-3-3           3         130    1-2-2-3-5             1
61     1-1-1-2-4           3         131    2-2-4-4               1
62     2-2-2-3             3         132    1-1-2-2-2-4           1
63     1-1-1-2-3           3         133    1-2-2-2-2-2           1
64     1-1-3-3-3           3         134    1-2-4-5               1
65     2-2-4               3         135    1-2-2-2-5             1
66     1-1-6               3         136    1-3-3-3-5             1
67     1-1-1-1-2-4         2         137    1-1-1-2-5             1
68     1-2-2-2-2-2-2       2         138    1-1-3-5               1
69     1-3-6               2         139    1-2-2-3-3-3           1
70     2-2-5               2         140    1-1-2-3-3-4           1
Fitting the Zipf-Mandelbrot distribution to the data yields an excellent result: the value of the Chi-square statistic is 30.7651 with 131 degrees of freedom; the probability cannot be distinguished from 1.0. The parameters are estimated as a = 1.6468, b = 3.1218. Figure 1 gives an impression of the fit. So far, the empirical description of one of the aspects of the statistical structure of motif sequences in texts has been shown. Such results can be used for various purposes, because the differences between the parameters of two empirical distributions can be tested for significance, e.g. for the comparison of authors, texts, and text sorts, and for classification purposes (cf. Köhler, Naumann 2010). As to the question which theoretical probability distribution should be expected for motifs, a general hypothesis was set up in Köhler (2006). It states that the frequency
distributions of motifs are similar to the distributions of the basic units. This hypothesis was successfully tested in the cited work, in the cited papers by Köhler and Naumann, and in Mačutek (2009).
Fig. 1: Bi-logarithmic graph of the Zipf-Mandelbrot distribution as fitted to the data from Table 1.
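The reported fit can be reproduced in outline as follows. This is only a hedged sketch of the general procedure (Zipf-Mandelbrot probabilities, right-truncated at the number of observed motif types, fitted by minimising the Chi-square statistic); it is not the software behind the estimates above, and the frequency list is truncated for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def zipf_mandelbrot_probs(a, b, n):
    """Right-truncated Zipf-Mandelbrot probabilities for ranks 1..n:
    p_r proportional to (r + b) ** (-a)."""
    ranks = np.arange(1, n + 1)
    weights = (ranks + b) ** (-a)
    return weights / weights.sum()

def chi_square(params, freqs):
    a, b = params
    freqs = np.asarray(freqs, float)
    expected = freqs.sum() * zipf_mandelbrot_probs(a, b, len(freqs))
    return np.sum((freqs - expected) ** 2 / expected)

# Observed motif frequencies ordered by rank (first ten entries of Table 1):
freqs = [297, 218, 151, 141, 86, 74, 62, 53, 52, 50]
result = minimize(chi_square, x0=[1.5, 3.0], args=(freqs,),
                  bounds=[(0.1, 5.0), (0.0, 50.0)])
print(result.x)   # estimated (a, b)
```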
The following is an example which may illustrate that the underlying probability distribution of the basic units resp. their properties and of the corresponding motifs can be theoretically derived and thus explained. In Köhler, Naumann (2009), sentence length was studied. As opposed to most studies on sentence length, this quantity was measured in terms of the number of clauses. There are two main reasons to do so. First, the word, the unit commonly used for this purpose, is not the immediate constituent of the sentence (in the sense of the Menzerath-Altmann Law). Second, frequency counts of sentence length based on the number of words display ragged distribution shapes with notoriously under-populated classes. The data are usually pooled into intervals of at least ten but do not display smooth distributions nevertheless. The length motifs formed on the basis of clause counts yielded an inventory of motif types, which turned out to be distributed according to the Zipf-Mandelbrot distribution, as expected. The next step was the attempt to theoretically derive the probability distribution of the length of these length motifs. The corresponding considerations were as follows:

[1] In a given text, the mean sentence length, the estimation of the mathematical expectation of sentence length, can be interpreted as the sentence length intended by the text expedient (speaker/writer).
[2] Shorter sentences are formed in order to decrease decoding/processing effort (the requirement minD in synergetic linguistics) within the sentence. This tendency will be represented by the quantity D.

[3] Longer sentences are formed where they help to compactify what otherwise would be expressed by two or more sentences and where the more compact form decreases processing effort with respect to the next higher (inter-sentence) level. This will be represented by H.

[2] and [3] are causes for deviations from the mean length value while they, at the same time, compete with each other. This interdependence can be expressed using Altmann’s approach (Köhler, Altmann 1966): the probability of a sentence length x is proportional to the probability of sentence length x − 1. The function

P_x = \frac{D}{x + H - 1}\, P_{x-1}

represents the above-sketched relations: D has an increasing influence on this relation whereas H has a decreasing one. The probability class x itself also has a decreasing influence, which reflects the fact that the probability of long sentences decreases with the length. This equation leads to the hyper-Poisson distribution (Wimmer/Altmann 1999, 281):

P_x = \frac{a^x}{{}_1F_1(1; b; a)\, b^{(x)}}, \qquad x = 0, 1, 2, \ldots;\ a > 0,\ b > 0

where {}_1F_1(1; b; a) is the confluent hypergeometric function

{}_1F_1(1; b; a) = \sum_{j=0}^{\infty} \frac{a^j}{b^{(j)}}

and b^{(x)} = b(b + 1) \cdots (b + x - 1).

According to this derivation, the hyper-Poisson distribution, which plays a basic role with word length distributions (Best 1997), should therefore also be a good model of L-motif length on the sentence level, although motifs on the word level, regardless of the property considered (length, polytextuality, frequency), follow the hyper-Pascal distribution (Köhler 2006; Köhler/Naumann 2008). Figure 2 shows one example of the fitting tests, which support the specific hypothesis derived above as well as the general hypothesis that motif distributions resemble the distributions of the basic units resp. properties. Similarly, in Köhler (2006) a theoretical model of the distribution of the length of L-motifs was derived, which yielded the hyper-Pascal distribution. Empirical tests confirmed this hypothesis.
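A small numerical sketch of this distribution: the probabilities can be computed directly from the recurrence P_x = a/(b + x − 1)·P_{x−1}, which is equivalent to the closed form above; the parameter values below are illustrative, not the fitted values behind Figure 2.

```python
def hyper_poisson_pmf(a, b, x_max=50):
    """Hyper-Poisson probabilities P_0 .. P_x_max, computed from the
    recurrence P_x = a / (b + x - 1) * P_{x-1} and then normalised,
    which avoids evaluating 1F1(1; b; a) explicitly."""
    probs = [1.0]                       # unnormalised P_0
    for x in range(1, x_max + 1):
        probs.append(probs[-1] * a / (b + x - 1))
    total = sum(probs)                  # approximates 1F1(1; b; a)
    return [p / total for p in probs]

# Illustrative parameters (hypothetical, not taken from the paper):
pmf = hyper_poisson_pmf(a=2.0, b=1.5, x_max=20)
print([round(p, 4) for p in pmf[:6]])
```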
Fig. 2: Fitting the hyper-Poisson distribution to the frequency distribution of the lengths of L-motifs on the sentence level.
4 Motifs on the basis of categorical sequences

Linguistic data are often categorical. The definition of motifs given in the introduction prevents such data from forming motifs: metrical or ordinal comparisons are not defined for variables on a nominal/categorical scale. An example of categorical data was discussed in Beliankou, Köhler, Naumann (2013), where argumentation structures were in the focus. Argumentation relations such as circumstance, condition, concession, evidence, elaboration, contrast, or evaluation are obviously categorical values. For the purpose of the cited study, a definition of motifs was given which allowed for the formation of two kinds of motifs. The first one is based on the repetition of a value, i.e. the current motif ends as soon as one of the values in the sequence is a repetition of a previous one. This kind of motif was called R-motif:

An R-motif is an uninterrupted sequence of unrepeated elements.
An example of the segmentation of a text fragment taken from one of the annotated newspaper commentaries of the Potsdam corpus2 (represented as a sequence of argumentative relations) into R-motifs is the following: ["elaboration"], ["elaboration", "concession"], ["elaboration", "evidence", "list", "preparation", "evaluation", "concession"], ["evidence", "elaboration", "evaluation"].
The first R-motif consists of a single element because the following relation is a repetition of the first; the second one ends also where one of its elements occurs again etc. On this basis, the lengths of these R-motifs in the Potsdam commentary corpus were determined. The distribution of the motif lengths turned out to abide by the hyper-Binomial distribution (cf. Figure 3):
Fig. 3: The distribution of the length of R-motifs in a corpus annotated for argumentation relations (cf. Beliankou, Köhler, Naumann 2013).
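R-motif segmentation is equally easy to automate. The following sketch (not the authors' code) reproduces the Potsdam-corpus example quoted above:

```python
def segment_r_motifs(labels):
    """Split a sequence of categorical values into R-motifs:
    a motif ends as soon as the next value already occurs in it."""
    motifs, current = [], []
    for label in labels:
        if label in current:            # repetition closes the current motif
            motifs.append(current)
            current = []
        current.append(label)
    if current:
        motifs.append(current)
    return motifs

relations = ["elaboration", "elaboration", "concession", "elaboration",
             "evidence", "list", "preparation", "evaluation", "concession",
             "evidence", "elaboration", "evaluation"]
print(segment_r_motifs(relations))
# [['elaboration'], ['elaboration', 'concession'],
#  ['elaboration', 'evidence', 'list', 'preparation', 'evaluation', 'concession'],
#  ['evidence', 'elaboration', 'evaluation']]
```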
Another variant of categorical motifs was introduced because the argumentation data are organised not only in a sequence but also in a tree. Therefore, the D-motif was defined as follows:

A D-motif is an uninterrupted depth-first path of elements in a tree structure.
2 Stede (2004).
Each motif begins at the root of the tree and follows one of the possible paths down until a terminal element is reached. The length of motifs determined in this way displays a behaviour that differs considerably from that of the R-motifs. A linguistically interpretable theoretical probability distribution which can be fitted to the empirical frequency distribution is the mixed negative binomial distribution (cf. Figure 4).
Fig. 4: Fitting the mixed negative binomial distribution to the D-motif data
This distribution was justified in the paper by the assumption that it is the result of a combination of two processes, viz. the combination of two diversifications, which both result in the negative binomial distribution but with different parameters. We will apply the R-motif method to the Italian text we studied above, but this time with respect to the sequences of part-of-speech tags3. Replacing the words in the text by the symbols for their respective parts of speech yields the sequence (for the tags cf. Table 2):

A N CONG A N DET N CONG V A N V PREP DET N PREP N PRON V V PREP V...
3 Again, many thanks to Arjuna Tuzzi.
Table 2: Tags as used in the annotations to the Italian text.

Tag    Part-of-speech
N      noun
PREP   preposition
V      verb
A      adjective
DET    article
CONG   conjunction
PRON   pronoun
AVV    adverb
NM     proper name
NUM    number
ESC    interjection
We determine the R-motifs and fit the Zipf-Mandelbrot distribution to the frequency distribution of these motifs. The motifs and their frequencies are shown in Table 3. The result of the fitting is excellent. The probability of the Chi-square value is given as 1.0; the parameters are a = 0.9378, b = 0.9553, and n = 602. The number of degrees of freedom is 473 (caused by pooling classes with low frequency). The graph of the distribution and the data is given in Figure 5.

Table 3: Empirical rank-frequency distribution of part-of-speech based R-motifs from an Italian text.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
V N N-PREP PREP PREP-N N-V PRON V-PREP V-CONG PRON-V AVV N-A A N-PRON-V V-AVV V-A N-DET N-PRON
Freq. 155 86 84 39 36 34 34 31 29 28 27 26 24 19 19 19 18 18
Rank Part-of-speech based R-motifs 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319
A-N-PREP-NM AVV-V-N-CONG-PREP AVVAVV-PREP-PRON V-PREP-N-CONG-AVV V-A-DET-N-CONG CONG-A-PRON-AVV-V-N-DET CONG-A-AVV-PRON-V-N A-PREP-DET-N A-N-ESC-AVV AVV-DET-A-N-CONG-PRON-V A-N-AVV-PREP N-PREP-A-V-DET CONG-PRON-V-A-N N-A-AVV V-PREP-A-N-CONG-DET-NM N-AVV-V-A AVV-PRON-V-N-PREP
Freq. 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Rank Part-of-speech based R-motifs 19 N-AVV-V 20 V-DET-N 21 V-DET-N-PREP 22 V-DET 23 N-PREP-A 24 N-CONG 25 A-N 26 DET 27 N-V-DET 28 AVV-V 29 V-PREP-N 30 A-N-V 31 PREP-A 32 V-PRON 33 A-N-PREP 34 N-A-PREP 35 V-PREP-DET-N 36 DET-N-PREP 37 V-CONG-PRON 38 V-A-CONG 39 PREP-V 40 A-V 41 PREP-DET-N 42 PRON-V-CONG 43 A-PREP 44 CONG-PRON-V 45 PREP-PRON 46 PRON-V-AVV 47 PREP-NM 48 N-CONG-DET 49 CONG-V 50 V-N 51 PREP-A-N 52 N-PREP-PRON 53 N-AVV 54 V-DET-N-PRON 55 N-CONG-AVV 56 V-A-N-PREP 57 CONG 58 V-DET-N-CONG 59 DET-A 60 PREP-N-V 61 V-A-PREP-N 62 V-A-AVV 63 A-N-DET
Freq. 16 16 15 14 14 14 13 13 11 10 10 10 10 10 10 10 8 8 8 8 8 8 7 7 7 7 7 6 6 6 6 6 6 5 5 5 5 5 5 5 5 4 4 4 4
Rank Part-of-speech based R-motifs 320 V-A-N-AVV-DET 321 N-AVV-A 322 A-N-PRON-V-AVV 323 V-PREP-DET-N-A 324 DET-N-A-PREP-V 325 N-DET-NM-V 326 DET-N-PRON 327 A-PRON-V-CONG-DET-N-PREP 328 V-N-A-CONG 329 CONG-V-DET-N-ESC 330 V-CONG-DET-PRON 331 PRON-V-PREP-DET-N 332 V-DET-N-A-ESC-PREP 333 DET-N-PREP-PRON-V-A 334 V-A--CONG-N 335 N-PREP-DET-A 336 PREP-N-PRON 337 PRON-V-DET-N-PREP-AVV 338 N-AVV-V-PREP-A 339 V-AVV-PREP-NM-CONG 340 PREP-A-N-AVV-V 341 AVV-DET 342 PREP-NM-V-AVV-DET-N-CONG 343 NM-PREP-DET 344 NM-V-A 345 V-CONG-DET-N-PREP-NM 346 V-AVV-DET 347 N-NM 348 N-PREP-NM-PRON-V-CONG 349 N-NM-A-CONG-PRON-V-PREP 350 NM-PRON-V-PREP 351 V-N-A-AVV 352 V-PRON-DET-N-A 353 V-AVV-PRON-PREP-N-DET 354 PRON-V-PREP-N-CONG 355 DET-N-A-V-PREP 356 PREP-N-A-CONG-PRON 357 PREP-V-A-N-DET 358 V-N-CONG-A 359 AVV-A-N-CONG 360 V-DET-N-CONG-ESC-PREP-PRON 361 AVV-PREP-A-N 362 N-ESC-DET 363 V-N-A-DET 364 A-DET-N
Freq. 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
102 Reinhard Köhler Rank Part-of-speech based R-motifs 64 V-CONG-AVV 65 N-CONG-V 66 A-CONG 67 DET-N 68 PREP-DET-A-N 69 V-PREP-PRON 70 PRON-V-PREP 71 N-PREP-DET 72 V-N-CONG-DET 73 N-A-V 74 V-AVV-PRON 75 N-PREP-V 76 N-PREP-NM 77 DET-N-A 78 N-AVV-V-DET 79 N-ESC 80 PREP-N-DET 81 N-CONG-PRON-V 82 V-A-PRON 83 V-A-N 84 N-A-V-DET 85 N-AVV-PREP 86 PREP-N-A 87 V-N-PRON 88 V-DET-N-PREP-A 89 AVV-PREP-N 90 A-CONG-PRON-V 91 DET-N-V 92 DET-NM-V 93 DET-V 94 A-PREP-N 95 PREP-DET 96 CONG-PRON 97 V-PREP-N-DET 98 PRON-DET-V 99 N-AVV-DET 100 AVV-CONG-V 101 V-A-N-DET 102 DET-N-PREP-NM 103 PREP-A-N-PRON-V 104 DET-N-V-PREP 105 V-PREP-N-PRON 106 N-V-PREP 107 A-AVV-V 108 AVV-V-A
Freq. 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
Rank Part-of-speech based R-motifs 365 N-AVV-PRON 366 PRON-V-PREP-N-DET 367 N-CONG-AVV-DET 368 N--DET 369 N-V-AVV-CONG-PREP 370 AVV-N 371 V-DET-N-AVV-A-CONG 372 V-CONG-PREP-N-PRON-A 373 N-AVV-V-CONG 374 PREP-N-A-DET 375 V-PREP-A-N 376 PRON-CONG-AVV-V-DET-N-A 377 N-AVV-A-PREP 378 V-DET-N-A-PRON-PREP 379 N-NM-DET-PRON-V 380 PREP-N-CONG-AVV-V-A 381 PREP-N-M 382 V-AVV-PREP-DET-N-PRON 383 DET-N-V-PRON 384 N-PREP-V-CONG 385 V-CONG-DET 386 N-AVV-CONG-PRON 387 AVV-V-PRON 388 V-PRON-AVV 389 V-DET-N-CONG-AVV 390 V-CONG-AVV-N-M-N 391 AVV-V-PREP-NM 392 PREP-N-AVV 393 PREP-DET-N-A 394 N-NM-PRON-V-CONG 395 N-A-CONG-AVV-V 396 AVV-DET-N-PREP 397 N-NM-V-CONG-PRON 398 N-A-PRON-AVV-V 399 N--V 400 A-CONG-PRON-V-AVV 401 DET-A-N-V 402 PRON-DET-N-V 403 V-CONG-DET-N-A-AVV 404 CONG-PREP-N-AVV-V 405 CONG-V-AVV 406 AVV-V-N-PREP 407 N-AVV-PREP-DET 408 N-PRON-AVV 409 N-PREP-DET-PRON
Freq. 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Table 3 (continued): the rank–frequency list of part-of-speech based R-motifs continues through rank 602; ranks 109–116 occur with frequency 3, ranks 117–194 with frequency 2, and ranks 195–602 with frequency 1.
Fig. 5: Fitting the Zipf-Mandelbrot distribution to the data in Table 3. Both axes are logarithmic.
It goes without saying that R-motifs, D-motifs, and possibly other variants of motifs which are formed from categorical data can also be used as the basis for forming F-, L- and other kinds of motifs. The procedure can be continued recursively until the point is reached where too few elements are left. Thus, motifs provide a means to analyse texts with respect to their sequential structure in terms of all kinds of linguistic units and properties; even categorical properties can be studied in this way. The granularity of an investigation can be adjusted by iterative application of motif formation, and proven statistical methods can be used for the evaluation. The full potential of this approach has not yet been explored.
References
Beliankou, Andrei, Reinhard Köhler & Sven Naumann. 2013. Quantitative properties of argumentation motifs. In Ivan Obradović, Emmerich Kelih & Reinhard Köhler (eds.), Methods and applications of quantitative linguistics, 33–43. Belgrade: Academic Mind.
Best, Karl-Heinz. 1997. Zum Stand der Untersuchungen zu Wort- und Satzlängen. In Third International Conference on Quantitative Linguistics, 172–176. Helsinki.
Köhler, Reinhard. 2006. The frequency distribution of the lengths of length sequences. In Jozef Genzor & Martina Bucková (eds.), Favete linguis. Studies in honour of Viktor Krupa, 145–152. Bratislava: Slovak Academic Press.
Köhler, Reinhard. 2008a. Word length in text. A study in the syntagmatic dimension. In Sybila Mislovičová (ed.), Jazyk a jazykoveda v pohybe, 416–421. Bratislava: Veda.
Köhler, Reinhard. 2008b. Sequences of linguistic quantities. Report on a new unit of investigation. Glottotheory 1(1). 115–119.
Köhler, Reinhard & Gabriel Altmann. 1996. "Language forces" and synergetic modelling of language phenomena. In Peter Schmidt (ed.), Glottometrika 15, 63–76. Trier: WVT.
Köhler, Reinhard & Sven Naumann. 2008. Quantitative text analysis using L-, F- and T-segments. In Christine Preisach, Hans Burkhardt, Lars Schmidt-Thieme & Reinhold Decker (eds.), Data Analysis, Machine Learning and Applications, 635–646. Berlin & Heidelberg: Springer.
Köhler, Reinhard & Sven Naumann. 2009. A contribution to quantitative studies on the sentence level. In Reinhard Köhler (ed.), Issues in Quantitative Linguistics, 34–57. Lüdenscheid: RAM-Verlag.
Köhler, Reinhard & Sven Naumann. 2010. A syntagmatic approach to automatic text classification. Statistical properties of F- and L-motifs as text characteristics. In Peter Grzybek, Emmerich Kelih & Ján Mačutek (eds.), Text and Language. Structures, functions, interrelations, quantitative perspectives, 81–89. Wien: Praesens.
Mačutek, Ján. 2009. Motif richness. In Reinhard Köhler (ed.), Issues in Quantitative Linguistics, 51–60. Lüdenscheid: RAM-Verlag.
Stede, Manfred. 2004. The Potsdam commentary corpus. In Bonnie Webber & Donna Byron (eds.), Proceedings of the 2004 ACL workshop on discourse annotation, 96–102.
Wimmer, Gejza & Gabriel Altmann. 1999. Thesaurus of univariate discrete probability distributions. Essen: Stamm.
Reinhard Köhler and Arjuna Tuzzi
Linguistic Modelling of Sequential Phenomena: The role of laws
1 Introduction
A number of textual aspects can be represented in the form of linear sequences of linguistic units and/or their properties. Some well-known examples of simple models of such phenomena are the (dynamic variant of the) type-token relation (TTR), representing the gradual development of the 'lexical richness' of a text, i.e. how the number of types increases as the number of tokens increases, and of other properties of a text, the linguistic motifs, which can be applied with any degree of granularity, and time-series models such as ANOVA (cf. the contributions about motifs and time series in this volume). Before conclusions are drawn on the basis of the application of a mathematical model, some important issues should be taken into account. One of the most important ones is the validity of the model, i.e. the question whether a model really represents what we think it does. The simplest possible quantitative model of a linguistic phenomenon is a single number which results from a measurement. Combining two or more numbers by arithmetic operations yields an index, e.g. a quotient. Another way to reflect more than just one property in order to represent a complex property is forming a vector, which consists of as many numbers as dimensions are relevant. The appropriate definition of the measure of a property is fundamental to all the rest of the modelling procedure. Every measure can be used to represent a whole text (such as the original TTR) or to scrutinize the dynamic behaviour of the properties under study from text position to text position, adding a temporal, or more generally, a sequential dimension to the model. The validity of the result depends, of course, on the validity of the basic measure(s) and on the validity of the mathematical function which is set up as a model of its dynamic development. A function which, e.g., increases boundlessly with text position cannot serve as a valid model of any text property, because there are no infinitely long texts and because there is no linguistic property which would yield infinite values if measured in an appropriate way. The same is true of a measure such as the entropy, which gives a reliable estimation of the 'true' value of a property only for an infinitely long
sequence of symbols. In general, the domain of the function must agree with the range and the possible values of the linguistic property. The appropriateness of ANOVA models should also be checked carefully; it is not obvious that this kind of model correctly reflects linguistic reality. Sometimes, categorical data are represented by numbers, e.g. when stressed and unstressed syllables are mapped to the numbers "1" and "0", without any explication of why not the other way round, or to "4" and "-14,000.22", and of why it is correct to calculate with these numbers, which still represent the non-numerical categories "stressed" and "unstressed". Besides validity, every model should also be checked for some other important properties: reliability, interpretability, and simplicity (cf. Altmann 1978, 1988). Moreover, the appropriateness of the scale level of the measure should be considered, i.e. whether the property to be modelled is measured on a nominal, an ordinal, or one of the metrical scales. The choice of the scale level determines which mathematical operations are applicable to the measured data and hence also which kind of mathematical function agrees with the numbers obtained as results of the measurement. Another aspect which must not be neglected when a combination of properties is used to form an index or a vector is the independence of the component variables. An example may illustrate this issue: the most commonly used index of reading ease (the 'Flesch formula') for English texts consists of a linear combination of sentence length (SL) and word length (WL):

F = 0.39 SL + 11.8 WL − 15.59
This may look like a good idea until it turns out that word length is an indirect function of sentence length, as follows from the well-established Menzerath-Altmann law (Altmann 1980, 1983b; Cramer 2005), which has been empirically confirmed over and over again. Thus, this index combines sentence length with sentence length, which is certainly at least redundant. It is not too difficult to set up a measure of a linguistic or textual property; however, many simple measures and also more complex indexes suffer from problems of various kinds. Some of these problems are well known in the scientific community, such as the dependence of the type-token relation on text length. Therefore, attempts have been made (Wimmer & Altmann 1999, Köhler 1993, Kubát & Milička 2013, Covington & McFall 2010) to find a method to circumvent this problem, because this measure is easy to apply and is hoped to be useful for differentiating text sorts or as a technique for authorship attribution and other stylistic tasks. Another good reason not to drop this measure is the fact that not many measures of dynamic, sequential text phenomena are known.
2 The role of laws for text analysis
What frequently remains problematic, even if problems such as the above-mentioned one could be solved, is the lack of knowledge of the statistical properties of a measure. An objective evaluation of a measurement, of the difference between two numbers, of the agreement between a measurement and previous expectations, or of the parameters of a model, however, cannot be obtained without knowledge of the theoretical probability distribution of the measure. Otherwise, the researcher cannot be sure whether, e.g., a number is significant or could be the result of a random effect. Valid and reliable, theoretically justified models are indispensable also when linguistic objects are to be characterised without any significance requirements. Researchers who are interested in quantitative text analysis but lack much linguistic background tend to use methods and measures they are familiar with, e.g. from information theory, mechanics, statistics or quantum physics. These methods may sometimes work for practical purposes even if one cannot be sure whether they really represent the properties the researcher aims at. However, very often they are inappropriate when applied to language and make unrealistic assumptions: infinite length of texts, infinite size of lexicons, normal distribution of the data etc. Moreover, they are rarely interpretable in linguistic terms. For a linguistically meaningful and valid analysis of linguistic objects, linguistic models are required. The most reliable linguistic models are, of course, laws of language and text. Laws are the principal components of a theory – without laws, there is no theory. Furthermore, laws are the only statements which can explain and predict facts, and new phenomena and interrelations can be discovered by logical deduction of consequences from laws, whereas rules and classifications are descriptive statements, which do not entail anything. All this is well known from the philosophy of science (cf. e.g., Bunge 1967, 2005). Nevertheless, when practitioners in computational or corpus linguistics learn about the existence of laws not only in the natural sciences but also in language and text, they often ask whether there is any benefit or application affecting their work. We will illustrate the usefulness of linguistic laws with a practical example.
3 Hypothesis and method
The application presented here is inspired by previous work (Trevisani & Tuzzi 2012, 2013). The corpus examined in previous studies included 63 end-of-year messages delivered by all Presidents of the Italian Republic over the period from 1949 to 2011. The main aims of these studies were to identify a specific temporal pattern, i.e. a curve, for each word and to cluster curves portraying similar temporal patterns. The authors proposed a flexible wavelet-based model for curve clustering in the frame of functional data analysis approaches. However, clear specific patterns could not be observed, as each word possesses its own irregular series of frequency values (cf. Fig. 1). In their conclusions the authors highlighted some critical points in representing the temporal pattern of words as functional objects, the weaknesses of an explorative approach and the lack of a linguistic theory to justify and interpret such a complex and extremely sophisticated model.
Fig. 1: Temporal patterns of six words (taken from Trevisani & Tuzzi 2013)
In the present paper, we will emphasize the linguistic point of view: the material under study is not just a corpus or a set of texts but a sequence of texts representing an Italian political-institutional discourse. The temporal trajectory of the frequency of a word can thus be considered as an indicator of the communicative relevance of a concept at a given point in time. We will not abandon the original idea of characteristic patterns, but instead of expecting a general time-dependent behaviour we will set up the specific hypothesis that the temporal behaviour of the frequency of a word is discourse-specific. A ready-made model of this kind of phenomenon is not available, but there is an established law of a related phenomenon: the logistic function, in linguistics called the Piotrowski (or Piotrowski-Altmann) law. It does not make any statements about
frequency but about the dispersal of units or properties over a community along the time axis. We will argue in the following way: degree of dispersion and frequency are, with respect to communication systems, two sides of the same coin. The more frequently a unit is used, the higher the probability that the unit becomes more familiar to other members of the community – and vice versa: the greater the degree of dispersion, the higher the probability of occurrence. We will therefore adopt this law for our purposes and assume that it is a good model of the dynamics of the frequencies of words within a discourse. The basic form of the Piotrowski-Altmann law (Altmann 1983a) is given as

p_t = C / (1 + a e^{-bt})     (1)

Altmann (1983a) also proposed a modified version for the case of developments where the parameter b is a function of time (2):

p_t = 1 / (1 + a e^{-bt + ct²})     (2)

We adopted the law in the form of function (3), which has an additional parameter because we cannot expect that a word approaches probability 1:

p_t = C / (1 + a e^{-bt + ct²})     (3)
Here, the dependent variable p_t represents the relative probability of a word at time t. On the basis of a plausible, linguistically motivated model, we can further assume that the data, which seem, at first sight, to disagree with any assumption of regularity, display a statistical spread around the actual trend. Therefore, the data, i.e. the relative frequencies, are smoothed in order to enable us to detect the trends and to use them for testing our hypothesis. As smoothing technique, moving averages with window size 7 were applied: the window size was chosen in a way which yields a sufficiently smooth sequence of values and, on the other hand, keeps as many individual values as possible. The smoothing method works as follows. The first window starts with the first pair of (y, x) – or, in our case, (p_t, t) – values and ends with the 7th. The next window starts with the second data point, i.e. (p_{t+1}, t+1), and ends with the pair (p_{t+7}, t+7), and so on, until the last window, which ends with the pair (p_{t+n}, t+n). For each window, the mean of the 7 values it contains is calculated. The means of all the windows form the new data.
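For concreteness, the smoothing and fitting procedure can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation; it assumes NumPy and SciPy and uses a synthetic frequency series in place of the real corpus counts.

```python
import numpy as np
from scipy.optimize import curve_fit

def piotrowski_altmann(t, a, b, c, C):
    """Function (3): p_t = C / (1 + a*exp(-b*t + c*t**2))."""
    return C / (1.0 + a * np.exp(-b * t + c * t ** 2))

def moving_average(values, window=7):
    """Simple moving average: one mean per window of consecutive values."""
    values = np.asarray(values, dtype=float)
    return np.array([values[i:i + window].mean()
                     for i in range(len(values) - window + 1)])

# Illustrative input: 63 yearly relative frequencies (1949-2011) generated
# from a logistic trend plus noise -- placeholder values, not the real data.
years = np.arange(1949, 2012)
t_raw = (years - 1948) / 10.0            # the time transformation used in Table 1
rng = np.random.default_rng(1)
raw = piotrowski_altmann(t_raw, a=6.0, b=1.5, c=0.05, C=2.5) + rng.normal(0, 0.1, years.size)

p_smooth = moving_average(raw, window=7)          # smoothed series
t_smooth = t_raw[:p_smooth.size]                  # time points attached to the windows

params, _ = curve_fit(piotrowski_altmann, t_smooth, p_smooth, p0=(1.0, 1.0, 0.0, 1.0))
a, b, c, C = params
residuals = p_smooth - piotrowski_altmann(t_smooth, *params)
r2 = 1 - np.sum(residuals ** 2) / np.sum((p_smooth - p_smooth.mean()) ** 2)
print(f"a={a:.3f}, b={b:.3f}, c={c:.3f}, C={C:.3f}, R^2={r2:.3f}")
```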
4 Results
Not surprisingly, many words do not display a specific trajectory over time. Word usage depends to a great extent on grammatical and stylistic circumstances – in particular the function words – which should not change with the political or social development in a country. There are, however, function words which, or whose forms, change over time. One example was described by Best and Kohlhase (1983): the German word-form ward "became" was replaced by wurde within the time period from 1445 to 1925, following function (1). In our data, a similar process can be observed: the increase in the frequency of the adverb ci (at the cost of vi) is shown in Fig. 2.
Fig. 2: The increase of the frequency of "ci". Smoothing averages with window size 20.
Content words, too, often show an irregular behaviour with respect to their frequency, because their usage depends on unpredictable thematic circumstances. On the other hand, we assume that there are also many words which reflect, to some extent, the relevance of a concept within the dynamics of the discourse. We selected some words to illustrate this fact. Figures 3 and 4 and the corresponding data in Table 1 show examples of words which follow the typical logistic growth function. The part-of-speech tags which are attached to the words in the captions have the following meanings: "N" stands for noun and "NM" for proper name.
Fig. 3 & 4: The increase of the frequency of "Europa" and "storia". Smoothing averages with window size 7.
Fig. 5-7: Temporal behaviour of the frequencies of selected content words in the sequence of the presidential speeches.
Fig. 8: Temporal behaviour of the frequencies of selected content words in the sequence of the presidential speeches [continued]
Figures 5-8 show graphs of words where only the rapidly increasing branch or the decreasing branch of a reversible development of the curve occurs while Figures 9 and 10 are cases of fully reversible developments.
Fig. 9: Temporal behaviour of the frequencies of selected content words in the sequence of the presidential speeches – fully reversible developments
Fig. 10: Temporal behaviour of the frequencies of selected content words in the sequence of the presidential speeches – fully reversible developments
Table 1: The (smoothed) sequences of relative frequency values of some words and the results of fitting function (3) to the data. The numbers which represent the years have been transformed ((year − 1948)/10) in order to keep the numerical values small enough to be calculated as arguments of the exponential function.
[The table gives, for t = 0 to 5.6, the smoothed relative frequencies p_t of the words Europa, storia, valore, futuro, progresso, terrorismo, problema and violenza, together with the fitted parameters a, b, c, C and the determination coefficient R² of function (3) for each word.]
5 Conclusions
The presented study is an illustration of the fact that data alone do not give an answer to a research question. Only a theoretically grounded hypothesis, tested on appropriate data, produces new knowledge. We assume that the individual kinds of dynamics in fact reflect the relevance of the corresponding concepts in the political discourse but we are not going to propose political interpretations of the findings. In a follow-up study, the complete vocabulary of the presidential discourse will be analysed, and on this basis, it will be possible to find out whether conceptually related words follow similar temporal patterns.
Acknowledgments The authors would like to thank IQLA for providing data for this study.
References
Altmann, Gabriel. 1978. Zur Verwendung der Quotienten in der Textanalyse. Glottometrika 1. 91–106.
Altmann, Gabriel. 1980. Prolegomena to Menzerath's law. Glottometrika 2(2). 1–10.
Altmann, Gabriel. 1983a. Das Piotrowski-Gesetz und seine Verallgemeinerungen. In Karl-Heinz Best & Jörg Kohlhase (eds.), Exakte Sprachwandelforschung, 59–90. Göttingen: edition herodot.
Altmann, Gabriel. 1983b. H. Arens' „Verborgene Ordnung" und das Menzerathsche Gesetz. In Manfred Faust, Roland Harweg, Werner Lehfeldt & Götz Wienold (eds.), Allgemeine Sprachwissenschaft, Sprachtypologie und Textlinguistik, 31–39. Tübingen: Gunter Narr.
Altmann, Gabriel. 1988. Linguistische Meßverfahren. In Ulrich Ammon, Norbert Dittmar & Klaus J. Mattheier (eds.), Sociolinguistics. Soziolinguistik, 1026–1039. Berlin & New York: Walter de Gruyter.
Bunge, Mario. 1967. Scientific Research I, II. Berlin, Heidelberg & New York: Springer.
Bunge, Mario. 1998. Philosophy of science. From problem to theory. New Brunswick & London: Transaction Publishers.
Bunge, Mario. 2007 [1998]. Philosophy of science. From explanation to justification, 4th edn. New Brunswick & London: Transaction Publishers.
Covington, Michael A. & Joe D. McFall. 2010. Cutting the Gordian Knot: The Moving-Average Type-Token Ratio (MATTR). Journal of Quantitative Linguistics 17(2). 94–100.
Cramer, Irene. 2005. Das Menzerathsche Gesetz. In Reinhard Köhler, Gabriel Altmann & Rajmund G. Piotrowski (eds.), Quantitative Linguistik. Ein internationales Handbuch. Quantitative Linguistics. An International Handbook, 659–688. Berlin & New York: Walter de Gruyter.
Köhler, Reinhard & Matthias Galle. 1993. Dynamic aspects of text characteristics. In Luděk Hřebíček & Gabriel Altmann (eds.), Quantitative text analysis (Quantitative Linguistics 52), 46–53. Trier: Wissenschaftlicher Verlag.
Kubát, Miroslav & Jiří Milička. 2013. Vocabulary Richness Measure in Genres. Journal of Quantitative Linguistics 20(4). 339–349.
Trevisani, Matilda & Arjuna Tuzzi. 2012. Chronological analysis of textual data and curve clustering: preliminary results based on wavelets. In Società Italiana di Statistica, Proceedings of the XLVI Scientific Meeting. Padova: CLEUP.
Trevisani, Matilda & Arjuna Tuzzi. 2013. Shaping the history of words. In Ivan Obradović, Emmerich Kelih & Reinhard Köhler (eds.), Methods and Applications of Quantitative Linguistics: Selected papers of the VIIIth International Conference on Quantitative Linguistics (QUALICO), Belgrade, Serbia, April 16–19, 2012, 84–95. Belgrade: Akademska Misao.
Wimmer, Gejza & Gabriel Altmann. 1999. On Vocabulary Richness. Journal of Quantitative Linguistics 6(1). 1–9.
Ján Mačutek and George K. Mikros
Menzerath-Altmann Law for Word Length Motifs
1 Introduction
Motifs are relatively new linguistic units which make possible an in-depth investigation of sequential properties of texts (for the general definition cf. Köhler, this volume, pp. 89-90). They were studied in a handful of papers (Köhler 2006, 2008a,b, this volume, pp. 89-108; Köhler and Naumann 2008, 2009, 2010, Mačutek 2009, Sanada 2010, Milička, this volume, pp. 133-145). Specifically, a word length motif is a continuous series of equal or increasing word lengths (measured here in the number of syllables, although there are also other options, like, e.g., morphemes). In the papers cited above it is supposed that motifs should have properties similar to those of their basic units, i.e., words in our case. Indeed, word frequency and motif frequency, as well as word length (measured in the number of syllables) and motif length (measured in the number of words), can be modelled by the same distributions (power laws, like, e.g., the Zipf-Mandelbrot distribution, and Poisson-like distributions, respectively; cf. Wimmer and Altmann 1999). Also the type-token relations for words and motifs display similar behaviour, differing only in parameter values, but not in models. We enlarge the list of analogous properties of motifs and their basic units, demonstrating (cf. Section 3.1) that the Menzerath-Altmann law (cf. Cramer 2005; MA law henceforth) is valid for word length motifs as well. The MA law describes the relation between the sizes of the construct, e.g., a word, and its constituents, e.g., syllables. It states that the larger the construct (the whole), the smaller its constituents (parts). In particular, for our data it holds that the longer the motif (in the number of words), the shorter the mean length of the words (in the number of syllables) which constitute the motif. In addition, in Section 3.2 we show that for randomly generated texts the MA law is valid as well, but its parameters differ from those obtained from real texts.
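As an illustration of the definition, the following minimal sketch (ours, not the authors'; plain Python) segments a sequence of word lengths into word length motifs and computes the mean word length for each motif length:

```python
def word_length_motifs(lengths):
    """Split a sequence of word lengths into motifs, i.e. maximal
    non-decreasing runs (a run ends as soon as a shorter word follows)."""
    motifs, current = [], []
    for n in lengths:
        if current and n < current[-1]:   # a shorter word closes the motif
            motifs.append(current)
            current = []
        current.append(n)
    if current:
        motifs.append(current)
    return motifs

def mean_word_length_by_motif_length(motifs):
    """Mean word length (in syllables) of words occurring in motifs of a given length."""
    by_len = {}
    for m in motifs:
        by_len.setdefault(len(m), []).append(sum(m) / len(m))
    return {k: sum(v) / len(v) for k, v in sorted(by_len.items())}

# Example: syllable counts of a short stretch of text.
lengths = [2, 1, 2, 2, 1, 3, 2, 1, 2, 2, 2, 3]
motifs = word_length_motifs(lengths)
print(motifs)   # [[2], [1, 2, 2], [1, 3], [2], [1, 2, 2, 2, 3]]
print(mean_word_length_by_motif_length(motifs))
```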
2 Data
In order to study the MA law for word length motifs we compiled a Modern Greek literature corpus totaling 236,233 words. The corpus contains complete versions of literary texts from the same time period and has been strictly controlled for editorial normalization. It contains five novels by four widely known Modern Greek writers, all published by the same publishing house (Kastaniotis Publishing House). All the novels were best-sellers in the Greek market and belong to the "classics" of Modern Greek literature. More specifically, the corpus consists of:
– The mother of the dog, 1990, by Matesis (47,852 words).
– Murders, 1991, by Michailidis (72,475 words).
– From the other side of the time, 1988, by Milliex [1] (77,692 words).
– Dreams, 1991, by Milliex [2] (9,761 words) – test novel.
– The dead liqueur, 1992, by Xanthoulis (28,453 words).
The basic descriptive statistics of the corpus appear in Table 1.

Table 1: Basic descriptive statistics of the data.

                                Matesis   Michailidis   Milliex [1]   Milliex [2]   Xanthoulis
number of words                  47,852        72,475        77,692         9,761       28,453
number of motifs                 19,283        29,144        32,034         4,022       11,236
different motifs                    316           381           402           192          289
mean word length in syllables      2.09          2.03          2.10          2.13         2.07
mean motif length in words         2.48          2.49          2.43          2.43         2.53
mean motif length in syllables     5.20          5.05          5.09          5.17         5.23
3 Results
3.1 MA law for word length motifs in Modern Greek texts
The results obtained confirmed our expectation that the MA law should be valid also for word length motifs. The tendency of mean word length (measured in the number of syllables) to decrease with increasing motif length (measured in the number of words) is obvious in all five texts investigated, cf. Table 2. We modelled the relation by the function
y(x) = α x^b     (1)

where y(x) is the mean length of words which occur in motifs consisting of x words; α and b are parameters. Given that y(1) = α, we replaced α with the mean length of words from motifs of length 1, i.e., motifs consisting of one word only (cf. Kelih 2010, Mačutek and Rovenchak 2011). In order to avoid too strong fluctuations, only motif lengths which appeared in particular texts at least 10 times were taken into account (cf. Kelih 2010). The appropriateness of the fit was assessed in terms of the determination coefficient R² (values higher than 0.9 are usually considered satisfying, cf., e.g., Mačutek and Wimmer 2013). The numerical results (values of R² and parameter values for which R² reaches its maximum) are presented in Table 2.

Table 2: Fitting function (1) to the data. ML – motif length, MWLo – observed mean word length, MWLt – theoretical mean word length resulting from (1).
        Matesis        Michailidis    Milliex [1]    Milliex [2]    Xanthoulis
ML      MWLo  MWLt     MWLo  MWLt     MWLo  MWLt     MWLo  MWLt     MWLo  MWLt
1       2.30  2.30     2.32  2.32     2.34  2.34     2.39  2.39     2.42  2.42
2       2.13  2.15     2.08  2.13     2.13  2.18     2.16  2.22     2.13  2.19
3       2.08  2.07     2.00  2.03     2.08  2.09     2.11  2.12     2.02  2.06
4       2.04  2.01     1.97  1.96     2.04  2.02     2.09  2.05     2.01  1.97
5       1.96  1.97     1.90  1.90     1.99  1.98     2.01  2.01     1.96  1.91
6       1.94  1.94     1.88  1.86     1.94  1.94     1.94  1.97     1.92  1.86
7       1.98  1.91     1.87  1.83     1.94  1.91     1.96  1.88     1.74  1.82
8       1.81  1.88     1.69  1.80     1.84  1.88     –     –        –     –
9       –     –        1.84  1.77     –     –        –     –        –     –
b       −0.096         −0.123         −0.105         −0.109         −0.147
R²      0.9195         0.9126         0.9675         0.9582         0.9312
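The fitting step can be sketched as follows. This is a minimal illustration, not the authors' code; it assumes SciPy and uses the observed values for the Matesis text from Table 2. Note that the paper selects b so as to maximize R², while curve_fit minimizes squared error, so the estimates need not coincide exactly.

```python
import numpy as np
from scipy.optimize import curve_fit

# Observed mean word lengths for the Matesis text (first column of Table 2).
x = np.arange(1, 9)                      # motif lengths 1..8
y_obs = np.array([2.30, 2.13, 2.08, 2.04, 1.96, 1.94, 1.98, 1.81])

alpha = y_obs[0]                         # y(1) = alpha: mean word length in one-word motifs

def ma_law(x, b):
    """Function (1) with alpha fixed in advance; only b is estimated."""
    return alpha * x ** b

popt, _ = curve_fit(ma_law, x, y_obs, p0=[-0.1])
b_hat = popt[0]
y_fit = ma_law(x, b_hat)
r2 = 1 - np.sum((y_obs - y_fit) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)
print(f"b = {b_hat:.3f}, R^2 = {r2:.4f}")
```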
3.2 MA law for word length motifs in random texts
The results presented in the previous section show that longer motifs contain shorter words and vice versa. The relation between the lengths (i.e., the MA law) can be modelled by a simple power function. One cannot, however, a priori exclude the possibility that the observed regularities are necessary in the sense that they could be only a consequence of some other laws. In this particular case, it seems reasonable to ask whether the MA law remains valid if the distribution of word length is kept, but the sequential structure of word lengths is deliberately forgotten. Randomization (i.e., random generation of texts – or of only some properties of texts – by means of computer programs) is a useful tool for finding answers to questions of this type. It is slowly finding its way into linguistic research (cf., e.g., Benešová and Čech, this volume, pp. 57-69, and Milička, this volume, pp. 133-145, for other analyses of the MA law; Liu and Hu 2008 applied randomization to refute claims that small-world and scale-free complex language networks automatically give rise to syntax). In order to get rid of the sequential structure of word lengths, while at the same time preserving the word length distribution, we generated random numbers from the distribution of word length in each of the five texts under investigation. The number of generated random word lengths is always equal to the text length of the respective real text (e.g., we generated 47,852 random word lengths for the text by Matesis, as it contains 47,852 words, cf. Table 1). Then, we fitted function (1) to the randomly generated data. The outcomes of the fitting can be found in Table 3. The generated data were truncated at the same points as their counterparts from real texts, cf. Section 3.1.

Table 3: Fitting function (1) to the data. ML – motif length, MWLr – mean word length from randomly generated data, MWLf – fitted values resulting from (1).
        Matesis        Michailidis    Milliex [1]    Milliex [2]    Xanthoulis
ML      MWLr  MWLf     MWLr  MWLf     MWLr  MWLf     MWLr  MWLf     MWLr  MWLf
1       2.41  2.41     2.36  2.36     2.44  2.44     2.42  2.42     2.42  2.42
2       2.27  2.19     2.20  2.13     2.28  2.20     2.29  2.20     2.26  2.18
3       2.11  2.07     2.06  2.01     2.15  2.07     2.15  2.08     2.12  2.05
4       2.02  1.99     1.96  1.93     2.02  1.99     2.07  2.00     1.98  1.97
5       1.94  1.93     1.87  1.87     1.94  1.92     1.97  1.94     1.89  1.90
6       1.86  1.88     1.84  1.82     1.83  1.87     1.86  1.89     1.82  1.85
7       1.81  1.84     1.77  1.78     1.82  1.83     1.72  1.85     1.76  1.81
8       1.78  1.81     1.69  1.74     1.73  1.79     –     –        –     –
9       –     –        1.66  1.71     –     –        –     –        –     –
b       −0.138         −0.146         −0.148         −0.137         −0.150
R²      0.9685         0.9682         0.9553         0.8951         0.9595
It can be seen that the MA law holds also in this case; however, the parameters b in random texts are different from the ones from real texts. The parameters in the random texts always have larger absolute values, which means that the respective curves are steeper, i.e., they decrease more quickly.
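The randomization step described in this section can be sketched as follows (a minimal illustration, not the authors' implementation): word lengths are resampled from the empirical word length distribution of a text, so the distribution is preserved while the original order is destroyed, and the motif analysis of Section 3.1 is then repeated on the generated sequence.

```python
import random
from collections import Counter

def random_length_sequence(lengths, rng=random):
    """Draw len(lengths) word lengths from the empirical distribution of `lengths`."""
    counts = Counter(lengths)
    values = list(counts.keys())
    weights = [counts[v] for v in values]
    return rng.choices(values, weights=weights, k=len(lengths))

# `lengths` would be the syllable counts of all words of a text, e.g. the
# 47,852 values for the text by Matesis; here a tiny placeholder is used.
lengths = [2, 1, 2, 2, 1, 3, 2, 1, 2, 2, 2, 3]
random.seed(42)
pseudo = random_length_sequence(lengths)
# The pseudo sequence is then segmented into motifs and fitted exactly as in Section 3.1.
print(pseudo)
```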
As an example, in Fig. 1, we present data from the first text (by Matesis, cf. Section 2) together with the mathematical model (1) fitted to the data.
Fig. 1: Data from the text by Matesis (circles) and from the respective random text (diamonds), together with fitted models (dashed line – the model for the real text, solid line – the model for the random text)
The other texts behave similarly. The validity of the law in random texts (created under the condition that their word length distribution is the same as in the real ones) can be deductively explained directly from the definition of word length motifs (cf. Section 1). A word length motif is ended at the place where the sequence of word lengths decreases. Hence, if a long word appears in a motif, it is likely to be the last element of the motif (the probability that the next word would be of the same length – or even longer – is small, because long words occur relatively seldom). Consequently, if a long word appears soon, the motif tends to be short in terms of words, but the mean word length in such a motif will be high (because the length of one long word has a higher weight in a short motif than in a long one). Differences in the values of the parameters b in real and randomized texts indicate that, besides the obvious impact of the word length distribution, the sequential structure of word lengths also plays an important role in the MA law for word length motifs.
4 Conclusions
The paper brings another confirmation that word length motifs behave in the same way as other, more "traditional" linguistic units. In addition to the motif frequency distribution and the distribution of motif length (which were shown to follow the same patterns as words), also the MA law is valid for word length motifs: specifically, the more words a motif contains, the shorter the mean syllabic length of the words in the motif. The MA law can be observed also in random texts if the word length distributions in a real text and in its random counterpart are the same. The validity of the law in random texts can be explained deductively from the word length distribution. However, the parameters in the exponents of the power function which is a mathematical model of the MA law are different for real and random texts. The power functions corresponding to random texts are steeper. The difference in parameter values proves that not only the word length distribution, but also the sequential structure of word lengths has an impact on word length motifs. It remains an open question whether parameters of the MA law can be used as characteristics of languages, genres or authors. If the answer is positive, they could possibly be applied to language classification, authorship attribution and similar fields.
Acknowledgments J. Mačutek was supported by VEGA grant 2/0038/12.
References
Benešová, Martina & Radek Čech. 2015. Menzerath-Altmann law versus random models. This volume, pp. 57–69.
Cramer, Irene M. 2005. Das Menzerathsche Gesetz. In Reinhard Köhler, Gabriel Altmann & Rajmund G. Piotrowski (eds.), Quantitative Linguistics. An international handbook, 659–688. Berlin & New York: de Gruyter.
Kelih, Emmerich. 2010. Parameter interpretation of the Menzerath law: Evidence from Serbian. In Peter Grzybek, Emmerich Kelih & Ján Mačutek (eds.), Text and language. Structures, functions, interrelations, quantitative perspectives, 71–79. Wien: Praesens.
Köhler, Reinhard. 2006. The frequency distribution of the lengths of length sequences. In Jozef Genzor & Martina Bucková (eds.), Favete linguis. Studies in honour of Viktor Krupa, 145–152. Bratislava: Slovak Academic Press.
Köhler, Reinhard. 2008a. Word length in text. A study in the syntagmatic dimension. In Sybila Mislovičová (ed.), Jazyk a jazykoveda v pohybe, 416–421. Bratislava: Veda.
Köhler, Reinhard. 2008b. Sequences of linguistic quantities. Report on a new unit of investigation. Glottotheory 1(1). 115–119.
Köhler, Reinhard. 2015. Linguistic motifs. This volume, pp. 89–108.
Köhler, Reinhard & Sven Naumann. 2008. Quantitative text analysis using L-, F- and T-segments. In Christine Preisach, Hans Burkhardt, Lars Schmidt-Thieme & Reinhold Decker (eds.), Data analysis, machine learning and applications, 635–646. Berlin & Heidelberg: Springer.
Köhler, Reinhard & Sven Naumann. 2009. A contribution to quantitative studies on the sentence level. In Reinhard Köhler (ed.), Issues in quantitative linguistics, 34–57. Lüdenscheid: RAM-Verlag.
Köhler, Reinhard & Sven Naumann. 2010. A syntagmatic approach to automatic text classification. Statistical properties of F- and L-motifs as text characteristics. In Peter Grzybek, Emmerich Kelih & Ján Mačutek (eds.), Text and language. Structures, functions, interrelations, quantitative perspectives, 81–89. Wien: Praesens.
Liu, Haitao & Fengguo Hu. 2008. What role does syntax play in a language network? EPL 83. 18002.
Mačutek, Ján. 2009. Motif richness. In Reinhard Köhler (ed.), Issues in quantitative linguistics, 51–60. Lüdenscheid: RAM-Verlag.
Mačutek, Ján & Andrij Rovenchak. 2011. Canonical word forms: Menzerath-Altmann law, phonemic length and syllabic length. In Emmerich Kelih, Victor Levickij & Yuliya Matskulyak (eds.), Issues in quantitative linguistics 2, 136–147. Lüdenscheid: RAM-Verlag.
Mačutek, Ján & Gejza Wimmer. 2013. Evaluating goodness-of-fit of discrete distribution models in quantitative linguistics. Journal of Quantitative Linguistics 20(3). 227–240.
Milička, Jiří. 2015. Is the distribution of L-motifs inherited from the word lengths distribution? This volume, pp. 133–145.
Sanada, Haruko. 2010. Distribution of motifs in Japanese texts. In Peter Grzybek, Emmerich Kelih & Ján Mačutek (eds.), Text and Language. Structures, functions, interrelations, quantitative perspectives, 183–194. Wien: Praesens.
Wimmer, Gejza & Gabriel Altmann. 1999. Thesaurus of univariate discrete probability distributions. Essen: Stamm.
Jiří Milička
Is the Distribution of L-Motifs Inherited from the Word Length Distribution?
1 Introduction
An increasing number of papers¹ shows that word length sequences can be successfully analyzed by means of L-motifs, which are a very promising attempt to discover the syntagmatic relations of the word lengths in a text. The L-motif² has been defined by Reinhard Köhler (2006a) as:
(...) the text segment which, beginning with the first word of the given text, consists of word lengths which are greater or equal to the left neighbour. As soon as a word is encountered which is shorter than the previous one the end of the current L-Segment is reached.
Thus, the fragment (1) will be segmented as shown by the L-segment sequence (2):
(1) Azon a tájon, ahol most Budapest fekszik, már nagyon régen laknak emberek.
(2) 2, 122, 13, 2, 12223 (i.e. the L-segments (2), (1,2,2), (1,3), (2) and (1,2,2,2,3))
The main advantage of such segmentation is that it can be applied iteratively, i.e. L-motifs of the L-motifs can be obtained (so called LL-motifs). Applying the method several times results in not very intuitive sequences, which, however, follow lawful patterns3 and they are even practically useful, e.g. for automatic text classification (Köhler – Naumann 2010).
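The iteration can be sketched as follows (our minimal illustration in Python, not the author's code); here the lengths of the motifs form the new sequence, which is segmented again to obtain LL-motifs, and so on – one common way of operationalizing "L-motifs of the L-motifs".

```python
def l_motifs(seq):
    """Segment a sequence of numbers into L-motifs (maximal non-decreasing runs)."""
    motifs, current = [], []
    for n in seq:
        if current and n < current[-1]:
            motifs.append(current)
            current = []
        current.append(n)
    if current:
        motifs.append(current)
    return motifs

def iterate_l_motifs(word_lengths, order):
    """Apply L-segmentation `order` times: 1 gives L-motifs, 2 gives LL-motifs, ..."""
    seq = list(word_lengths)
    motifs = []
    for _ in range(order):
        motifs = l_motifs(seq)
        seq = [len(m) for m in motifs]   # lengths of the motifs feed the next round
    return motifs

word_lengths = [2, 1, 2, 2, 1, 3, 2, 1, 2, 2, 2, 3]
print(iterate_l_motifs(word_lengths, 1))   # [[2], [1, 2, 2], [1, 3], [2], [1, 2, 2, 2, 3]]
print(iterate_l_motifs(word_lengths, 2))   # over the lengths [1, 3, 2, 1, 5] -> [[1, 3], [2], [1, 5]]
```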
1 See Köhler – Naumann (2010), Mačutek (2009), Sanada (2010).
2 Former term was L-segments, see Köhler (2006a).
3 E.g. the rank-frequency relation of the L-motifs distribution can be successfully described by the Zipf-Mandelbrot distribution, which is a well-established law for the word types rank-frequency relation.

2 Hypothesis
However, it needs to be admitted that the fact that the L-motifs follow lawful patterns does not imply that the L-motifs reflect a syntagmatic relation of the word lengths, since these properties could be merely inherited from the word length distribution in the text, which has not been tested yet. The paper focuses on the most important property of L-motifs – the frequency distribution of their types – and tests the following hypothesis: The distribution of L-motifs measured on the text T differs from the distribution of L-motifs measured on a pseudotext T'. The pseudotext T' is created by the random transposition of all tokens of the text T within the text T.4
3 Data
The hypothesis was tested on three Czech and six Arabic texts:

Table 1: The list of texts.

Tag    Author                  Title                                  Cent.  Lang.   Tokens
[Zer]  Milan Kundera           Žert                                   20     Czech    88435
[Kat]  Kohout                  Katyně                                 20     Czech    99808
[Bab]  Božena Němcová          Babička                                19     Czech    70140
[Ham]  al-Ḥāzimī al-Hamadānī   Al-ʾIʿtibār fi ʼn-nāsiḫ wa-ʼl-mansūḫ   15     Arabic   71482
[Sal]  ibn aṣ-Ṣallāḥ           Maʿrifatu ʾanwāʿi ʿulūmi ʼl-ḥadīṯ      13     Arabic   54915
[Zam]  ibn abī Zamanīn         Uṣūlu ʼs-sunna                         11     Arabic   18607
[Maw]  al-Mawwāq               Tāǧ wa-l-ʿiklīl 2                      15     Arabic  274840
[Baj]  al-Bāǧī al-ʿAndalūsī    Al-Muntaqī 2                           11     Arabic  301232
[Bah]  Manṣūr al-Bahūtī        Šarḥ muntahīyu ʼl-irādāt 2             17     Arabic  263175
The graphical word segmentation was respected when determining the number of syllables in the Arabic texts. In the Czech texts zero syllabic words (e.g. s, z, v, k) were merged with the following words according to the conclusion in Antić et al. (2006), to maintain the compatibility with other studies in this field (e.g. Köhler 2006b).
4 The null hypothesis is: “The distribution of L-motifs measured on the text T is the same as the distribution of L-motifs measured on a pseudotext T’. The pseudotext T’ is created by random transposition of all tokens of the text T within the text T.”
4 Motivation
One of those texts, [Kat], was randomized one million⁵ times and the rank–frequency relation (RFR) of L-motifs was measured for every randomized pseudotext. Then these RFRs were averaged. This average RFR can be seen in the following chart, accompanied by the RFR of the L-motifs measured on the real text:
Fig. 1: RFR of the L-motifs, [Kat].
Visually, the RFR of the L-motifs for the real text does not differ very much from the average pseudotext RFR of the L-motifs. This impression is supported by the chi-squared discrepancy coefficient C = 0.0008.⁶ Also the fact that both the real text's L-motif RFR and the randomized texts' L-motif RFR can be successfully fitted by the right truncated Zipf-Alexeev distribution with similar parameters⁷ encourages us to assume that the RFR of L-motifs is given by the word length distribution in the text.
5 1 million has been arbitrarily chosen as a "sufficiently large number", which makes it the weakest point of our argumentation.
6 C = χ²/N, where N is the sample size (Mačutek – Wimmer 2013).
7 For the real data: a = 0.228; b = 0.1779; n = 651; α = 0.1089; C = 0.0066. For the randomized pseudotexts: a = 0.244; b = 0.1761; n = 651; α = 0.1032; C = 0.0047. Altmann Fitter was used.
Very similar results can be obtained for LL-motifs, LLL-motifs⁸ etc. (the Zipf-Mandelbrot distribution fits the distribution of the higher-order L-motifs better than the right truncated Zipf-Alexeev distribution). But these results do not answer the question asked. The next section proceeds to the testing of the hypothesis.
5 Methods
Not only the L-motifs as a whole, but every single L-motif has a distribution of its frequencies within those one million randomized pseudotexts. For example, the number of pseudotexts (randomized [Bab]) where the L-motif (1, 1, 2, 2, 2) occurred 72 times is 111. From this distribution we can obtain confidence intervals (95%), as depicted in the following chart:
Fig. 2: Distribution of one of the L-motif types in one million pseudotexts (randomized [Bab]) vs. the frequency of the L-motif in the real text.
In this case, the frequency of the motif (1, 1, 2, 2, 2) measured on the real text [Bab] is 145, which is above the upper confidence interval limit (in this case 125). But the frequencies of many other L-motifs are within these intervals, such as the motif (1, 1, 1, 2, 2):
8 For the LL-motifs: C = 0.0009; for the LLL-motifs: C = 0.0011.
Fig. 3: Distribution of one of the L-motif types in one million pseudotexts (randomized [Bab]) vs. the frequency of the L-motif in the real text.
The fact that the frequencies are not independent from each other does not allow us to test them separately as multiple hypotheses, and leads us to merge all values of the distribution into one number. The following method was chosen:
1. The text is randomized many times (in this case one million times) and for each pseudotext the frequencies of L-motifs are measured. The average frequency of every L-motif is calculated. The average frequency of the motif (indexed by the variable i, N being the maximal i) will be referred to as m̄_i.
2. The total distance (D) between the frequencies of each motif (m_i) in the text T and their average frequencies in the randomized pseudotexts (m̄_i) is calculated:

   D = Σ_{i=1}^{N} |m̄_i − m_i|

3. All total distances (D′) between the frequencies of each motif (m′_i) in one million pseudotexts T′ (these pseudotexts must be different from those that were measured in step 1) and their average frequencies in the randomized pseudotexts (m̄_i, which must be the same as in the previous step) are calculated:

   D′ = Σ_{i=1}^{N} |m̄_i − m′_i|

4. The distribution of the D′ distances is obtained.
5. The upper confidence limit is set. A distance D significantly lower than the distances D′ would mean that the real distribution is even closer to the distribution generated by randomly transposing tokens than other distributions measured on randomly transposed tokens; this would not reject the null hypothesis. Considering this, the lower confidence limit is not needed and the test can be assumed to be one-tailed.
6. D is compared with the upper confidence limit. If D is larger than the upper confidence limit, then the null hypothesis is rejected.
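A compact sketch of steps 1–6 follows (our illustration, not the author's program; plain Python, with a much smaller number of randomizations than the one million used in the paper, and with placeholder word lengths instead of a real text).

```python
import random
from collections import Counter

def motif_counts(word_lengths):
    """Count L-motif types (as tuples) in a sequence of word lengths."""
    counts, current = Counter(), []
    for n in word_lengths:
        if current and n < current[-1]:
            counts[tuple(current)] += 1
            current = []
        current.append(n)
    if current:
        counts[tuple(current)] += 1
    return counts

def total_distance(reference, counts):
    """D = sum over motif types of |average frequency - observed frequency|."""
    keys = set(reference) | set(counts)
    return sum(abs(reference.get(k, 0.0) - counts.get(k, 0)) for k in keys)

def d_test(word_lengths, n_random=1000, alpha=0.05, rng=random):
    tokens = list(word_lengths)

    def shuffled_counts():
        perm = tokens[:]
        rng.shuffle(perm)                 # random transposition of all tokens
        return motif_counts(perm)

    # Step 1: average motif frequencies over randomized pseudotexts.
    samples = [shuffled_counts() for _ in range(n_random)]
    keys = set().union(*samples)
    avg = {k: sum(s.get(k, 0) for s in samples) / n_random for k in keys}
    # Step 2: distance D of the real text from the averages.
    d_real = total_distance(avg, motif_counts(tokens))
    # Steps 3-4: distances D' of fresh pseudotexts from the same averages.
    d_rand = sorted(total_distance(avg, shuffled_counts()) for _ in range(n_random))
    # Steps 5-6: one-tailed comparison against the upper 95% limit.
    upper = d_rand[int((1 - alpha) * n_random) - 1]
    return d_real, upper, d_real > upper   # True = reject the null hypothesis

random.seed(7)
lengths = [random.randint(1, 4) for _ in range(2000)]   # placeholder "text"
print(d_test(lengths, n_random=200))
```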
An example result of this method follows (applied to the L-motifs of [Bab]):
Fig. 4: The distribution of the variable D in one million pseudotexts (randomized [Bab]) vs. the value of the D variable in the real text. Here, for example, 4371 out of one million randomized texts have D’ equal to 1500.
As D is larger than the upper confidence limit, we shall assume that the distribution of the L-motifs measured on [Bab] is more distant from the average distribution of L-motifs measured on pseudotexts (derived by the random transposition of tokens in [Bab]) than the distributions of L-motifs measured on other pseudotexts (also derived by the random transposition of tokens in [Bab]).
6 Results
In the following charts, one column represents the D′ values compared to the measured D value (as in Fig. 4, but in a more concise form) for 7 orders of motifs. Confidence limits of the 95% confidence intervals are indicated by the error bars.
Fig. 5: The D-value of the distribution of L-motifs (in the [Zer]) is significantly different from the D’-value measured on randomly transposed tokens of the same text. Notice that the LL-motifs distribution D-value is also close to the upper confidence limit.
Fig. 6: The D-values of the distributions of L-motifs and LL-motifs (in the [Kat]) are significantly different from the D’-values measured on randomly transposed tokens of the same text. Notice that the LL-motifs distribution D-value is very close to the upper confidence limit.
Fig. 7: The D-values of the distributions of L-motifs, LL-motifs and LLLL-motifs (in the [Bab]) are significantly different from the D’-values measured on randomly transposed tokens of the same text. Consider that the LLLL-motifs distribution can be different just by chance.
Fig. 8: The D-values of the distributions of L-motifs, LL-motifs and LLL-motifs (in the [Ham]) are significantly different from the D’-values. The ratios between the D-values and the upper confidence limits are more noticeable than those measured on the Czech texts (the y-axis is log scaled). As the size of these texts is comparable, it seems that the L-motif structure is more substantial for the Arabic texts than for the Czech ones.
Fig. 9: The D-values of the distributions of L-motifs and LL-motifs (in the [Sal]) are significantly different from the D’-values.
Fig. 10: The D-values of the distributions of L-motifs, LL-motifs and LLL-motifs (in the [Zam]) are significantly different from the D′-values despite the fact that the text is relatively short.
Fig. 11: The D-values of the distributions of L-motifs, LL-motifs, LLL-motifs and LLLL-motifs (in the [Maw]) are significantly different from the D′-values despite the fact that the text is relatively large and incoherent.
Fig. 12: The D-values of the distributions of L-motifs, LL-motifs and LLL-motifs (in the [Baj]) are significantly different from the D’-values.
Fig. 13: The D-values of the distributions of L-motifs, LL-motifs and LLL-motifs (in the [Bah]) are significantly different from the D’-values.
Table 2: Exact figures presented in the charts. U stands for the upper confidence limit; values of D exceeding U indicate a significant difference.

                  [Zer]   [Kat]   [Bab]   [Ham]   [Sal]   [Zam]   [Maw]    [Baj]    [Bah]
L-motifs        D 2930    2943    2331    7712    3909    2673    21899    25437    18070
               D′ 1860    2084    1512    1659    1547    867.6   4065.1   4108.6   4094.7
                U 2072    2312    1699    1851    1716    967     4448     4506     4472
LL-motifs       D 1705    1825    1649    2207    1394    1057    6133.9   6798.9   4245.5
               D′ 1584    1635    1391    1364    1171    603.2   2779.0   2980.9   2654.6
                U 1719    1780    1510    1486    1278    666     3023     3234     2895
LLL-motifs      D 839.1   874.5   721.5   1035    631.0   487.2   2444.2   3279.7   1909.4
               D′ 792.8   869.4   690.6   716.9   620.5   339.2   1568.0   1627.4   1551.8
                U 876     959     764     792     687     378     1719     1785     1701
LLLL-motifs     D 474.3   539.3   483.5   441.1   352.1   201.9   1085.3   1072.0   911.3
               D′ 481.3   525.0   417.7   430.9   370.5   195.6   957.3    1000.2   942.0
                U 532     580     463     477     411     219     1051     1097     1034
LLLLL-motifs    D 260.3   297.2   225.6   230.5   209.0   107.3   504.3    572.2    527.3
               D′ 269.4   293.6   232.3   239.4   204.8   105.7   544.0    569.4    534.7
                U 300     327     260     267     229     119     601      629      591
LLLLLL-motifs   D 156.1   156.6   118.7   117.7   113.1   56.6    310.2    332.8    316.8
               D′ 148.5   163.0   127.4   131.6   112.2   56.1    308.4    322.8    303.1
                U 167     183     144     148     127     64      343      359      337
LLLLLLL-motifs  D 82.7    85.9    73.0    75.6    63.3    28.9    166.2    186.5    177.1
               D′ 80.1    88.4    68.5    70.8    59.6    29.0    171.7    180.1    168.6
                U 91      100     78      80      68      33      193      202      189
7 Conclusion The null hypothesis was rejected for the L-motifs (all texts) and for the LL-motifs (except [Zer]); it was not rejected for motifs of higher orders (LLL-motifs etc.) in Czech, but it was rejected also for the LLL-motifs in Arabic (except [Sal]). As the type-token relation and the distribution of lengths are to some extent dependent on the frequency distribution, similar results can be expected for these properties, but proper tests are needed. Our methodology can also be used for testing F-motifs and other types and definitions of motifs. It needs to be said that not rejecting the null hypothesis does not mean that the L-motifs of higher orders are senseless – even if their distribution were inherited from the distribution of word lengths in the text (which is still not certain), it could still be used as a tool for viewing the distribution of word lengths from another point of view. However, it turns out that if we wish to use the L-motifs to examine the syntagmatic relations of the word lengths, the structure inherited from the word length distribution must be taken into account.
Acknowledgements The author is grateful to Reinhard Köhler for helpful comments and suggestions. This work was supported by the project Lingvistická a lexikostatistická analýza ve spolupráci lingvistiky, matematiky, biologie a psychologie, grant no. CZ.1.07/2.3.00/20.0161 which is financed by the European Social Fund and the National Budget of the Czech Republic.
Adam Pawłowski and Maciej Eder
Sequential Structures in “Dalimil’s Chronicle” Quantitative analysis of style variation
1 The problem and the aim of the study When in the 1960s Jerzy Woronczak analysed the text of the Old-Czech Chronicle of So-called Dalimil (Staročeská Kronika tak řečeného Dalimila, henceforth Dalimilova Kronika or Dalimil's Chronicle), he claimed to have observed the process of gradual disintegration of its regular verse texture in subsequent chapters (Woronczak 1993[1963]). He called this phenomenon "prosaisation". To elucidate his findings, he formulated a hypothesis that it is easier to compose a versified text if the historical events that are the subject of the narrative lie beyond the memory of the author's contemporaries (e.g. in the distant mythical past and/or in pagan times). Indeed, as the gap between the years being described in Dalimilova Kronika and the times of the author's life decreases, thus moving from the beginning towards the end of the Chronicle, its style seems increasingly prose-like. Woronczak offers a convincing explanation of this gradual shift in versification. He claims that the contemporaneous events described in the Chronicle were known to the public and for this reason needed to be reported faithfully. In other words, historical facts could not be simply adjusted to the constraints imposed by versification and the annalist had less freedom in selecting the appropriate metrical and/or prosodic structures when composing his text.1 One could put forward another argument in support of this hypothesis: the initial parts of the Chronicle probably echoed traces of oral tradition involving the use of repetitive paratactic segments (referred to as formulae) and equivalent rhythmical structures facilitating memorisation. Towards the end of the Middle Ages, when the culture of script (and a fortiori of print) began to dominate the entire realm of social communication in Europe, formulaic elements started to disappear from literature and historiography.
1 This explanation was transmitted to prof. Pawłowski during a conversation with prof. Woronczak and so far has not been publicised in print.
However, they persisted as a living heritage of the oral tradition in the form of intertextual borrowings in literary and paraliterary works. Woronczak's idea was simple and – in a sense – universal, since the interplay between historical facts and imagination has been present in historiography and historical fiction since time immemorial. His methodological approach was appropriate and convincing as it relied on empirical evidence and statistical tools. However, Woronczak, who was conducting his research in the middle of the twentieth century, could not efficiently work on a digitalised text or take full advantage of computerised numerical methods. In particular, he did not use trend analysis or time series methods to verify the existence of repetitive patterns in the text of the Chronicle. Instead, he analysed the distribution of verse lengths by means of the runs test method, which is useful for testing the randomness of a series of observations. The advantage of this method – quite suitable in this case – is that it does not require the processing of a digitalised text or extensive computations. The objective of this study is thus to scrutinise some of Woronczak's assumptions by means of tests performed on a variety of sequential text structures in the Chronicle. The following data will be analysed: (1) series of chapter lengths (in letters, syllables and words); (2) series of verse lengths (in letters, syllables and words); (3) alternations and correlations of rhyme pairs; (4) quantity-based series of syllables (binary coding); (5) stress-based series of syllables (binary coding). The method we use has been successfully applied in our earlier studies concerning text rhythm, carried out on texts in both modern (French, Polish, Russian) and classical languages (Latin and Greek) (cf. Pawłowski 1998, 2003; Pawłowski, Eder 2001; Pawłowski, Krajewski, Eder 2010; Eder 2008). We intend to explore the rhythm patterns in the Chronicle's text and thus either corroborate – or indeed challenge – Woronczak's hypothesis. The results we expect to obtain apply, first of all, to Dalimil's Chronicle and medieval historiography. But the debate concerning the questions raised by the characteristics of and relations between orality and literacy – historically the two most important forms of verbal interaction – goes far beyond the medieval universum. Our argument is all the more compelling in that it relies on solid, quantitative research.
2 Description of the “Chronicle” The Chronicle of [So-called] Dalimil (Staročeská Kronika tak řečeného Dalimila) is one of the oldest and certainly most important monuments of medieval Czech and is often referred to as the foundation myth of the Czech Kingdom. It was created at the beginning of the fourteenth century by an anonymous author (individual or collective), subsequently named Dalimil. The Chronicle includes original parts, some of them being transcriptions of older Latin texts (e.g. Chronica bohemica), and later additions. It is written in irregular rhymed verse. Since its creation the Chronicle has had two major critical editions. In this study both the one from 1882 (Jireček 1882) and the one from 1988 (Daňhelka et al. 1988) served as data sources. In its definitive version the Chronicle consists of one hundred and six chapters of length varying from sixteen to ninety-four verses. The Chronicle describes the history of the Kingdom of Bohemia from its mythical origins (the construction of the Tower of Babel) to the year 1314. The order of chapters follows the chronological development of the narrated story. There is no evidence, however, that the subsequent chapters or passages were composed according to the timeline of the events described. Although it is hardly imaginable that any historical prose, and especially a medieval chronicle, could be written from its final passages back to the beginning (or, for that matter, obey another sequence of episodes), it is still possible, or perhaps even likely, that some chapters of the Chronicle were composed in non-linear order or that passages were rearranged after their composition. Another fact that one ought to bear in mind is the uncertain authorship of the Chronicle. A mixed authorship cannot be reliably ruled out and even if one assumes the existence of a single author named Dalimil one still does not know how many passages were actually written by the author and how many were adopted from earlier (oral and written, literary and non-literary) texts. The only element that seems beyond all doubt is that the times presented in the opening chapters are located far beyond the memory of the author's contemporaries, gradually approaching the events that could have been experienced firsthand or secondhand by the Chronicle's first readers. Again, it is highly probable that the author's knowledge of the past was founded on texts composed by his/her predecessors (including Latin sources) as well as on the popular oral literature of his/her times. These traces of the oral tradition include, inter alia, metatextual intrusions, and a characteristic narrative redundancy. Yet, the fact that the text may contain borrowings from earlier sources is in no way a critical obstacle to the undertaking of an analysis of sequential
structures in Dalimilova Kronika as a whole. Even if some passages were rewritten or adopted from other authors, the Chronicle can be viewed as a more or less coherent “story” that develops both in historical time and in longitudinal textual representation.
3 Hypothesis It has been postulated that in the text of the Chronicle a gradual shift in versification can be observed, consisting in the growing dissimilation in the length of adjacent verses and chapters. Called by Jerzy Woronczak "prosaisation" (Woronczak 1993[1963]: 70), this process is presumably related to the evolution of the author's linguistic preferences and to his/her attitude towards documented historical facts. It may also reflect the shift from the ancient oral tradition, based on very regular verse structures that enhanced memorisation, to the culture of script, based on literacy and its complex versification rules that satisfied the readers' increasingly refined taste. Apart from the above hypothesis, following directly from Woronczak's considerations, additional experiments in modelling longitudinal structures based on stress and quantity were carried out. The main reason for this was that a similar method, applied to Latin and ancient Greek texts, produced good results (cf. Pawłowski, Eder 2001; Pawłowski, Krajewski, Eder 2010).
4 Method Quantitative text analysis is a combination of applied mathematics and linguistics. It relies on scientific methodology but its epistemological foundations are arguable, since it reduces the complexity of language to a small number of parameters, simplifies the question of its ontological status and often avoids critical dialogue with non-quantitative approaches. Despite these reservations quantitative methods have proven to be efficient in resolving many practical questions. They also support cognitive research, revealing new facets of human verbal communication, especially statistical laws of language and synergetic relations. As Reinhard Köhler says, “A number of linguistic laws has been found during the last decades, some of which could successfully be integrated into a general model, viz. synergetic linguistics. Thus, synergetic linguistics may be considered as a first embryonic linguistic theory. [...] According to the results of the philosophy of science, there is one widely
accepted type of explanation, the deductive-nomologic one.” (Köhler 2005: 764). The present study of Dalimil's Chronicle is a typical example of the approach combining philological apparatus and quantitative techniques of sequential text modelling. On the one hand, our reflections and conclusions rely on an empirical analysis of the medieval versification of a given text, carried out with the help of quantitative techniques. On the other hand, they touch upon the relations between facts and fiction in ancient historiography, as well as on the passage from orality to literacy in European culture. The nature of a versified text requires a combination of conventional statistics and tools appropriate in the analysis of sequential data. Since the publication of the monumental study The Advanced Theory of Language as Choice and Chance by the British physicist and linguist Gustav Herdan (Herdan 1966), the first approach is referred to as the “analysis of language in the mass” and the other one as the “analysis of language in the line” (ibid. 423). The principal tools of sequential analysis are time series analysis (conducted in the time domain and/or in the frequency domain), information theory and the theory of stochastic processes. All these methods have been used so far, but time series analysis in the time domain has turned out to be the most efficient and the most appropriate in text analysis (Pawłowski 1998, 2001). The idea of sequential text analysis can be explained as follows. Let us assume that in a hypothetical language L there are two types of lexical units, denoted A and B, which form sequential structures called texts according to the rules of syntax. Given is a corpus of texts:
(1) AAAAAABBBBBB, (2) AAABBBAAABBB, (3) AABBAABBAABB, (4) ABABABABABAB, (5) AABBABABABAB, (6) BABBAABBBAAA
From the linguistic point of view these sequences should be qualified as different syntagmatic structures, assuming specific meanings in a normal communicative interaction. Nevertheless, conventional statistics such as positional parameters and distributions will not reveal any differences between the sequences, because the frequencies of As and Bs remain the same in every piece of “text”. In such cases, sequential methods taking into account the order of units are more effective than conventional ones (unless frequencies of n-grams are processed).
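A minimal sketch in Python (all function names here are illustrative, not taken from the authors' tools) makes the point concrete: the six “texts” above share exactly the same frequencies of A and B, yet their lag-1 autocorrelations differ sharply once the symbols are coded numerically.

def acf(series, k):
    # Sample autocorrelation at lag k (cf. the estimator r_k in the Appendix).
    n = len(series)
    m = sum(series) / n
    var = sum((x - m) ** 2 for x in series) / n
    cov = sum((series[t] - m) * (series[t + k] - m) for t in range(n - k)) / (n - k)
    return cov / var

texts = ["AAAAAABBBBBB", "AAABBBAAABBB", "AABBAABBAABB",
         "ABABABABABAB", "AABBABABABAB", "BABBAABBBAAA"]

for t in texts:
    series = [1 if ch == "B" else 0 for ch in t]           # binary coding of the units
    print(t, t.count("A"), t.count("B"), round(acf(series, 1), 2))
# Every "text" contains six As and six Bs, yet the lag-1 values range from
# strongly positive (AAAAAABBBBBB) to strongly negative (ABABABABABAB).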
It must be emphasised that syntagmatic dependencies in a line of text, referred to as textual series spanned over syntagmatic time, are not the only longitudinal structures modelled with sequential methods. There also exists a possibility to treat as a time series a sequence of separate texts or long sections, sampled from a large body of texts. Good examples of such series are successive works of a given author, periodical press articles or chronological samples of public discourse (cf. Pawłowski 2006). As in such cases the process of quantification concerns foremostly lexical units, such sequences can be referred to as lexical series. In our study the method of time series analysis was applied. It comprises several components, such as the autocorrelation function (ACF) and the partial autocorrelation function (PACF), as well as several models of stochastic processes, such as AR(p) (autoregressive model of order p) and MA(q) (moving average model of order q). There are also complex models aggregating a trend function, periodical oscillations and a random element in one formula. Periodical oscillations are represented by AR and MA models, while complex models such as ARMA or ARIMA include their combinations. From the linguistic point of view the order of the model (p or q) corresponds to the depth of statistically significant textual memory. This means that p or q elements in a text statistically determine the subsequent element p + 1 or q + 1. As the time series method has been exhaustively described in the existing literature – cf. its applications in economy and engineering (Box, Jenkins 1976; Cryer 1986), in psychology (Glass, Wilson, Gottman 1975; Gottman 1981; Gregson 1983; Gottman 1990), in social sciences (McCleary, Hay 1980) and in linguistics (Pawłowski 1998, 2001, 2005; Pawłowski, Eder 2001; Pawłowski, Krajewski, Eder 2010) – it seems unnecessary to discuss it here in detail (basic formulae are given in the Appendix). It is important, nonetheless, to keep in mind that a series of discrete observations of a random variable is called a time series if it can be spanned over a longitudinal axis (representing real time or any other sequential quantity). A full time series can be decomposed into three components: the trend, the periodical functions (including seasonal oscillations) and the random element called white noise. In linguistics such a full time series is hardly imaginable (albeit theoretically possible). In a single text analysis trends usually do not appear because the observed variables are stationary, i.e. their values do not significantly outrange the limits imposed by the term's frequency.2 If a greater number of texts is put in a sequence according to their chronology of appearance, or according to some principle of internal organisation, trend approximation, calculated for a given parameter, is possible.
2 Our previous research on time series analysis of Latin, Greek, Polish and English verse and prose proves that “text series” are always stationary (cf. Pawłowski 1998, 2001, 2003; Pawłowski, Eder 2001; Pawłowski, Krajewski, Eder 2010).
For example, in the case of Dalimil's Chronicle subsequent passages are represented by their average verse length. Out of all the parameters possibly calculable for longitudinal data, with an exception made for the trend function, which – if it exists – should be estimated separately, the most important one is the autocorrelation of the series (ACF). If it is too small to be considered significant, there is no need to proceed with model estimation or with further steps of the analysis, as it will certainly not yield any noteworthy results. In his study Jerzy Woronczak applied the runs test,3 which is a nonparametric test used in applied research to evaluate the randomness of binary sequences, usually produced as results of experiments or other empirical observations. The advantages and disadvantages of this approach will now be examined. Given is a series of symbols A and B: {AAABBAAAABBABBBAAABBAABBB}, where na is the number of As, nb is the number of Bs and N is the series length (na + nb = N). If the probabilities of occurrence of an A and a B are known, it is possible to estimate the cumulative distribution function of the number of runs, i.e. the successive appearances of As or Bs, and to estimate its confidence interval. Woronczak's idea was to consider a versified text as a “generator” of a random variable related to the structure of subsequent verses. The value of this variable was the verse length expressed in syllables. He considered two or more adjacent verses of the same length as a series corresponding to an AAA... sequence, while clusters composed of varying verse lengths were in his study considered as dissimilated units corresponding to the BBB... sequences. Woronczak called this phenomenon the tendency for verse assimilation or dissimilation respectively. An example of this type of coding (the first twelve lines of chapter 7) can be illustrated as follows:

Když sobě ten div ukázachu,        9   B sequence
na Přěmysłu potázachu,             8   A sequence
které by było znamenie             8
té suché otky vzektvěnie.          7   B sequence
Jim tak Přěmysł otpovědě,          8   A sequence
řka: „To jáz vám všě povědě.       8
Otka suchá jesť znamenie           8
mého chłapieho urozenie.           9   B sequence
Ale že-ť jesť brzo vzkvetła,       7
jakž vem Liubušě była řekła,       9
mój rod z chłapieho pořáda         8   A sequence
dojde králového řáda;              8

3 Also called the Wald-Wolfowitz test.
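The coding illustrated above can be sketched as follows; this is our reading of Woronczak's rule in Python (the function name and the exact grouping of isolated verses into B stretches are assumptions for illustration, not the authors' code).

def code_verse_lengths(verse_lengths):
    # Mark each verse as A (part of a run of two or more equal adjacent
    # lengths) or B (otherwise), then group adjacent verses accordingly,
    # keeping separate A groups for runs of different lengths.
    n = len(verse_lengths)
    is_a = [False] * n
    for i in range(n - 1):
        if verse_lengths[i] == verse_lengths[i + 1]:
            is_a[i] = is_a[i + 1] = True
    groups = []
    for length, a in zip(verse_lengths, is_a):
        label = "A" if a else "B"
        if groups and groups[-1][0] == label and (label == "B" or groups[-1][1][-1] == length):
            groups[-1][1].append(length)
        else:
            groups.append((label, [length]))
    return groups

print(code_verse_lengths([9, 8, 8, 7, 8, 8, 8, 9, 7, 9, 8, 8]))   # the twelve lines above
# [('B', [9]), ('A', [8, 8]), ('B', [7]), ('A', [8, 8, 8]), ('B', [9, 7, 9]), ('A', [8, 8])]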
The weakness of the runs test here is the assumption that a versified text can be treated as a generator of randomness. Without any preliminary tests or assumptions, one has to admit that texts are the fruit of deliberate intellectual activity aimed at creating a coherent and structured message. Versification plays an aesthetic function here and cannot be regarded as random. Another point that seems crucial is that one can easily imagine several series presenting exactly the same parameters (N, na and nb) but at the same time different orderings of the AA... and BB... series. It is for these reasons that the runs test was not used in our study. Instead, trend modelling and time-series tools were applied.
5 Sampling and quantification of data As the Chronicle is composed of one hundred and six chapters or sections, two basic approaches to the question of sampling have been adopted. Respecting the formal divisions of the text, individual chapters were considered as units and were then quantified. In this case entire verses were coded according to their length expressed in the number of syllables (coding in the number of letters and words was carried out additionally for the sake of comparison). The average verse length was then calculated for all the chapters so that the entire text of the Chronicle could be processed as a time series. As for the research into rhythmical patterns in a line of text, two basic prosodic features, i.e. stress and quantity, were considered as potentially significant. Consequently, syllables in randomly selected samples of continuous text were coded in two ways: as stressed (1) or unstressed (0), and as long (1) or short (0). Every sample was composed of ca one hundred and fifty syllables. In this way sixty binary textual series were generated and processed with the time series method. A brief comment on the way of measuring verse length in a medieval chronicle would be of interest here since an inappropriate unit of measure might distort the results despite the sophisticated mathematical apparatus being used for the treatment of data. Although it seems impossible to reproduce the way the text of the Chronicle might have sounded more than seven hundred years ago, some of its metrical features could be reconstructed. The most time-resistant and convincing unit in this case is the syllable, as it is the carrier of prosodic features of rhythm, i.e. stress or quantity. Interestingly enough, our
tests have shown that the coding of the text with graphical units such as letters and words produced very similar results as compared with syllable coding.
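For illustration only, a naive syllable counter based on vowel groups is sketched below in Python; it is not the coding procedure used by the authors, and it ignores, among other things, syllabic r and l, but it reproduces the eight syllables of the verse “na Přěmysłu potázachu” quoted above.

VOWELS = set("aáeéěiíoóuúůyý")

def naive_syllable_count(word):
    # Toy heuristic: each group of adjacent vowel letters counts as one
    # syllable nucleus ("ie", "ou" etc. count once).
    count, prev_vowel = 0, False
    for ch in word.lower():
        is_vowel = ch in VOWELS
        if is_vowel and not prev_vowel:
            count += 1
        prev_vowel = is_vowel
    return count

def verse_length_in_syllables(verse):
    return sum(naive_syllable_count(w) for w in verse.split())

print(verse_length_in_syllables("na Přěmysłu potázachu"))   # 8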
6 Results and interpretation In order to test the hypothesis of the gradual “prosaisation” of the Chronicle, basic statistical measures were calculated. These were meant to provide a preliminary insight into the data and included arithmetic means and standard deviations applied to a sequence of consecutive chapters of the Chronicle. This first step of the procedure was intended to let the data “speak for itself”. Surprisingly, even these most elementary techniques have clearly shown some stylistic changes, or rather a development of style throughout the text of the Chronicle. Our analysis ought to start with the most intuitive measure of the versification structure, namely the constancy of the verse length. Even if the whole Chronicle was composed using irregular metre (bezrozměrný verš), one could expect the line lengths to be roughly justified in individual chapters as well as across the chapters. As evidenced by Fig. 1, the mean number of syllables per line increases significantly, starting with roughly eight in the opening chapters and reaching ca twelve syllables in the final ones, with a very characteristic break in the middle of the Chronicle.4 The same process could be observed when verse lengths were measured in words (Fig. 2) and letters (not shown). The above observations can be represented in terms of the following linear regression model:

$y_i = \beta_1 x_i + \beta_2 + \varepsilon$

where β1, β2 are the coefficients of the model and ε is random noise. Given the considerably clear picture of the increasing length of lines in subsequent chapters, it is not surprising that the following linear regression model (for syllables) fits the data sufficiently well:

$\hat{y}_i = 0.017757\, x_i + 9.28473 + \varepsilon$

4 The trend lines on the plots are generated using the procedure introduced by Cleveland (1979).
Although this model (its estimate marked with a dashed line in Fig. 1) seems to be satisfactory, it does not capture the disturbance of the trend in the middle of the Chronicle (roughly covering chapters 30–60). It is highly probable that we are dealing here with the influence of an external factor on the stable narration of the Chronicle.
Fig. 1: Mean verse length in subsequent chapters of the Chronicle (in syllables).
Fig. 2: Mean verse length in subsequent chapters of the Chronicle (in words).
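A minimal ordinary-least-squares sketch of this trend model is given below in Python. The per-chapter means used here are invented for illustration; the coefficients 0.017757 and 9.28473 quoted above come from fitting all one hundred and six chapters of the Chronicle, which is not reproduced here.

def fit_linear_trend(means):
    # Ordinary least squares for y_i = b1 * x_i + b2, where x_i is the
    # chapter number and y_i the mean verse length of that chapter.
    n = len(means)
    xs = list(range(1, n + 1))
    mx = sum(xs) / n
    my = sum(means) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, means)) / sum((x - mx) ** 2 for x in xs)
    b2 = my - b1 * mx
    return b1, b2

chapter_means = [8.1, 8.3, 8.0, 8.6, 9.2, 9.0, 9.8, 10.4, 10.9, 11.6, 12.0]   # invented values
print(fit_linear_trend(chapter_means))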
Certainly, the above results alone do not indicate any shift from verse to prose per se. However, when this observation is followed by a measure of dispersion, such as the standard deviation of the line length in subsequent chapters or – more accurately – a normalised standard deviation (i.e. coefficient of variation, or standard deviation expressed in the units of the series mean value), the overall picture becomes clearer, because the dispersion of line lengths (Fig. 3) increases significantly as well. The lines not only become longer throughout the whole Chronicle but also more irregular in terms of their syllabic justification, thus resembling prose more than poetry. The results obtained for syllables are corroborated by similar tests performed on letters and words (not shown): the lines of the Chronicle display a decreasing trend of verse justification, whichever units are being analysed. One should also note that the syllabic model of the opening chapters, namely eight syllable lines with a few outliers, is a very characteristic pattern of archaic oral poetry in various literatures and folkloric traditions of Europe. In the case of Dalimil’s Chronicle, it looks as if the author composed the earliest passages using excerpts from oral epic poetry followed by chapters inspired by non-literary (historical) written sources. If this assumption were to be true, it should be confirmed by a closer inspection of other prosodic factors of the Chronicle such as rhymes.
Fig. 3: Coefficient of variation of line lengths in subsequent chapters of the Chronicle (in syllables).
Fig. 4: Rhyme repertory, or the percentage of discrete rhyme pairs used in subsequent chapters of the Chronicle.
Another series of tests addressed the question of rhyme homogeneity and rhyme repertory. Given that oral/archaic poetry tends to reduce alternations in rhymes or to minimise the number of rhymes used (as opposed to elaborate written poetry, usually boasting a rich inventory of rhymes), we examined the complexity of rhyme types in the Chronicle. Firstly, we computed the percentage of discrete rhyme pairs per chapter – an equivalent to the classical type/token ratio – in order to get an insight into the rhyme inventory, with the obvious caveat that this measure is quite sensitive to the sample size. Although the dispersion of results is rather large across subsequent chapters (Fig. 4), the existence of the decreasing trend is evident. This is rather counter-intuitive and contradicts the above-formulated observations since it suggests a greater rhyme inventory at the beginning of the Chronicle, as though the text were more “oral” in its final passages than in its opening chapters. The next measure is the index of rhyme homogeneity, aimed at capturing longer sequences of matching rhyme patterns or clusters of lines using the same rhyme. It was computed as the number of instances where a rhyme followed the preceding one without alternation, divided by the number of lines of a given chapter. Thus, the value of 0.5 means a regular alternation of rhyme pairs, whilst a chapter of n verses using no rhyme changes (i.e. based entirely on one rhyme pattern) will raise the homogeneity index to (n – 1)/n. On theoretical grounds, both measures – rhyme inventory index and homogeneity index – are
correlated to some extent. However, the following example from chapter 38 will explain the difference between these measures: Tehdy sě sta, že kněz Ołdřich łovieše a sám v pustém lesě błúdieše. Když u velikých túhách bieše, około sebe všady zřieše, uzřě, nali-ť stojí dospěłý hrad. Kněz k němu jíti chtieše rád. Ale že cěsty neumějieše a około hłožie husté bieše, ssěd s koně, mečem cěstu proklesti, i počě po ostrvách v hrad lézti; neb sě nemože nikohého dovołati, by v něm liudie byli, nemože znamenati. Most u něho vzveden bieše, a hrad zdi około sebe tvrdé jmějieše. Když kněz s úsilím v hrad vnide a všecky sklepy znide, zetlełé rúcho vidieše, však i čłověka na něm nebieše. Sbožie veliké a vína mnoho naleze. Ohledav hrad, kudyž był všeł, tudyž vyleze. Pak kněz hrad ten da pánu tomu, jemuž Přiema diechu; proto tomu hradu Přimda vzděchu.
In the above passage there are twelve lines that follow the preceding line's rhyme; however, the number of unique rhyme patterns is seven, as the rhyme -ieše returns several times. Hence, both indexes will differ significantly and their values will be 0.318 and 0.545 (inventory and homogeneity respectively). Fig. 5 shows the homogeneity index for the whole Chronicle. Despite some interesting outliers a general increasing trend is again easily noticeable: the opening chapters not only tend to use more rhyme types as compared with the final passages, but also avoid aggregating rhymes into longer clusters. To keep things simple, one might say that rhyme diversity and copiousness (traditionally associated with elaborate written poetry) turn slowly into repetitiveness and formulaic “dullness” (usually linked to archaic oral poetry). Yet another simple test confirms these results: out of all the rhymes used in the Chronicle we computed the relative frequencies of the most frequent ones, dividing the number of a particular rhyme's occurrences by the total number of verses in a chapter. The summarised frequencies of five predominant groups of rhymes are represented in Fig. 6, which shows very clearly that a variety of rhyme patterns of the initial passages is systematically displaced by a limited
number of line endings. It also means a syntactic simplification of the final parts of the Chronicle, since the overwhelming majority of rhymes are in fact a few inflected morphemes of verbs (-echu|-ichu|-achu, -eše|-aše, -ati|-eti, -idu|-edu| -adu) and a genitive singular morpheme for masculine nouns, adjectives and pronouns (-eho|-oho).
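Both indices can be sketched as follows in Python; the function and the rhyme labelling are illustrative assumptions, since the identification of rhymes and rhyme pairs is carried out beforehand by the authors. With a suitable labelling of the chapter 38 passage, the same computation yields 7/22 ≈ 0.318 and 12/22 ≈ 0.545, as stated above.

def rhyme_indices(line_rhymes):
    # line_rhymes: one rhyme label per verse line, e.g. the ending used to
    # rhyme that line ("-ieše", "-achu", ...).
    n = len(line_rhymes)
    inventory = len(set(line_rhymes)) / n                      # unique rhymes / lines
    repeats = sum(1 for i in range(1, n) if line_rhymes[i] == line_rhymes[i - 1])
    homogeneity = repeats / n                                  # lines continuing the previous rhyme
    return inventory, homogeneity

print(rhyme_indices(["-ieše", "-ieše", "-ieše", "-ieše", "-ad", "-ad"]))
# (0.33..., 0.67...): two distinct rhymes, four lines repeating the previous rhyme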
Fig. 5: Clustering of homogeneous rhymes in subsequent chapters of the Chronicle.
Fig. 6: Relative frequency of the most frequent rhyme types in subsequent chapters of the Chronicle.
Next come more sophisticated measures that may betray internal correlations between subsequent lines of the Chronicle, namely the aforementioned time series analysis techniques. The aim of the following tests is to examine whether short lines are likely to be followed by long ones and vice versa (Hrabák 1959). Certainly, the alternative hypothesis is that no statistically significant correlations exist between the consecutive lines of the Chronicle. To test this suggestion we first computed the autocorrelation function (ACF) for subsequent line lengths, expressed in syllables. Next came the partial autocorrelation function (PACF), followed by the stage of model estimation and evaluation (where applicable, because most of the results proved insufficiently significant and thus unsuitable for modelling). The whole procedure was replicated independently for individual chapters.5 The results of the aforementioned procedures are interesting, yet statistically not very significant. As evidenced by Fig. 7 (produced for chapter 1), the autocorrelation function reveals no important values at lags higher than 1, the only significant result appearing at lag 1. It means that a minor negative correlation between any two neighbouring lines (i.e. at lag 1) can be observed in the dataset. To interpret this fact in linguistic terms, any given verse is slightly affected by the preceding one but does not depend at all on the broader preceding context. The ACF function applied to the remaining chapters corroborates the above results. Still, the only value that turned out to be significant is negative in some chapters (Fig. 7) and positive in others (Fig. 8), and it generally yields a substantial variance. It therefore seemed reasonable to extract the lag-1 values throughout the consecutive chapters and to represent them in one figure (Fig. 9). This procedure reveals some further regularities; namely, the correlation between neighbouring lines slowly but systematically increases. It means, once again, that the Chronicle becomes somewhat more rhythmical in its last chapters despite – paradoxically – the prose-like instability of the verse length. In any attempt at measuring textual rhythmicity the most obvious type of data that should be assessed is the sequence of marked and unmarked syllables or other alternating linguistic units (such as metrical feet). In the case of Dalimil's Chronicle – or the Old Czech language in general – the feature responsible for rhythmicity is the sequence of stressed and unstressed syllables on the one hand, and the sequence of long and short syllables on the other hand. To test possible rhythmical patterns in the Chronicle, we prepared two independent datasets containing a binary coding – one for stress, the other for quantity.
5 Since the ARIMA modelling is a rather time-consuming task, we decided to assess every second chapter instead of computing the whole dataset. We believe, however, that this approach gives us a good approximation of the (possible) regularities across chapters.
Fig. 7: Autocorrelation function (ACF) of lengths of subsequent lines of chapter 1.
Fig. 8: Autocorrelation function (ACF) of lengths of subsequent lines of chapter 41.
Fig. 9: Results of the autocorrelation function values (lag 1 only) for the consecutive chapters of the Chronicle.
The results of ACF function applied to a sample chapter are shown in Fig. 10 (stress) and Fig. 11 (quantity). Both figures are rather straightforward to interpret. The stress series contains one predominant and stable value at lag 1, and no significant values at higher lags. This means a strong negative correlation between any two adjacent syllables. This phenomenon obviously reflects a predominant alternation of stressed and unstressed syllables in the sample; it is not surprising to see the same pattern in all the remaining samples. Needless to say, no stylistic shift could be observed across consecutive chapters. These findings seem to indicate that stress alternation is a systemic language phenomenon with no bearing on a stylistic analysis of the Chronicle, and the quantity alternation in a line of text does not generate any significant information. Having said that, both measures should produce more interesting results in a comparative analysis of poetic genres in modern Czech. The quantity-based series reveals a fundamentally different behaviour of the sequence of long and short syllables. The actual existence of quantity in the Old Czech language seems to be totally irrelevant as a prosodic feature in poetry. The values shown in Fig. 11 evidently suggest no autocorrelation at any lag. It means that the time series is simply unpredictable; in linguistic terms, it means no rhythm in the dataset.
Fig. 10: Autocorrelation function (ACF) of a sequence of stressed and non-stressed syllables from chapter 3.
Fig. 11: Autocorrelation function (ACF) of a sequence of long and short syllables from chapter 7.
Significantly enough, both prosodic features that one would associate with the techniques of poetic composition, namely quantity and stress, proved to be independent of any authorial choice and did not play any role in the gradual stylistic shift in the Chronicle.
7 Conclusions We have verified the presence of latent rhythmic patterns and have partially corroborated the hypothesis advanced by Jerzy Woronczak in 1963. However, the results obtained raise further questions prompted by our analysis of the Chronicle. One of these questions concerns the role played by the underlying oral culture in the process of text composition in the Middle Ages. Although in Woronczak's work there is no direct reference to the works by A. Lord, M. Parry, W. Ong or E. Havelock, all of whom were involved in the study of orality in human communication, the Polish scholar's reasoning is largely consistent with theirs. Woronczak agrees with the assumption that in oral literature subsequent segments (to be transformed in future into verses of written texts) tend to keep a repetitive and stable rhythm, cadenced by syllable stress, quantity, identical metrical feet or verse length. The reason for this apparent monotony is that illiterate singers of tales and their public appreciated rhythmicity. While the former took advantage of its mnemonic properties that helped them remember hundreds or even thousands of verses, the latter enjoyed its hypnotising power during live performances. Variable and irregular segments marking ruptures of rhythm occurred only in critical moments of the story. However, the bare opposition orality vs. literacy does not suffice to explain the quite complex, as it turns out, stylistic shift in the Chronicle. Without rejecting the role of orality as one of the underlying factors here, we propose a fairly straightforward hypothesis that seems to explain the examined data satisfactorily. Since in the initial chapters of the Dalimilova Kronika the archaic eight-syllable verse type was used, it is plausible that this part of the Chronicle was inspired by the oral poetry of the author's predecessors (or even copied in some passages). Probably, when the author took on some elements of the oral heritage of his/her times he wished to keep the original syllabic structure, rhymes and line lengths. This was possible – however difficult – for a while, yet the poetic form eventually proved to be too demanding. At some point, as we surmise, the author realised that creating or supplementing history would require more and more aesthetic compromises; thus, he put less effort into the poetic form of the text's final chapters. Certainly, other explanations cannot be ruled out. The most convincing one is the hypothesis of the collaborative authorship of the Chronicle. However, had there been two (or more) authors, a sudden stylistic break somewhere in the Chronicle, rather than the gradual shift from poetry to prose, would be more likely.
Hence, the results of our study go far beyond the medieval universum or routine stylometric research, partaking in the debate concerning the questions raised by the opposition between orality and literacy in culture as well as the opposition between fact and fiction in historical narrative, omnipresent in the text-audio-visual space. Our argument is all the more compelling in that it relies on empirical data and on solid, quantitative methodology.
Source Texts
Jireček, Josef (ed.). 1882. Rýmovaná kronika česká tak řečeného Dalimila. In Prameny dějin českých 3(1). Praha.
Daňhelka, Jiří, Karel Hádek, Bohuslav Havránek & Naděžda Kvítková (eds.). 1988. Staročeská kronika tak řečeného Dalimila: Vydání textu a veškerého textového materiálu. Vol. 1–2. Praha: Academia.
References
Box, George E.P. & Gwilym M. Jenkins. 1976. Time series analysis: Forecasting and control. San Francisco: Holden-Day.
Cleveland, William S. 1979. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74(368). 829–836.
Cryer, Jonathan. 1986. Time series analysis. Boston: Duxbury Press.
Eder, Maciej. 2008. How rhythmical is hexameter: A statistical approach to ancient epic poetry. Digital Humanities 2008: Book of Abstracts, 112–114. Oulu: University of Oulu.
Glass, Gene V., Victor L. Wilson & John M. Gottman. 1975. Design and analysis of time-series experiments. Boulder (CO): Colorado Associated University Press.
Gottman, John M. 1981. Time-series analysis: A comprehensive introduction for social scientists. Cambridge: Cambridge University Press.
Gottman, John M. 1990. Sequential analysis. Cambridge: Cambridge University Press.
Gregson, Robert A.M. 1983. Time series in psychology. Hillsdale (NJ): Lawrence Erlbaum Associates.
Herdan, Gustav. 1966. The advanced theory of language as choice and chance. Berlin: Springer.
Hrabák, Josef. 1959. Studie o českém verši. Praha: SPN.
Köhler, Reinhard. 2005. Synergetic linguistics. In Reinhard Köhler, Gabriel Altmann & Rajmund G. Piotrowski (eds.), Quantitative linguistics. An international handbook, 760–774. Berlin & New York: de Gruyter.
McCleary, Richard & Richard A. Hay. 1980. Applied time series analysis for the social sciences. Beverly Hills: Sage Publications.
Pawłowski, Adam. 1998. Séries temporelles en linguistique. Avec application à l'attribution de textes: Romain Gary et Émile Ajar. Paris & Genève: Champion-Slatkine.
Pawłowski, Adam. 2003. Sequential analysis of versified texts in fixed- and free-accent languages: Example of Polish and Russian. In Lew N. Zybatow (ed.), Europa der Sprachen: Sprachkompetenz – Mehrsprachigkeit – Translation. Teil II: Sprache und Kognition, 235–246. Frankfurt/M.: Peter Lang.
Pawłowski, Adam. 2005. Modelling of the sequential structures in text. In Reinhard Köhler, Gabriel Altmann & Rajmund G. Piotrowski (eds.), Quantitative linguistics. An international handbook, 738–750. Berlin & New York: de Gruyter.
Pawłowski, Adam. 2006. Chronological analysis of textual data from the “Wrocław corpus of Polish”. Poznań Studies in Contemporary Linguistics 41. 9–29.
Pawłowski, Adam & Maciej Eder. 2001. Quantity or stress? Sequential analysis of Latin prosody. Journal of Quantitative Linguistics 8(1). 81–97.
Pawłowski, Adam, Marek Krajewski & Maciej Eder. 2010. Time series modelling in the analysis of Homeric verse. Eos 47(1). 79–100.
Woronczak, Jerzy. 1963. Zasada budowy wiersza “Kroniki Dalimila” [Principle of verse building in the “Kronika Dalimila”]. Pamiętnik Literacki 54(2). 469–478.
Appendix
A series of discrete observations x_i, representing the realisations of a random variable X_t, is called a time series if it can be spanned over a longitudinal axis (corresponding to time or any other sequential quantity):

$X_t = \{x_1, x_2, \ldots, x_N\}$   (1)

The mean $\mu_x$ of a time series $X_t$ is defined as:

$\mu_x = E(X_t)$   (2)

estimated by:

$m_x = \frac{1}{N}\sum_{t=1}^{N} x_t$   (3)

where N – series length, $x_t$ – value of the series at the moment or position t.

The variance of a time series is defined as:

$\sigma_x^2 = E(X_t - \mu_x)^2$   (4)

estimated by (using the same notation):

$s_x^2 = \frac{1}{N}\sum_{t=1}^{N} (x_t - m_x)^2$   (5)

The autocovariance $\gamma_k$ of a time series $X_t$ is defined as:

$\gamma_k = E\{(X_t - \mu_x)(X_{t+k} - \mu_x)\}$   (6)

where $\mu_x$ – series theoretical mean, k – lag separating the values $x_t$. The autocovariance is estimated by:

$c_k = \frac{1}{N-k}\sum_{t=1}^{N-k} (x_t - m_x)(x_{t+k} - m_x)$   (7)

where N – series length, $x_t$ – value of the series at the moment or position t, k – lag separating the values $x_t$.

The autocorrelation function of order k of a series is defined as:

$\rho_k = \frac{\gamma_k}{\gamma_0} = \frac{\gamma_k}{\sigma_x^2}$   (8)

where $\gamma_k$ – autocovariance function (if $k = 0$ then $\gamma_k = \sigma_x^2$), k – lag. The ACF is estimated by the following formula:

$r_k = \frac{c_k}{c_0} = \frac{c_k}{s_x^2}$   (9)

ACF is the most important parameter in time series analysis. If it is non-random, estimation of stochastic models is justified and purposeful. If this is not the case, it proves that there is no deterministic component (metaphorically called “memory”) in the data. There are three basic models of time series, as well as many complex models. As no complex models were discovered in the Chronicle of Dalimil, only the simple ones will be presented. A random process consists of independent realisations of a variable $X_t = \{e_1, e_2, \ldots\}$. It is characterised by zero autocovariance and zero autocorrelation. By analogy to the spectrum of light, the values $e_t$, which have a normal distribution, are called white noise.

An autoregressive process of order p consists of the subsequent values $x_t$ defined as:

$x_t = a_1 x_{t-1} + a_2 x_{t-2} + \cdots + a_p x_{t-p} + e_t$   (10)

where $a_i$ – model coefficients, $e_i$ – normally distributed random values, p – order of the AR process.

A moving average process of order q consists of the subsequent values $x_t$ defined as:

$x_t = e_t - b_1 e_{t-1} - b_2 e_{t-2} - \cdots - b_q e_{t-q}$   (11)

where $b_i$ – model coefficients, $e_i$ – normally distributed random values, q – order of the MA process.

Random, autoregressive and moving average processes (without trend) can be aggregated in one complex model, called ARMA (the same notation as above):

$x_t = a_1 x_{t-1} + a_2 x_{t-2} + \cdots + a_p x_{t-p} + e_t - b_1 e_{t-1} - b_2 e_{t-2} - \cdots - b_q e_{t-q}$   (12)
The type of the model is determined by the behaviour of the ACF and PACF (partial autocorrelation) functions, as their values may either decrease slowly or truncate abruptly (cf. Pawłowski 2001: 71). It should be emphasised, however, that in many cases of text analysis the ACF function alone is quite sufficient to analyse and linguistically interpret the data. This is precisely the case in the present study of the text of Dalimil's Chronicle.
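As a minimal illustration of how the ACF identifies a simple model, the Python sketch below (illustrative only, not part of the original study) simulates an AR(1) process as in equation (10) and recovers its coefficient from the lag-1 autocorrelation; for an AR(1) process the theoretical ACF decays geometrically, rho_k = a1^k.

import random

def simulate_ar1(a1, n, seed=0):
    # x_t = a1 * x_{t-1} + e_t with Gaussian white noise e_t.
    rng = random.Random(seed)
    x, xs = 0.0, []
    for _ in range(n):
        x = a1 * x + rng.gauss(0.0, 1.0)
        xs.append(x)
    return xs

def acf(series, k):
    # Sample autocorrelation at lag k (estimator r_k, formula (9)).
    n = len(series)
    m = sum(series) / n
    var = sum((v - m) ** 2 for v in series) / n
    cov = sum((series[t] - m) * (series[t + k] - m) for t in range(n - k)) / (n - k)
    return cov / var

xs = simulate_ar1(0.6, 5000)
print([round(acf(xs, k), 2) for k in (1, 2, 3)])   # roughly 0.6, 0.36, 0.22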
Taraka Rama and Lars Borin
Comparative Evaluation of String Similarity Measures for Automatic Language Classification 1 Introduction Historical linguistics, the oldest branch of modern linguistics, deals with language-relatedness and language change across space and time. Historical linguists apply the widely-tested comparative method (Durie and Ross, 1996) to establish relationships between languages, to posit a language family and to reconstruct the proto-language for a language family.1 Although historical linguistics has parallel origins with biology (Atkinson and Gray, 2005), unlike the biologists, mainstream historical linguists have seldom been enthusiastic about using quantitative methods for the discovery of language relationships or investigating the structure of a language family, except for Kroeber and Chrétien (1937) and Ellegård (1959). A short period of enthusiastic application of quantitative methods initiated by Swadesh (1950) ended with the heavy criticism levelled against it by Bergsland and Vogt (1962). The field of computational historical linguistics did not receive much attention again until the beginning of the 1990s, with the exception of two noteworthy doctoral dissertations, by Sankoff (1969) and Embleton (1986). In traditional lexicostatistics, as introduced by Swadesh (1952), distances between languages are based on human expert cognacy judgments of items in standardized word lists, e.g., the Swadesh lists (Swadesh, 1955). In the terminology of historical linguistics, cognates are related words across languages that can be traced directly back to the proto-language. Cognates are identified through regular sound correspondences. Sometimes cognates have similar surface form and related meanings. Examples of such revealing cognates are: English ~ German night ~ Nacht ‘night’ and hound ~ Hund ‘dog’. If a word has undergone many changes then the relatedness is not obvious from visual inspection and one needs to look into the history of the word to understand exactly the sound changes which resulted in the synchronic form.
1 The Indo-European family is a classical case of the successful application of the comparative method, which establishes a tree relationship between some of the most widely spoken languages in the world.
For instance, the English ~ Hindi wheel ~ chakra ‘wheel’ are cognates and can be traced back to the proto-Indo-European root kʷekʷlo-. Recently, some researchers have turned to approaches more amenable to automation, hoping that large-scale lexicostatistical language classification will thus become feasible. The ASJP (Automated Similarity Judgment Program) project2 represents such an approach, where automatically estimated distances between languages are provided as input to phylogenetic programs originally developed in computational biology (Felsenstein, 2004), for the purpose of inferring genetic relationships among organisms. As noted above, traditional lexicostatistics assumes that the cognate judgments for a group of languages have been supplied beforehand. Given a standardized word list, consisting of 40–100 items, the distance between a pair of languages is defined as the percentage of shared cognates subtracted from 100%. This procedure is applied to all pairs of languages under consideration, to produce a pairwise inter-language distance matrix. This inter-language distance matrix is then supplied to a tree-building algorithm such as Neighbor-Joining (NJ; Saitou and Nei, 1987) or a clustering algorithm such as Unweighted Pair Group Method with Arithmetic Mean (UPGMA; Sokal and Michener, 1958) to infer a tree structure for the set of languages. Swadesh (1950) applies essentially this method – although completely manually – to the Salishan languages. The resulting “family tree” is reproduced in figure 1. The crucial element in these automated approaches is the method used for determining the overall similarity between two word lists.3 Often, this is some variant of the popular edit distance or Levenshtein distance (LD; Levenshtein, 1966). LD for a pair of strings is defined as the minimum number of symbol (character) additions, deletions and substitutions needed to transform one string into the other. A modified LD (called LDND) is used by the ASJP consortium, as reported in their publications (e.g., Bakker et al. 2009 and Holman et al. 2008).
2 http://email.eva.mpg.de/~wichmann/ASJPHomePage.htm
3 At this point, we use “word list” and “language” interchangeably. Strictly speaking, a language, as identified by its ISO 639-3 code, can have as many word lists as it has recognized (described) varieties, i.e., doculects (Nordhoff and Hammarström, 2011).
2 Related Work Cognate identification and tree inference are closely related tasks in historical linguistics. Considering each task as a computational module would mean that each cognate set identified across a set of tentatively related languages feeds into the refinement of the tree inferred at each step. In a critical article, Nichols (1996) points out that the historical linguistics enterprise, since its beginning, always used a refinement procedure to posit relatedness and tree structure for a set of tentatively related languages.4 The inter-language distance approach to tree-building is incidentally straightforward and comparably accurate in comparison to the computationally intensive Bayesian tree-inference approach of Greenhill and Gray (2009).5 The inter-language distances are either an aggregate score of the pairwise item distances or based on a distributional similarity score. The string similarity measures used for the task of cognate identification can also be used for computing the similarity between two lexical items for a particular word sense.
2.1 Cognate identification The task of automatic cognate identification has received a lot of attention in language technology. Kondrak (2002a) compares a number of algorithms based on phonetic and orthographical similarity for judging the cognateness of a word pair. His work surveys string similarity/distance measures such as edit distance, Dice coefficient, and longest common subsequence ratio (LCSR) for the task of cognate identification. It has to be noted that, until recently (Hauer and Kondrak, 2011; List, 2012), most of the work in cognate identification focused on determining the cognateness between a word pair and not among a set of words sharing the same meaning. Ellison and Kirby (2006) use Scaled Edit Distance (SED)6 for computing intralexical similarity for estimating language distances based on the dataset of Indo-European languages prepared by Dyen et al. (1992).
4 This idea is quite similar to the well-known Expectation-Maximization paradigm in machine learning. Kondrak (2002b) employs this paradigm for extracting sound correspondences by pairwise comparisons of word lists for the task of cognate identification. A recent paper by Bouchard-Côté et al. (2013) employs a feed-back procedure for the reconstruction of Proto-Austronesian with great success.
5 For a comparison of these methods, see Wichmann and Rama, 2014.
6 SED is defined as the edit distance normalized by the average of the lengths of the pair of strings.
The language distance matrix is then given as input to the NJ algorithm – as implemented in the PHYLIP package (Felsenstein, 2002) – to infer a tree for 87 Indo-European languages. They make a qualitative evaluation of the inferred tree against the standard Indo-European tree.
Fig. 1: Salishan language family box-diagram from Swadesh 1950.
Kondrak (2000) developed a string matching algorithm based on articulatory features (called ALINE) for computing the similarity between a word pair. ALINE was evaluated for the task of cognate identification against machine learning algorithms such as Dynamic Bayesian Networks and Pairwise HMMs for automatic cognate identification (Kondrak and Sherif, 2006). Even though the
approach is technically sound, it suffers due to the very coarse phonetic transcription used in Dyen et al.'s Indo-European dataset.7 Inkpen et al. (2005) compared various string similarity measures for the task of automatic cognate identification for two closely related languages: English and French. The paper examines an impressive array of string similarity measures. However, the results are very language-specific, and it is not clear that they can be generalized even to the rest of the Indo-European family. Petroni and Serva (2010) use a modified version of Levenshtein distance for inferring the trees of the Indo-European and Austronesian language families. LD is usually normalized by the maximum of the lengths of the two words to account for length bias. The length-normalized LD can then be used in computing distances between a pair of word lists in at least two ways: LDN and LDND (Levenshtein Distance Normalized Divided). LDN is computed as the sum of the length-normalized Levenshtein distances between the words occupying the same meaning slot divided by the number of word pairs. Similarity between phoneme inventories and chance similarity might cause a pair of not-so-related languages to show up as related languages. This is compensated for by computing the length-normalized Levenshtein distance between all the pairs of words occupying different meaning slots and summing the different word-pair distances. The summed Levenshtein distance between the words occupying the same meaning slots is divided by the sum of Levenshtein distances between different meaning slots. The intuition behind this idea is that if two languages are shown to be similar (small distance) due to accidental chance similarity then the denominator would also be small and the ratio would be high. If the languages are not related and also share no accidental chance similarity, then the distance as computed in the numerator would be unaffected by the denominator. If the languages are related then the distance as computed in the numerator is small anyway, whereas the denominator would be large since the languages are similar due to genetic relationship and not from chance similarity. Hence, the final ratio would be smaller than the original distance given in the numerator. Petroni and Serva (2010) claim that LDN is more suitable than LDND for measuring linguistic distances. In reply, Wichmann et al. (2010a) empirically show that LDND performs better than LDN for distinguishing pairs of languages belonging to the same family from pairs of languages belonging to different families.

7 The dataset contains 200-word Swadesh lists for 95 language varieties. Available on http://www.wordgumbo.com/ie/cmp/index.htm.
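To make the two normalizations concrete, the following Python sketch (our own illustration, not the ASJP program; the two word lists are assumed to be aligned by meaning slot, with empty strings for missing items) computes LDN and LDND between a pair of word lists:

```python
from itertools import product

def norm_ld(a, b):
    """Levenshtein distance between a and b, normalized by the longer length."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))      # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b))

def ldn(list1, list2):
    """Mean normalized distance over word pairs occupying the same meaning slot."""
    pairs = [(w1, w2) for w1, w2 in zip(list1, list2) if w1 and w2]
    return sum(norm_ld(w1, w2) for w1, w2 in pairs) / len(pairs)

def ldnd(list1, list2):
    """LDN divided by the mean distance over words occupying different meaning slots."""
    cross = [norm_ld(w1, w2)
             for (i, w1), (j, w2) in product(enumerate(list1), enumerate(list2))
             if i != j and w1 and w2]
    return ldn(list1, list2) / (sum(cross) / len(cross))
```

Applied to every pair of word lists, the same two functions fill the inter-language distance matrix that is later fed to a tree inference algorithm.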
As noted by Jäger (2014), Levenshtein distance only matches strings based on symbol identity, whereas a graded notion of sound similarity would be a closer approximation to historical linguistic practice and also achieves better results at the task of phylogenetic inference. Jäger (2014) uses empirically determined weights between symbol pairs (from computational dialectometry; Wieling et al. 2009) to compute distances between ASJP word lists and finds an improvement over LDND at the task of internal classification of languages.
2.2 Distributional similarity measures

Huffman (1998) computes pairwise language distances based on character n-grams extracted from Bible texts in European and American Indian languages (mostly from the Mayan language family). Singh and Surana (2007) use character n-grams extracted from raw comparable corpora of ten languages from the Indian subcontinent for computing the pairwise language distances between languages belonging to two different language families (Indo-Aryan and Dravidian). Rama and Singh (2009) introduce a factored language model based on articulatory features to induce an articulatory-feature-level n-gram model from the dataset of Singh and Surana (2007). The feature n-grams of each language pair are compared using a distributional similarity measure called cross-entropy to yield a single point distance between the language pair. These scholars find that the distributional distances agree with the standard classification to a large extent. Inspired by the development of tree similarity measures in computational biology, Pompei et al. (2011) evaluate the performance of LDN vs. LDND on the ASJP and Austronesian Basic Vocabulary databases (Greenhill et al., 2008). They compute NJ and Minimum Evolution trees8 for LDN as well as LDND distance matrices. They compare the inferred trees to the classification given in the Ethnologue (Lewis, 2009) using two different tree similarity measures: Generalized Robinson-Foulds distance (GRF; a generalized version of Robinson-Foulds [RF] distance; Robinson and Foulds 1979) and Generalized Quartet distance (GQD; Christiansen et al. 2006). GRF and GQD are specifically designed to account for the polytomous nature – a node having more than two children – of the Ethnologue trees. For example, the Dravidian family tree shown in figure 3 exhibits four branches radiating from the top node. Finally, Huff and Lonsdale (2011) compare the NJ trees from ALINE and LDND distance metrics to Ethnologue trees using RF distance. The authors did not find any significant improvement from using a linguistically well-informed similarity measure such as ALINE over LDND.

8 A tree-building algorithm closely related to NJ.
3 Is LD the best string similarity measure for language classification?

LD is only one of a number of string similarity measures used in fields such as language technology, information retrieval, and bio-informatics. Beyond the works cited above, to the best of our knowledge, there has been no study comparing different string similarity measures on something like the ASJP dataset in order to determine their relative suitability for genealogical classification.9 In this paper we compare various string similarity measures10 for the task of automatic language classification. We evaluate their effectiveness in language discrimination through a distinctiveness measure, and in genealogical classification by comparing the distance matrices to the language classifications provided by WALS (World Atlas of Language Structures; Haspelmath et al., 2011)11 and Ethnologue. Consequently, in this article we attempt to provide answers to the following questions, for the numerous string similarity measures listed below in section 5:
– Which measure is best suited for the task of distinguishing related languages from unrelated languages?
– Which measure is best suited for the task of internal language classification?
– Is there a procedure for determining the best string similarity measure?
9 One reason for this may be that the experiments are computationally demanding, requiring several days for computing a single measure over the whole ASJP dataset.
10 A longer list of string similarity measures is available at: http://www.coli.uni-saarland.de/courses/LT1/2011/slides/stringmetrics.pdf
11 WALS does not provide a classification for all the languages of the world. The ASJP consortium gives a WALS-like classification to all the languages present in their database.
4 Database and language classifications

4.1 Database

The ASJP database offers a readily available, if minimal, basis for massive cross-linguistic investigations. The ASJP effort began with a small dataset of 100-word lists for 245 languages. These languages belong to 69 language families. Since its first version presented by Brown et al. (2008), the ASJP database has been going through a continuous expansion, to include in the version used here (v. 14, released in 2011)12 more than 5500 word lists representing close to half the languages spoken in the world (Wichmann et al., 2011). Because of the findings reported by Holman et al. (2008), the later versions of the database aimed to cover only the 40-item most stable Swadesh sublist, and not the 100-item list. Each lexical item in an ASJP word list is transcribed in a broad phonetic transcription known as ASJP Code (Brown et al., 2008). The ASJP code consists of 34 consonant symbols, 7 vowels, and four modifiers (∗, ”, ∼, $), all rendered by characters available on the English version of the QWERTY keyboard. Tone, stress, and vowel length are ignored in this transcription format. The three modifiers combine symbols to form phonologically complex segments (e.g., aspirated, glottalized, or nasalized segments).
In order to ascertain that our results would be comparable to those published by the ASJP group, we successfully replicated their experiments for the LDN and LDND measures using the ASJP program and the ASJP dataset version 12 (Wichmann et al., 2010b).13 This database comprises reduced (40-item) Swadesh lists for 4169 linguistic varieties. All pidgins, creoles, mixed languages, artificial languages, proto-languages, and languages extinct before 1700 CE were excluded from the experiment, as were language families represented by fewer than 10 word lists (Wichmann et al., 2010a),14 as well as word lists containing fewer than 28 words (70% of 40). This leaves a dataset with 3730 word lists. It turned out that an additional 60 word lists did not have English glosses for the items, which meant that they could not be processed by the program, so these languages were also excluded from the analysis.
All the experiments reported in this paper were performed on a subset of version 14 of the ASJP database, whose language distribution is shown in figure 2.15 The database has 5500 word lists. The same selection principles that were used for version 12 (described above) were applied for choosing the languages to be included in our experiments. The final dataset for our experiments has 4743 word lists for 50 language families. We use the family names of the WALS (Haspelmath et al., 2011) classification.

12 The latest version is v. 16, released in 2013.
13 The original Python program was created by Hagen Jung. We modified the program to handle the ASJP modifiers.
14 The reason behind this decision is that correlations resulting from smaller samples (less than 40 language pairs) tend to be unreliable.
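As a rough illustration of this selection procedure (the data structures and the constant names below are our own assumptions, not the interface of the ASJP program), the filtering can be sketched as:

```python
from collections import Counter

MIN_ITEMS = 28              # at least 70% of the 40-item list must be attested
MIN_LISTS_PER_FAMILY = 10   # families with fewer word lists are dropped

def select_word_lists(word_lists):
    """word_lists maps (family, doculect) -> {gloss: transcribed form or ''}."""
    # Keep only word lists with enough attested items.
    kept = {key: forms for key, forms in word_lists.items()
            if sum(1 for f in forms.values() if f) >= MIN_ITEMS}
    # Keep only families that are still represented by enough word lists.
    family_sizes = Counter(family for family, _ in kept)
    return {key: forms for key, forms in kept.items()
            if family_sizes[key[0]] >= MIN_LISTS_PER_FAMILY}
```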
Fig. 2: Distribution of languages in ASJP database (version 14).
The WALS classification is a two-level classification where each language belongs to a genus and a family. A genus is a genetic classification unit introduced by Dryer (2000) and consists of a set of languages supposedly descended from a common ancestor 3000 to 3500 years old. For instance, Indo-Aryan languages are classified as a separate genus from Iranian languages, although it is quite well known that both Indo-Aryan and Iranian languages are descended from a common Proto-Indo-Iranian ancestor. The Ethnologue classification is a multi-level tree classification for a language family. This classification is often criticized for being too “lumping”, i.e., too liberal in positing genetic relatedness between languages. The highest node in a family tree is the family itself, and languages form the lowest nodes (leaves).
15 Available for downloading at http://email.eva.mpg.de/~wichmann/listss14.zip.
An internal node in the tree is not necessarily binary. For instance, the Dravidian language family has four branches emerging from the top node (see figure 3 for the Ethnologue family tree of Dravidian languages).

Table 1: Distribution of language families in the ASJP database. WN and WLs stand for WALS Name and Word Lists.

Family Name          WN    # WLs
Afro-Asiatic         AA    287
Algic                Alg   29
Altaic               Alt   84
Arwakan              Arw   58
Australian           Aus   194
Austro-Asiatic       AuA   123
Austronesian         An    1008
Border               Bor   16
Bosavi               Bos   14
Carib                Car   29
Chibchan             Chi   20
Dravidian            Dra   31
Eskimo-Aleut         EA    10
Hmong-Mien           HM    32
Hokan                Hok   25
Huitotoan            Hui   14
Indo-European        IE    269
Kadugli              Kad   11
Khoisan              Kho   17
Kiwain               Kiw   14
LakesPlain           LP    26
Lower-Sepik-Ramu     LSR   20
Macro-Ge             MGe   24
Marind               Mar   30
Mayan                May   107
Mixe-Zoque           MZ    15
MoreheadU.Maro       MUM   15
Na-Dene              NDe   23
Nakh-Daghestanian    NDa   32
Niger-Congo          NC    834
Nilo-Saharan         NS    157
Otto-Manguean        OM    80
Panoan               Pan   19
Penutian             Pen   21
Quechuan             Que   41
Salish               Sal   28
Sepik                Sep   26
Sino-Tibetan         ST    205
Siouan               Sio   17
Sko                  Sko   14
Tai-Kadai            TK    103
Toricelli            Tor   27
Totonacan            Tot   14
Trans-NewGuinea      TNG   298
Tucanoan             Tuc   32
Tupian               Tup   47
Uralic               Ura   29
Uto-Aztecan          UA    103
West-Papuan          WP    33
WesternFly           WF    38
Fig. 3: Ethnologue tree for the Dravidian language family.
5 Similarity measures

For the experiments described below, we have considered both string similarity measures and distributional measures for computing the distance between a pair of languages. As mentioned earlier, string similarity measures work at the level of word pairs and provide an aggregate score of the similarity between word pairs, whereas distributional measures compare the n-gram profiles of a language pair to yield a distance score.
5.1 String similarity measures

The different string similarity measures for a word pair that we have investigated are the following:
– IDENT returns 1 if the words are identical, otherwise it returns 0.
– PREFIX returns the length of the longest common prefix divided by the length of the longer word.
– DICE is defined as the number of shared bigrams divided by the total number of bigrams in both the words.
– LCS is defined as the length of the longest common subsequence divided by the length of the longer word (Melamed, 1999).
– TRIGRAM is defined in the same way as DICE but uses trigrams for computing the similarity between a word pair.
– XDICE is defined in the same way as DICE but uses “extended bigrams”, which are trigrams without the middle letter (Brew and McKelvie, 1996).
– Jaccard's index, JCD, is a set cardinality measure that is defined as the ratio of the number of shared bigrams between the two words to the size of the union of the bigrams of the two words.
– LDN, as defined above.
Each word-pair similarity score is converted to its distance counterpart by subtracting the score from 1.0.16 Note that this conversion can sometimes result in a negative distance, which is due to the double normalization involved in LDND.17 This distance score for a word pair is then used to compute the pairwise distance between a language pair, as described in section 2.1. Following the naming convention of LDND, a suffix “D” is added to the name of each measure to indicate its doubly normalized (LDND-style) distance variant.
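The word-pair measures themselves are simple to state in code. The following Python sketch (our own illustration, using the standard formulations of the coefficients) implements a few of them together with the similarity-to-distance conversion:

```python
def bigrams(w):
    return [w[i:i + 2] for i in range(len(w) - 1)]

def prefix(w1, w2):
    """Length of the longest common prefix over the length of the longer word."""
    p = 0
    for a, b in zip(w1, w2):
        if a != b:
            break
        p += 1
    return p / max(len(w1), len(w2))

def dice(w1, w2):
    """Dice coefficient over the bigram multisets of the two words."""
    b1, b2 = bigrams(w1), bigrams(w2)
    shared = sum(min(b1.count(g), b2.count(g)) for g in set(b1))
    return 2 * shared / (len(b1) + len(b2))

def lcs(w1, w2):
    """Longest common subsequence length over the length of the longer word."""
    m, n = len(w1), len(w2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if w1[i] == w2[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n)

def jcd(w1, w2):
    """Jaccard index over the bigram sets of the two words."""
    s1, s2 = set(bigrams(w1)), set(bigrams(w2))
    return len(s1 & s2) / len(s1 | s2)

def distance(sim, w1, w2):
    """Convert any of the similarity functions above into a distance."""
    return 1.0 - sim(w1, w2)
```

For example, distance(dice, "hand", "hant") yields the DICE-based distance for a single word pair; aggregating such distances over meaning slots, with or without the double normalization, gives the measure's plain or "D" variant.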
5.2 N-gram similarity

N-gram similarity measures are inspired by a line of work initially pursued in the context of information retrieval, aiming at automatic language identification in a multilingual document. Cavnar and Trenkle (1994) used character n-grams for text categorization. They observed that different document categories – including documents in different languages – have characteristic character n-gram profiles. The rank of a character n-gram varies across different categories, and documents belonging to the same category have similar Zipfian distributions of character n-grams.
16 Lin (1998) investigates three distance-to-similarity conversion techniques and motivates the results from an information-theoretical point of view. In this article, we do not investigate the effects of similarity-to-distance conversion. Rather, we stick to the traditional conversion technique.
17 Thus, the resulting distance is not a true distance metric.
Building on this idea, Dunning (1994, 1998) postulates that each language has its own signature character (or phoneme, depending on the level of transcription) n-gram distribution. Comparing the character n-gram profiles of two languages can yield a single point distance between the language pair. The comparison procedure is usually accomplished through the use of one of the distance measures given in Singh 2006. The following steps are followed for extracting the phoneme n-gram profile for a language:
– An n-gram is defined as a sequence of consecutive phonemes in a window of length N. The value of N usually ranges from 1 to 5.
– All n-grams are extracted for a lexical item. This step is repeated for all the lexical items in a word list.
– All the extracted n-grams are pooled and sorted in descending order of their frequency. The relative frequency of the n-grams is computed.
– Only the top G n-grams are retained and the rest are discarded. The value of G is determined empirically.
For a language pair, the n-gram profiles can then be compared using one of the following distance measures:
1. The Out-of-Rank measure is defined as the aggregate sum of the absolute differences in the ranks of the shared n-grams between a pair of languages. If an n-gram is not shared between the two profiles, its rank difference is assigned a maximum out-of-place score.
2. Jaccard's index is a set cardinality measure. It is defined as the ratio of the cardinality of the intersection of the n-grams of the two languages to the cardinality of their union.
3. Dice distance is related to Jaccard's index. It is defined as the ratio of twice the number of shared n-grams to the total number of n-grams in both language profiles.
4. Manhattan distance is defined as the sum of the absolute differences between the relative frequencies of the shared n-grams.
5. Euclidean distance is defined in a similar fashion to Manhattan distance, except that the individual terms are squared.
While replicating the original ASJP experiments on the version 12 ASJP database, we also tested whether the above distributional measures (1–5) perform as well as LDN. Unfortunately, the results were not encouraging, and we did not repeat the experiments on version 14 of the database. One main reason for this result is the relatively small size of the ASJP concept list, which provides a poor estimate of the true language signatures.
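A minimal Python sketch of the profile extraction and of two of the profile distances described above (the function names, the cut-off G, and the data layout are our own assumptions, not the implementation used in the experiments):

```python
from collections import Counter

def ngram_profile(word_list, n_max=5, top_g=200):
    """Relative-frequency profile of the top G phoneme n-grams (n = 1..n_max)."""
    counts = Counter()
    for word in word_list:
        for n in range(1, n_max + 1):
            counts.update(word[i:i + n] for i in range(len(word) - n + 1))
    total = sum(counts.values())
    return {gram: freq / total for gram, freq in counts.most_common(top_g)}

def jaccard_profile_distance(p1, p2):
    """1 minus the Jaccard index over the two sets of retained n-grams."""
    s1, s2 = set(p1), set(p2)
    return 1.0 - len(s1 & s2) / len(s1 | s2)

def manhattan_profile_distance(p1, p2):
    """Sum of absolute differences of relative frequencies over shared n-grams."""
    shared = set(p1) & set(p2)
    return sum(abs(p1[g] - p2[g]) for g in shared)
```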
This factor speaks equally, or even more, against including another class of n-gram-based measures, namely information-theoretic measures such as cross-entropy and KL-divergence. These measures have been well studied in natural language processing tasks such as machine translation, natural language parsing, and sentiment identification, and also in automatic language identification. However, the probability distributions required for using these measures are usually estimated through maximum likelihood estimation, which requires a fairly large amount of data, and the short ASJP concept lists will hardly qualify in this regard.
6 Evaluation measures

The measures which we have used for evaluating the performance of the string similarity measures given in section 5 are the following three:
1. dist was originally suggested by Wichmann et al. (2010a), who used it to test whether LDND is better than LDN at the task of distinguishing related languages from unrelated languages.
2. RW, a special case of Pearson's r called point biserial correlation (Tate, 1954), computes the agreement between the intra-family pairwise distances and the WALS classification for the family.
3. γ is related to Goodman and Kruskal's Gamma (1954) and measures the strength of association between two ordinal variables. In this paper, it is used to compute the level of agreement between the pairwise intra-family distances and the family's Ethnologue classification.
6.1 Distinctiveness measure (dist)

The dist measure for a family consists of three components: the mean of the pairwise distances inside the language family (din); the mean of the pairwise distances from each language in the family to the languages in the other families (dout); and sdout, the standard deviation of all the pairwise distances used to compute dout. Finally, dist is defined as (dout − din)/sdout. The resistance of a string similarity measure to random similarities with other language families is reflected by the value of sdout. A comparatively higher dist value suggests that a string similarity measure is particularly resistant to random similarities between unrelated languages and performs well at distinguishing languages belonging to the same language family from languages in other language families.
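Under this definition, the computation for a single family can be sketched as follows (the bookkeeping structures and function names are our own assumptions for illustration):

```python
from statistics import mean, stdev

def dist_score(family, languages, family_of, d):
    """dist = (dout - din) / sdout for one family.

    languages: all language ids in the dataset;
    family_of: dict mapping a language id to its family;
    d(a, b): pairwise distance under the chosen string similarity measure."""
    inside = [l for l in languages if family_of[l] == family]
    outside = [l for l in languages if family_of[l] != family]
    din = mean(d(a, b) for i, a in enumerate(inside) for b in inside[i + 1:])
    cross = [d(a, b) for a in inside for b in outside]
    dout, sdout = mean(cross), stdev(cross)
    return (dout - din) / sdout
```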
6.2 Correlation with WALS The WALS database provides a three-level classification. The top level is the language family, the second level is the genus and the lowest level is the language itself. If two languages belong to different families, then the distance is 3. Two languages that belong to different genera in the same family have a distance of 2. If the two languages fall in the same genus, they have a distance of 1. This allows us to define a distance matrix for each family based on WALS. The WALS distance matrix can be compared to the distance matrices of any string similarity measure using point biserial correlation – a special case of Pearson’s r. If a family has a single genus in the WALS classification there is no computation of RW and the corresponding row for a family is empty in table 7.
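A sketch of the RW computation for one family, assuming the WALS genus and family assignments are available as dictionaries and that SciPy is at hand for Pearson's r (the names below are our own):

```python
from scipy.stats import pearsonr

def wals_distance(a, b, genus_of, family_of):
    """WALS-based distance: 3 across families, 2 across genera, 1 within a genus."""
    if family_of[a] != family_of[b]:
        return 3
    return 2 if genus_of[a] != genus_of[b] else 1

def rw_for_family(langs, genus_of, family_of, d):
    """Correlation between a measure's intra-family distances and WALS distances.

    Within a single family the WALS distances take only the values 1 and 2,
    hence Pearson's r reduces to the point biserial correlation."""
    pairs = [(a, b) for i, a in enumerate(langs) for b in langs[i + 1:]]
    measured = [d(a, b) for a, b in pairs]
    wals = [wals_distance(a, b, genus_of, family_of) for a, b in pairs]
    return pearsonr(measured, wals)[0]   # undefined if the family has a single genus
```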
6.3 Agreement with Ethnologue

Given a distance matrix d of order N × N, where each cell dij is the distance between two languages i and j, and an Ethnologue tree E, the computation of γ for a language family is defined as follows:
1. Enumerate all the triplets for a language family of size N. A triplet t for a language family is defined as {i, j, k}, where i ≠ j ≠ k are languages belonging to the family. A language family of size N has N(N−1)(N−2)/6 triplets.
2. For the members of each such triplet t, there are three lexical distances dij, dik, and djk. The expert classification tree E can treat the three languages {i, j, k} in four possible ways (| denotes a partition): {i, j | k}, {i, k | j}, {j, k | i}, or a tie where all three languages emanate from the same node. All ties are ignored in the computation of γ.18
3. A distance triplet dij, dik, and djk is said to agree completely with an Ethnologue partition {i, j | k} when dij < dik and dij < djk. A triplet that satisfies these conditions is counted as a concordant comparison, C; else it is counted as a discordant comparison, D.
18 We do not know what a tie in the gold standard indicates: uncertainty in the classification, or a genuine multi-way branching? Whenever the Ethnologue tree of a family is completely unresolved, it is shown by an empty row. For example, the family tree of Bosavi languages is a star structure. Hence, the corresponding row in table 5 is left empty.
4. Steps 2 and 3 are repeated for all the triplets to yield γ for a family, defined as γ = (C − D)/(C + D). γ lies in the range [−1, 1], where a score of −1 indicates perfect disagreement and a score of +1 indicates perfect agreement.
At this point, one might wonder about the decision not to use an off-the-shelf tree-building algorithm to infer a tree and compare the resulting tree with the Ethnologue classification. Although both Pompei et al. (2011) and Huff and Lonsdale (2011) compare their inferred trees – based on Neighbor-Joining and Minimum Evolution algorithms – to Ethnologue trees using cleverly crafted tree-distance measures (GRF and GQD), they do not make the more intuitively useful direct comparison of the distance matrices to the Ethnologue trees. Tree inference algorithms use heuristics to find the best tree from the available tree space. The number of possible rooted, non-binary and unlabeled trees is quite large even for a language family of size 20 – about 256 × 10^6. A tree inference algorithm uses heuristics to reduce the tree space to find the best tree that explains the distance matrix, and it can make mistakes while searching for the best tree. Moreover, there are many variations of the Neighbor-Joining and Minimum Evolution algorithms.19 Ideally, one would have to test the different tree inference algorithms and then decide the best one for our task. However, the focus of this paper rests on the comparison of different string similarity algorithms and not on tree inference algorithms. Hence, a direct comparison of a family's distance matrix to the family's Ethnologue tree circumvents the choice of the tree inference algorithm.

19 http://www.atgc-montpellier.fr/fastme/usersguide.php
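A sketch of the triplet-based computation of γ described in section 6.3, assuming that the Ethnologue partition of each triplet is available through a helper function partition(i, j, k) (an assumption of this illustration, not part of the original setup):

```python
from itertools import combinations

def gamma_for_family(langs, d, partition):
    """Agreement between a distance matrix and an expert tree via triplets.

    d(a, b): lexical distance between two languages;
    partition(i, j, k): the pair grouped together by the Ethnologue tree,
    e.g. (i, j) for the partition {i, j | k}, or None for a tie."""
    concordant = discordant = 0
    for i, j, k in combinations(langs, 3):
        grouped = partition(i, j, k)
        if grouped is None:          # ties are ignored
            continue
        a, b = grouped
        c = ({i, j, k} - {a, b}).pop()
        if d(a, b) < d(a, c) and d(a, b) < d(b, c):
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```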
7 Results and discussion

In table 2 we give the results of our experiments. We only report the average results for all measures across the families listed in table 1. Further, we check the correlation between the performance of the different string similarity measures across the three evaluation measures by computing Spearman's ρ. The pairwise ρ is given in table 3. The high correlation value of 0.95 between RW and γ suggests that all the measures agree roughly on the task of internal classification.
The average scores in each column suggest that the string similarity measures exhibit different degrees of performance. How does one decide which measure is the best in a column? What kind of statistical testing procedure should be adopted for deciding upon a measure? We address these questions through the following procedure:
1. For a column i, sort the average scores s in descending order.
2. For each rank 1 ≤ r ≤ 15, test the significance of s_r ≥ s_(r+1) through a sign test (Sheskin, 2003). Each test yields a p-value.
The above significance tests are not independent of each other. Hence, we cannot simply reject a null hypothesis H0 at a significance level of α = 0.01; the α needs to be corrected for multiple tests. Unfortunately, the standard Bonferroni multiple-test correction and Fisher's omnibus test work for a global null hypothesis and not at the level of a single test. We follow the procedure called False Discovery Rate (FDR), given by Benjamini and Hochberg (1995), for adjusting the α value for multiple tests. Given null hypotheses H_1 . . . H_m and p-values P_1 . . . P_m, the procedure works as follows:
1. Sort the P_k, 1 ≤ k ≤ m, in ascending order; k is the rank of a p-value.
2. The adjusted value α*_k for P_k is (k/m)α.
3. Find the largest k such that P_k ≤ α*_k, and reject the null hypotheses H_1, . . . , H_k.
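A small Python sketch of the Benjamini-Hochberg step-up procedure just described (the function name and return convention are our own):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of the null hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p-value
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * alpha:            # P_k <= (k/m) * alpha
            largest_k = rank
    return order[:largest_k]                                # reject the k smallest p-values
```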
Table 2: Average results for each string similarity measure across the 50 families. The rows are sorted by the name of the measure.

Measure     Average Dist   Average RW   Average γ
DICE        3.3536         0.5449       0.6575
DICED       9.4416         0.5495       0.6607
IDENT       1.5851         0.4013       0.2345
IDENTD      8.163          0.4066       0.3082
JCD         13.9673        0.5322       0.655
JCDD        15.0501        0.5302       0.6622
LCS         3.4305         0.6069       0.6895
LCSD        6.7042         0.6151       0.6984
LDN         3.7943         0.6126       0.6984
LDND        7.3189         0.619        0.7068
PREFIX      3.5583         0.5784       0.6747
PREFIXD     7.5359         0.5859       0.6792
TRIGRAM     1.9888         0.4393       0.4161
TRIGRAMD    9.448          0.4495       0.5247
XDICE       0.4846         0.3085       0.433
XDICED      2.1547         0.4026       0.4838
Average     6.1237         0.5114       0.5739
Table 3: Spearman's ρ between γ, RW, and Dist.

       Dist   γ
γ      0.30
RW     0.32   0.95
The above procedure ensures that the chance of incorrectly rejecting a null hypothesis is 1 in 20 for α = 0.05 and 1 in 100 for α = 0.01. In this experimental context, this suggests that we erroneously reject 0.75 true null hypotheses out of 15 hypotheses for α = 0.05 and 0.15 hypotheses for α = 0.01.
We report the Dist, γ, and RW for each family in tables 5, 6, and 7. In each of these tables, only those measures which are above the average scores from table 2 are reported. The FDR procedure for γ suggests that no sign test is significant. This is in agreement with the result of Wichmann et al. (2010a), who showed that the choice of LDN or LDND is quite unimportant for the task of internal classification. The FDR procedure for RW suggests that LDN > LCS, LCS > PREFIXD, DICE > JCD, and JCD > JCDD. Here A > B denotes that A is significantly better than B. The FDR procedure for Dist suggests that JCDD > JCD, JCD > TRIGRAMD, DICED > IDENTD, LDND > LCSD, and LCSD > LDN.
The results point towards an important direction in the task of building computational systems for automatic language classification. The pipeline for such a system consists of (1) distinguishing related languages from unrelated languages; and (2) internal classification of the related languages. JCDD performs the best with respect to Dist. Further, JCDD is derived from JCD and can be computed in O(m + n) for two strings of length m and n. In comparison, LDN is in the order of O(mn). The computational complexity of computing the distance between two word lists for all the significant measures is given in table 4. Based on the computational complexity and the significance scores, we propose that JCDD be used for step 1 and a measure like LDN be used for internal classification.

Table 4: Computational complexity of top performing measures for computing the distance between two word lists, each of length l. m and n denote the lengths of a word pair wa and wb, and C = l(l − 1)/2.

Measure     Complexity
JCDD        C · O(m + n + min(m − 1, n − 1))
JCD         l · O(m + n + min(m − 1, n − 1))
LDND        C · O(mn)
LDN         l · O(mn)
PREFIXD     C · O(max(m, n))
LCSD        C · O(mn)
LCS         l · O(mn)
DICED       C · O(m + n + min(m − 2, n − 2))
DICE        l · O(m + n + min(m − 2, n − 2))
8 Conclusion In this article, we have presented the first known attempt to apply more than 20 different similarity (or distance) measures to the problem of genetic classification of languages on the basis of Swadesh-style core vocabulary lists. The experiments were performed on the wide-coverage ASJP database (about half the world’s languages). We have examined the various measures at two levels, namely: (1) their capability of distinguishing related and unrelated languages; and (2) their performance as measures for internal classification of related languages. We find that the choice of string similarity measure (among the tested pool of measures) is not very important for the task of internal classification whereas the choice affects the results of discriminating related languages from unrelated ones.
Acknowledgments The authors thank Søren Wichmann, Eric W. Holman, Harald Hammarström, and Roman Yangarber for useful comments which have helped us to improve the presentation. The string similarity experiments have been made possible through the use of ppss software20 recommended by Leif-Jöran Olsson. The first author would like to thank Prasant Kolachina for the discussions on parallel implementations in Python. The work presented here was funded in part by the Swedish Research Council (the project Digital areal linguistics; contract no. 20091448).
20 http://code.google.com/p/ppss/
References Atkinson, Quentin D. & Russell D. Gray. 2005. Curious parallels and curious connections — phylogenetic thinking in biology and historical linguistics. Systematic Biology 54(4). 513– 526. Bakker, Dik, André Müller, Viveka Velupillai, Søren Wichmann, Cecil H. Brown, Pamela Brown, Dmitry Egorov, Robert Mailhammer, Anthony Grant & Eric W. Holman. 2009. Adding typology to lexicostatistics: A combined approach to language classification. Linguistic Typology 13(1):169–181. Benjamini, Yoav & Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 57(1). 289–300. Bergsland, Knut & Hans Vogt. 1962. On the validity of glottochronology. Current Anthropology, 3(2):115–153. ISSN 00113204. Bouchard-Côté, Alexandre, David Hall, Thomas L. Griffiths & Dan Klein. 2013. Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences, 110(11). 4224–4229. Brew, Chris & David McKelvie. 1996. Word-pair extraction for lexicography. In Kemal Oflazer & Harold Somers (eds.), Proceedings of the Second International Conference on New Methods in Language Processing, 45–55. Ankara. Brown, Cecil H., Eric W. Holman, Søren Wichmann, & Viveka Velupillai. 2008. Automated classification of the world’s languages: A description of the method and preliminary results. Sprachtypologie und Universalienforschung 61(4). 285–308. Cavnar, William B. & John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 161–175. Las Vegas (NV): UNLV Publications. Christiansen, Chris, Thomas Mailund, Christian N. S. Pedersen, Martin Randers, & Martin S. Stissing. 2006. Fast calculation of the quartet distance between trees of arbitrary degrees. Algorithms for Molecular Biology 1. Article No. 16. Dryer, Matthew S. 2000. Counting genera vs. counting languages. Linguistic Typology 4. 334– 350. Dryer, Matthew S. & Martin Haspelmath. 2013. The world atlas of language structures online. Leipzig: Max Planck Institute for Evolutionary Anthropology. http://wals.info (accessed on 26 November 2014). Dunning, Ted E. 1994. Statistical identification of language. Technical Report CRL MCCS-94273. Las Cruces (NM): New Mexico State University, Computing Research Lab. Dunning, Ted E. 1998. Finding structure in text, genome and other symbolic sequences. Sheffield: University of Sheffield. Durie, Mark & Malcolm Ross (eds.). 1996. The comparative method reviewed: Regularity and irregularity in language change. Oxford & New York: Oxford University Press. Dyen, Isidore, Joseph B. Kruskal, & Paul Black. 1992. An Indo-European classification: A lexicostatistical experiment. Transactions of the American Philosophical Society 82(5). 1– 132. Ellegård, Alvar. 1959. Statistical measurement of linguistic relationship. Language 35(2). 131– 156.
Comparative evaluation of string similarity measures 191 Ellison, T. Mark & Simon Kirby. 2006. Measuring language divergence by intra-lexical comparison. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 273–280. Association for Computational Linguistics. http://www.aclweb.org/anthology/P06-1035 (accessed 27 November 2014). Embleton, Sheila M. 1986. Statistics in historical linguistics (Quantitative Linguistics 30). Bochum: Brockmeyer. Felsenstein, Joseph. 2002. PHYLIP (phylogeny inference package) version 3.6 a3. Distributed by the author. Seattle (WA): University of Washington, Department of Genome Sciences. Felsenstein, Joseph. 2004. Inferring phylogenies. Sunderland (MA): Sinauer Associates. Goodman, Leo A. & William H. Kruskal. 1954. Measures of association for cross classifications. Journal of the American Statistical Association 49(268). 732–764. Greenhill, Simon J. & Russell D. Gray. 2009. Austronesian language phylogenies: Myths and misconceptions about Bayesian computational methods. In Alexander Adelaar & Andrew Pawlye (eds.), Austronesian historical linguistics and culture history: A festschrift for Robert Blust, 375–397. Canberra: Pacific Linguistics Greenhill, Simon J., Robert Blust & Russell D. Gray. 2008. The Austronesian basic vocabulary database: From bioinformatics to lexomics. Evolutionary Bioinformatics 4. 271–283. Haspelmath, Martin, Matthew S. Dryer, David Gil, and Bernard Comrie. 2011. WALS online. Munich: Max Planck Digital Library. http://wals.info. Hauer, Bradley & Grzegorz Kondrak. 2011. Clustering semantically equivalent words into cognate sets in multilingual lists. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 865–873. Asian Federation of Natural Language Processing. http://www.aclweb.org/anthology/I11-1097 (accessed 27 November 2014). Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka Velupillai, André Müller & Dik Bakker. 2008. Advances in automated language classification. In Antti Arppe, Kaius Sinnemäki & Urpu Nikanne (eds.), Quantitative investigations in theoretical linguistics, 40–43. Helsinki: University of Helsinki. Huff, Paul & Deryle Lonsdale. 2011. Positing language relationships using ALINE. Language Dynamics and Change 1(1). 128–162. Huffman, Stephen M. 1998. The genetic classification of languages by n-gram analysis: A computational technique. Washington (DC): Georgetown University dissertation. Inkpen, Diana, Oana Frunza & Grzegorz Kondrak. 2005. Automatic identification of cognates and false friends in French and English. In Proceedings of the International Conference Recent Advances in Natural Language Processing, 251–257. Jäger, Gerhard. 2013. Phylogenetic inference from word lists using weighted alignment with empirically determined weights. Language Dynamics and Change 3(2). 245–291. Kondrak, Grzegorz. 2000. A new algorithm for the alignment of phonetic sequences. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics, 288–295. Kondrak, Grzegorz. 2002a. Algorithms for language reconstruction. Toronto: University of Toronto dissertation. Kondrak, Grzegorz. 2002b. Determining recurrent sound correspondences by inducing translation models. In Proceedings of the 19th international conference on Computational linguistics, Volume 1. Association for Computational Linguistics. 
http://www.aclweb.org/anthology/C02-1016 (accessed 26 November 2014).
192 Taraka Rama and Lars Borin Kondrak, Grzegorz & Tarek Sherif. 2006. Evaluation of several phonetic similarity algorithms on the task of cognate identification. In Proceedings of ACL Workshop on Linguistic Distances, 43–50. Association for Computational Linguistics. Kroeber, Alfred L. & C. Douglas Chrétien. 1937. Quantitative classification of Indo-European languages. Language 13(2). 83–103. Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics - Doklady 10(8). 707–710. Lewis, Paul M. (ed.). 2009. Ethnologue: Languages of the world, 16th edn. Dallas (TX): SIL International. Lin, Dekang. 1998. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, Volume 1, 296–304. List, Johann-Mattis. 2012. LexStat: Automatic detection of cognates in multilingual wordlists. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, 117–125. Association for Computational Linguistics. http://www.aclweb.org/anthology/W12-0216 (accessed 27 November 2014). Melamed, Dan I. 1999. Bitext maps and alignment via pattern recognition. Computational Linguistics 25(1). 107–130. Nichols, Johanna. 1996. The comparative method as heuristic. In Mark Durie & Malcom Ross (eds), The comparative method revisited: Regularity and Irregularity in Language Change, 39–71. New York: Oxford University Press. Nordhoff, Sebastian & Harald Hammarström. 2011. Glottolog/Langdoc: Defining dialects, languages, and language families as collections of resources. In Proceedings of the First International Workshop on Linked Science, Volume 783. http://ceur-ws.org/Vol783/paper7.pdf (accessed 27 November 2014). Petroni, Filippo & Maurizio Serva. 2010. Measures of lexical distance between languages. Physica A: Statistical Mechanics and its Applications 389(11). 2280–2283. Pompei, Simone, Vittorio Loreto, & Francesca Tria. 2011. On the accuracy of language trees. PloS ONE 6(6). e20109. Rama, Taraka & Anil K. Singh. 2009. From bag of languages to family trees from noisy corpus. In Proceedings of the International Conference RANLP-2009, 355–359. Association for Computational Linguistics. http://www.aclweb.org/anthology/R09-1064 (accessed 27 November 2014). Robinson, David F. & Leslie R. Foulds. 1981. Comparison of phylogenetic trees. Mathematical Biosciences 53(1-2). 131–147. Saitou, Naruya & Masatoshi Nei. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4(4). 406–425. Sankoff, David. 1969. Historical linguistics as stochastic process. Montreal: McGill University dissertation. Sheskin, David J. 2003. Handbook of parametric and nonparametric statistical procedures. Ba Raton (FL): Chapman & Hall/CRC Press. Singh, Anil K. 2006. Study of some distance measures for language and encoding identification. In Proceedings of Workshop on Linguistic Distances, 63–72. Association for Computational Linguistics. http://www.aclweb.org/anthology/W06-1109 (accessed 27 November 2014). Singh, Anil K. & Harshit Surana. 2007. Can corpus based measures be used for comparative study of languages? In Proceedings of Ninth Meeting of the ACL Special Interest Group in
Comparative evaluation of string similarity measures 193 Computational Morphology and Phonology, 40–47. Association for Computational Linguistics. http://aclweb.org/anthology/W07-1306 (accessed 27 November 2014). Sokal, Robert R. & Charles D Michener. 1958. A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin 38. 1409–1438. Swadesh, Morris. 1950. Salish internal relationships. International Journal of American Linguistics 16(4). 157–167. Swadesh, Morris. 1952. Lexico-statistic dating of prehistoric ethnic contacts with special reference to North American Indians and Eskimos. Proceedings of the American Philosophical Society 96(4). 452–463. Swadesh, Morris. 1955. Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21(2). 121–137. Tate, Robert F. 1954. Correlation between a discrete and a continuous variable. Point-biserial correlation. The Annals of Mathematical Statistics 25(3). 603–607. Wichmann, Søren & Taraka Rama. 2014. Jackknifing the black sheep: ASJP classification performance and Austronesian. Submitted to the proceedings of the symposium ``Let’s talk about trees'', National Museum of Ethnology, Osaka, Febr. 9-10, 2013. Wichmann, Søren, Eric W. Holman, Dik Bakker & Cecil H. Brown. 2010a. Evaluating linguistic distance measures. Physica A: Statistical Mechanics and its Applications 389(17). 3632– 3639. Wichmann, Søren, André Müller, Viveka Velupillai, Cecil H. Brown, Eric W. Holman, Pamela Brown, Matthias Urban, Sebastian Sauppe, Oleg Belyaev, Zarina Molochieva, Annkathrin Wett, Dik Bakker, Johann-Mattis List, Dmitry Egorov, Robert Mailhammer, David Beck & Helen Geyer. 2010b. The ASJP database (version 12). Søren Wichmann, André Müller, Viveka Velupillai, Annkathrin Wett, Cecil H. Brown, Zarina Molochieva, Sebastian Sauppe, Eric W. Holman, Pamela Brown, Julia Bishoffberger, Dik Bakker, Johann-Mattis List, Dmitry Egorov, Oleg Belyaev, Matthias Urban, Robert Mailhammer, Helen Geyer, David Beck, Evgenia Korovina, Pattie Epps, Pilar Valenzuela, Anthony Grant, & Harald Hammarström. 2011. The ASJP database (version 14). Wieling, Martijn, Jelena Prokić & John Nerbonne. 2009. Evaluating the pairwise string alignment of pronunciations. In Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education, 26–34. Association for Computational Linguistics.
Appendix

Table 5: Distances (Dist) for families and measures above average. [Per-family values not reproduced.]

Table 6: γ for families and measures above average. [Per-family values not reproduced.]

Table 7: RW for families and measures above average. [Per-family values not reproduced.]
Vassiliki Rentoumi, Anastasia Krithara, Nikos Tzanos
Predicting Sales Trends
Can sentiment analysis on social media help?
1 Introduction

Over the last few years, social media have gained a lot of attention, and many users share their opinions and experiences on them. As a result, there is an aggregation of personal wisdom and different viewpoints. If all this information is extracted and analyzed properly, the data on social media can lead to useful predictions of several human-related events. Such prediction has great benefits in many areas, such as finance and product marketing (Yu & Kak 2012). The latter has attracted the attention of researchers in the social network analysis field, and several approaches have been proposed. Although a lot of research has been conducted on predicting the outcomes of events related to finance, the stock market and politics, so far there is no analogous research focusing on the prediction of product sales.
Most of the current approaches analyse the sentiment of tweets in order to predict different events. The limitation of these approaches is that they use sentiment metrics in a strictly quantitative way, taking into account the pure number of favourites or the fraction of likes over dislikes, which are not always accurate or representative of people's sentiment. In other words, the current approaches do not take into account the sentiment trends or the longitudinal sentiment fluctuation expressed by people over time about a product, which could help in estimating a future trend.
To this end, we propose a computational approach which conducts predictions on the sales trends of products based on the public sentiment (i.e. positive/negative/neutral stance) expressed via Twitter. The sentiment expressed in a tweet is determined on a per-tweet basis from its context, by taking into account the relations among its words. The sentiment feature used for making predictions on sales trends is not considered as an isolated parameter but is used in correlation and in interaction with other features extracted from sequential historical data.
The rest of the chapter is organized as follows: in the next section (2), the state of the art in the area of predicting various human-related events by exploiting sentiment analysis is discussed. Then, in section 3, the proposed approach is presented. The method is evaluated in section 4, where the conducted experiments and the obtained results are presented. The chapter is concluded in section 5.
2 State of the art

Stock market prediction has attracted much attention from both academia and business. Research in this direction was initiated by Eppen & Fama (1969), Fama (1991) and Cootner (1964). Their approaches were based on random walk theory and the Efficient Market Hypothesis (EMH) (Fama 1965). According to the EMH, stock market prices are largely driven by new information, i.e. news, rather than present and past prices. Since news is unpredictable, stock market prices will follow a random walk pattern and cannot be predicted with more than 50 percent accuracy (Qian & Rasheed 2007, Bollen et al. 2011).
According to Bollen et al. (2011), there are two main problems with the EMH. First, there are numerous studies showing that stock market prices do not follow a random walk and can indeed to some degree be predicted (Qian & Rasheed 2007, Gallagher & Taylor 2002, Kavussanos & Dockery 2001). This puts into question the basic assumptions of the EMH. Second, recent research has shown that even though news is unpredictable, indicators can be extracted from online social media (like blogs, Twitter and feeds) in order to predict changes in various economic and commercial indicators. For example, Gruhl et al. (2005) show that online chat activity can help to predict book sales. In the same vein, Mishne & Glance (2006) perform sentiment analysis on blogs to predict movie sales. More recently, Asur & Huberman (2010) demonstrate how public sentiment related to movies, as expressed on Twitter, can actually predict box office receipts.
To this end, different sentiment tracking approaches have been proposed in the literature, and significant progress has been made in extracting public mood directly from social media content such as blog content (Gilbert & Karahalios 2010, Mishne & Glance 2006, Liu et al. 2007, Dodds & Danforth 2010). Twitter has gained a lot of attention in this direction. Although tweets are limited to only 140 characters, if we aggregate the millions of tweets submitted to Twitter at any given moment, we may have an accurate representation of public mood and sentiment (Pak & Paroubek 2010). As a result, real-time sentiment-tracking methods have been proposed, such as Dodds & Danforth (2010) and “Pulse of Nation”.
The proposed approaches use different social media metrics for prediction. The metrics used may be divided into two categories (Yu & Kak 2012): message characteristics and social network characteristics. The message characteristics focus on the messages themselves, such as sentiment and time series metrics. On the other hand, the social network characteristics concern structural features. The sentiment metrics are the static features of posts. With a qualitative sentiment analysis system, the messages can be assigned a positive, negative, or neutral sentiment. Thus the numbers of positive, negative, neutral, non-neutral, and total posts are five elementary content predictors. These metrics may have different prediction power at different stages (Yu & Kak 2012). In various other endeavours, researchers try to calculate the relative strength of the computed sentiment. To this end, the prediction approach adopted by Zhang & Skiena (2009) computes various ratios between different types of sentiment-bearing posts: they specifically use the ratio between the number of positive and total posts, the ratio between the number of negative and total posts, and the ratio between the number of neutral and total posts. Asur & Huberman (2010) calculate the ratio between the number of neutral and non-neutral posts, and the ratio between the numbers of positive and negative posts. Further, with the use of time series metrics researchers try to investigate the posts more dynamically, including the speed and process of message generation. In that case, different time window sizes (such as hourly, daily or weekly) can be taken into account in order to calculate the generation rate of posts. The intuition behind the use of the post generation rate is that a higher rate implies the engagement of more people with a topic, and thus the topic is considered more attractive. For example, Asur & Huberman (2010) have shown that the daily post generation rate before the release of a movie is a good predictor of its box-office performance. As mentioned before, the social network characteristics measuring structural features can be of great importance in prediction methodologies. For example, centrality, which measures the relative importance of a node within a network, the number of followers of a user, as well as re-tweets on Twitter, when combined with the message characteristics, can provide invaluable information about the importance and the relative strength of the computed sentiment. Concerning the prediction methods used, different approaches have been proposed. For example, prediction of future product sales using a Probabilistic Latent Semantic Analysis (PLSA) model for sentiment analysis in blogs is proposed by Liu et al. (2007). A Bayesian approach has also been proposed in Liu et al. (2007). In particular, if the prediction result is discrete, the Bayes classifier can
be applied directly, otherwise the prediction result must be discretized first (Sharda & Delen 2006). A regression model has also been used in Szabo & Huberman (2010). Regression methods analyse the relationship between the dependent variable, the prediction result, and one or more independent variables, such as the social network characteristics. Another direction that has started to be explored is model-based prediction. A mathematical model of the object is built before prediction, which requires deep insight into the object (i.e. knowledge about social media to develop effective models for them). Up to now, there are few works in this direction (e.g. Romero et al. 2011, Lerman & Hogg 2010). The currently existing approaches for the prediction of sales trends, as discussed in the previous paragraphs, are usually based on common classifiers like SVM or Naive Bayes, both of which exploit a bag-of-words representation of features. These methods cannot encode the structural relations of historical data, which are very often responsible for the revelation of a future sales trend. Therefore, we are challenged to explore an approach based on structural models which, contrary to bag-of-words classification approaches, takes advantage of the sequential structure of historical data and is thus able to exploit patterns learned from their relations which can reveal the fluctuation of sales that would lead to the correct prediction of a sales trend. The limitation of current approaches is that they use sentiment metrics in a strictly quantitative way, taking into account the pure number of favourites or the fraction of likes over dislikes, which are not always accurate or representative of people's sentiment; for example, somebody could “like” a product in general but could also make a negative comment (post) on aspects of it. Most importantly, by just calculating ratios of positive over negative posts, the current approaches do not take into account the sentiment trends or the longitudinal sentiment fluctuation expressed by people over time about a product, which could help in estimating a future trend. In our approach, on the other hand, the sentiment expressed in a tweet is determined on a per-tweet basis through the context of each tweet, by taking into account the relations among its words, as described in Section 3. Then the sentiment feature is integrated in our methodology for the prediction of sales trends not as an isolated parameter but as a feature in correlation and in interaction with all the remaining features and sentiments included within our historical data.
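As an aside, the ratio-based metrics described above are simple to compute once posts have been labelled; the following sketch is our own illustration (not code from any of the cited studies) of the counts and ratios used by Zhang & Skiena (2009) and Asur & Huberman (2010).

from collections import Counter

def sentiment_ratios(labels):
    """Elementary content predictors and ratios for a list of per-post
    sentiment labels: 'pos', 'neg' or 'neu'."""
    counts = Counter(labels)
    total = len(labels)
    pos, neg, neu = counts['pos'], counts['neg'], counts['neu']
    non_neutral = pos + neg
    return {
        'pos/total': pos / total,
        'neg/total': neg / total,
        'neu/total': neu / total,
        'neu/non-neutral': neu / non_neutral if non_neutral else float('inf'),
        'pos/neg': pos / neg if neg else float('inf'),
    }

# Example: 6 positive, 3 negative and 1 neutral post
print(sentiment_ratios(['pos'] * 6 + ['neg'] * 3 + ['neu']))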
3 Methodology The proposed approach for the prediction of sales, which is based on sentiment analysis, is articulated in two consecutive stages:
– sentiment analysis stage;
– prediction of sales trends of products based on sentiment.
The output of the first stage (i.e. tweets annotated with positive, negative or neutral sentiment) provides the input for its successor.
3.1 Sentiment Analysis of tweets In Sentiment Analysis the problem of detecting the polarity orientation of sentences (i.e. tweets) can be regarded as a classification problem: each sentence can be represented by an ordered sequence of features concerning words. These features can be effectively exploited by a Conditional Random Fields (CRF) model (Lafferty et al. 2001). The motivation for using CRF in sentiment analysis is based on the principle that the meaning a sentence can imply is tightly bound to the ordering of its constituent words. Especially in the case of the English language, which lacks rich inflection, the syntactic and semantic relations of its constituents are mainly implied through word ordering. In English the change of word ordering in a sentence can affect its meaning significantly. Similarly, we have observed that word ordering can affect not only the meaning but also the polarity that a sentence conveys. This can be further illustrated through the following examples:
a) Honestly, they could not have answered those questions.
b) They could not have answered those questions honestly.
Examples (a) and (b) contain the same word information, but convey a different polarity orientation. Example (a) implies a positive polarity since the author, placing the adverb in the initial position, is trying to support a claim, providing excuses for the behaviour of someone. In example (b), on the other hand, the relocation of the adverb to the end of the sentence changes the meaning of the sentence and creates a negative connotation. Here the author criticizes the non-honest behaviour of someone, implying a negative polarity. CRF is particularly appropriate for capturing such fine distinctions since it is able to capture the relations among sentence constituents (word senses) as a function of their sentential position.
Moreover, considering structured models in sentiment analysis, Choi et al. (2006) use CRFs to learn a sequence model in order to assign sentiments to the sources (persons or entities) to which these sentiments belong. Mao & Lebanon (2007) propose a sequential CRF regression model to measure sentence-level polarity for determining the sentiment flow of authors in reviews. McDonald et al. (2007) propose a method for learning a CRF model to decide upon the polarity orientation of a document and its sentences. In the last two approaches above the aim is to determine the document-level polarity, detecting the polarity orientation of each constituent sentence. In Sadamitsu et al. (2008), structured models are utilized in order to determine the sentiment of documents which are conceived as sequences of sentences. They are based on the assumption that the polarity of the contributing sentences affects the overall document polarity. Similarly, in our approach, although sentence polarity is targeted, we investigate how the polarities of the individual words in a sentence contribute to the overall sentence-level polarity. Sadamitsu et al. (2008) proposed a method for sentiment analysis of product reviews utilizing inter-sentence structures. This approach is based on the assumption that within a positive product review there can also be negative sentences. They claim that this polarity reversal occurs on a sentence basis; therefore the sentential structures can be modelled by Hidden Conditional Random Fields (HCRF) (Quattoni et al. 2004, Gunawardana et al. 2005). HCRF discriminative models are trained with polarity and valence reversers (e.g. negations) for words. Weights for these features are learned for positive and negative document structures. Sadamitsu et al. (2008) used HMMs at the level of sentences for determining the sentiment of a document. This approach was soon abandoned, since increasing the number of HMM states lowered the accuracy of sentiment detection. They attribute this effect to the fact that HMM models are generative and not discriminative models. In Rentoumi et al. (2009), structured models such as Hidden Markov Models (HMMs) are exploited in sentiment classification of headlines. The advantage of HMMs over other machine learning approaches employed in sentiment analysis is that the majority of the latter are based on flat bag-of-features representations of sentences, without capturing the structural nature of sub-sentential interactions. On the contrary, HMMs, being sequential models, encode this structural information, since sentence elements are represented as sequential features. In Rentoumi et al. (2012), the authors proposed the use of CRF for computing the polarity at the sentence level: this is motivated by the fact that CRF models exploit more information about a sentence, in comparison to HMMs or other approaches (i.e. bag-of-words). In particular in Rentoumi et al. (2012) the authors provided
further experimental evidence that metaphorical expressions assigned a polarity orientation, when exploited by CRF models, have proven valuable in revealing the overall polarity of a sentence. On the other hand, a bag-of-words classifier, such as an SVM classifier, which cannot adequately exploit the structure of sentences, can prove misleading when assigning a polarity to examples (a) and (b), since the bag-of-words representation is the same for both sentences.
Fig. 1: A linear chain CRF.
CRF is an undirected graph model which specifies the joint probability of possible label sequences given an observation. We used a linear chain CRF implementation1 with default settings. Figure 1 shows the abstract structure of the model, where x indicates word senses and y states. A linear chain CRF is able to model arbitrary features from the input instead of just the information concerning the previous state of a current observation (as in Hidden Markov Models). Therefore it is able to take into account contextual information concerning the constituents of a sentence. As already discussed, this is valuable when it comes to evaluating the polarity orientation of a sentence. 1 http://crfpp.sourceforge.net/
In discriminative CRF, we can directly model the conditional probability p(Y|X) of the output Y given the input X. As p(X) is not modelled, CRF allows exploiting the structure of X without modelling the interactions between its parts, but only those with the output. In CRF the conditional probability p(Y|X) is computed as follows:

p(Y|X) = (1/Z(X)) exp( Σ_{t=1}^{T} Σ_{k=1}^{K} λk fk(yt, yt−1, xt) )

Z(X) is a normalization factor that depends on X and on the parameters λk of the training process. In order to train the classification model, i.e. to label xt depicted in Figure 3, indicating a positive/negative/neutral sentiment for the sentence, we exploit information from the previous (xt−1) and the following (xt+1) words, using feature functions fk. A feature function fk looks at a pair of adjacent states yt−1, yt, the whole input sequence X and the position of the current word. This way CRF exploits the structural information of a sentential sequence. The CRF component assigns a sentence (i.e. tweet) polarity to every word participating in a sentential sequence. This sentence-level polarity corresponds to the polarity of a sentence in which this word would most probably participate, according to the contextual information. Then a majority voting process is used to determine the polarity class for a specific sentence. Thus the majority voting does not involve judging the sentence-level polarity from word-level polarities, but from the sentence-level polarities already given to the words by the CRF component. In other words, the final polarity of the sentence is the polarity of the sentence in which the majority of its words would most probably participate. For example, for a sentence of five words, CRF would assign a sentence-level polarity to each word. If there are 3 negative and 2 positive sentence-level polarities given, then the dominant polarity, i.e. the negative one, would be the final result. In order to train the CRF we used tweets derived from Sanders' collection2. This corpus consists of 2471 manually classified tweets. Each tweet was classified with respect to one of four different topics into one of the three classes (positive, negative, and neutral). Every word in a sentence is represented by a feature vector of the form:

feature_vector = (token, sentence_pol)
2 http://www.sananalytics.com/lab/twitter-sentiment/
where "token" is the word and "sentence_pol" is the sentence-level polarity class (positive/negative/neutral). Each sentence (i.e. tweet) is represented by a sequence of such feature vectors. Within the test set, the last feature, representing the class (pos/neg) for the sentence, is absent and has to be predicted. The test set consists of a corpus of four subsets of tweets corresponding to four different topics: ipad, sony experia, samsung galaxy, kindle fire.
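To make the majority-voting step concrete, here is a minimal sketch (our own illustration, not the authors' code); the CRF output is mocked as a list of sentence-level labels, one per word of the tweet.

from collections import Counter

def tweet_polarity(word_level_labels):
    """Majority vote over the sentence-level polarities that the CRF
    assigned to each word of the tweet (positive/negative/neutral)."""
    votes = Counter(word_level_labels)
    label, _ = votes.most_common(1)[0]
    return label

# A five-word tweet: three words voted "negative", two "positive"
print(tweet_polarity(["negative", "negative", "positive", "negative", "positive"]))
# -> negative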
3.2 Prediction Methodology In the current section, we describe the methodology adopted for the prediction of the fluctuation of sales, in particular of electronic devices. The proposed methodology is based on structural machine learning and on sentiment analysis of historical data from Twitter. The rationale behind the adopted methodology lies in the assumption that a prediction of the sales fluctuation of an electronic device (e.g. for ipad) can be based on the sales fluctuation of similar gadgets (e.g. Samsung galaxy, kindle fire etc.). As has already been argued in Section 2, the fluctuation of sales for a product is highly dependent on social media; in particular, it has been argued that there exists a strong correlation between spikes in sales rank and the number of related blog posts. Even though there is a strong correlation between blog posts and sales trends, only few researchers work on the prediction of sales based on social media. We rely on historical data from Twitter and we exploit sentiment as a strong prediction indicator. The latter is based on the fact that on Twitter, organizations or product brands are referred to in about 19% of all tweets, and the majority of them comprise strongly expressed negative or positive sentiments. In the proposed methodology, the prediction about a product's sales is based on the sentiment (positive/negative/neutral) expressed in tweets about similar products, whose sales trends (increase/decrease) are already known. We found there is a correlation among similar products, largely because they share the same market competition. Therefore, using historical data that can adequately simulate the sales trends for a number of similar products can presumably reveal a generic pattern which, consequently, can lead to accurate prediction. Therefore, we introduce a classification function which relates tweets' sentiment and historical information (i.e. tweets' dates) concerning specific electronic devices with their sales trends. In doing so, we intend to test the extent to which these features can be seen as a valuable indicator for the correct prediction of sales trends (i.e. increase/decrease). In our case, the sales trends prediction problem can be seen as a discriminative machine learning classification problem which is dealt with the assistance of
CRF. Table 1 summarizes the number of tweets, accompanied by their corresponding yearly quartiles, that were used for training and testing the CRF model in order to make predictions of the sales trend for a future (i.e. unknown) quartile for an electronic device. For this reason historical tweets' data have been collected for four electronic devices (i.e. ipad, samsung galaxy, sony experia, kindle fire).

Table 1: Historical Twitter data for training and testing the CRF prediction approach. Columns: Device (ipad, samsung galaxy, sony experia, kindle fire); # of tweets; historical tweets' data used for training/testing (date ranges); prediction dates (yearly quartile); correct class (increase/decrease).
As Table 1 depicts, in order to train the CRF we exploited sequential historical data consisting of tweets (i.e. the training set) corresponding to four yearly quartiles for each device; two of them represent the class “increase” and two represent the class “decrease”. The prediction class indicates the correct prediction trend (increase/decrease) for a particular product in a future time frame, as found in the official financial journals. In essence, this study involves conducting predictions for four distinct devices (ipad, samsung galaxy, sony experia, kindle fire). For instance, as Table 1 shows, in order to predict the sales performance of the iPad (i.e. increase/decrease) in the second quartile of 2011, we used as a training set historical
tweets' data and sentiment values expressed on Twitter about the Sony Experia, Kindle Fire and Samsung Galaxy devices in specific earlier quartiles for which we already knew their sales fluctuation. In general, in order to make predictions for each device, the CRF is trained with tweets' sequential data for the remaining three devices. The tweets' data for each device, as shown in Table 1, are taken from two yearly quartiles for each class, decrease/increase. As a test set we used tweet sequences for the iPad taken from the quartile immediately preceding the one for which we wish to make a prediction, that is the first quartile of 2011, as well as from the corresponding quartile of the previous year, that is the second quartile of 2010. The information about specific time intervals of drop or rise of sales has been taken from prestigious financial and IT online media3. The training and test data integrated in the prediction methodology are detailed in Table 1. More formally, the CRF component assigns to a tweets' sequence – which is defined as a series of consecutive tweets taken from two yearly quartiles – a sales trend (i.e. increase/decrease) for every tweet used in this sequence. This trend corresponds to the trend of a tweet sequence in which this tweet would most probably participate. Then a majority voting process is used to decide the prediction class for a specific tweets' sequence. For example, if there are 3 decrease and 2 increase sequence-based sales trends given to the tweets comprising the sequence, then the dominant trend for this sequence would be the class decrease. More formally, within the training process every tweet in a tweets' data sequence is represented by a feature vector of the form:

feature_vector = (Date, SentimentProbability, SentimentPolarity, PredictionTrend)
where "Date" is the date of the corresponding tweet, "SentimentProbability" is the output sentiment probability derived within the sentiment analysis stage (a) for that tweet, "SentimentPolarity" is the output sentiment value (positive, negative, neutral) extracted within the sentiment analysis stage (a), and "PredictionTrend" is the trend (increase/decrease) assigned to this tweet, representing the sales trend of the whole tweets' sequence; each tweets' sequence representing a class (increase or decrease) consists of tweets corresponding to two yearly quartiles. Therefore, each tweets' data sequence is represented by a sequence of such feature vectors put in chronological order (according to the values of their "Date" feature).
3 Among others: The Financial Times, Business Insider, Reuters, The Telegraph, CNET, ZDnet, Mobile Statistics.
Within the test
set, the last feature, representing the trend (increase/decrease) for a tweets' sequence representing a device, is absent and has to be predicted.
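As an illustration of how such sequences might be assembled, the sketch below uses hypothetical field names (the actual feature encoding used with the CRF toolkit is not given in the chapter): each tweet contributes its date, sentiment probability and polarity, the vectors are sorted chronologically, and the known sales trend of the whole sequence is attached as the label of every tweet.

from datetime import date

def build_sequence(tweets, trend):
    """tweets: list of dicts with 'date', 'sent_prob' and 'polarity'
    (output of the sentiment analysis stage); trend: 'increase' or 'decrease'.
    Returns (features, labels), one label per tweet, as a linear-chain CRF expects."""
    ordered = sorted(tweets, key=lambda t: t["date"])
    features = [
        {"date": t["date"].isoformat(),
         "sent_prob": t["sent_prob"],
         "polarity": t["polarity"]}
        for t in ordered
    ]
    labels = [trend] * len(ordered)   # sequence-level trend repeated per tweet
    return features, labels

# Two tweets about kindle fire from a quartile known to show an increase
tweets = [
    {"date": date(2011, 10, 3), "sent_prob": 0.81, "polarity": "positive"},
    {"date": date(2011, 11, 17), "sent_prob": 0.64, "polarity": "neutral"},
]
print(build_sequence(tweets, "increase"))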
4 Experimental Results To verify our claim that the proposed prediction approach is well suited to the particular task, we compare it with a system variant that uses Naive Bayes instead of CRF within the last step of the methodology (stage b, which assigns a prediction class for the sales trend to tweets' sequential data). In order to evaluate this method of classifying tweets' sequential data concerning four electronic devices, we used a 4-fold cross-validation method. We took into consideration data for four electronic devices; the algorithm was run 4 times, each time using a different combination of 3 subsets for training and the remaining one for testing. So each classification fold predicts a sales trend for one of the four devices in question, using its corresponding tweets' sequential data as a test set, while the tweets' sequential data for the three remaining devices are used as a training set (Table 1). The result of the cross-validation is the average performance of the algorithm over the 4 runs. The available tweets' data sequences (data points for training and testing) are two for each product, each one representing a product's trend to increase or to decrease. Table 2 illustrates, for the increase/decrease prediction task, the confusion matrices and the accuracy scores for the proposed methodology in comparison with a Naive Bayes approach. Each class (increase/decrease) includes four tweets' data sequences, each representing one of the four devices. As Table 2 reports, our method exhibits a higher level of accuracy than that of the NB variant (0.50) (p < 0.05). Moreover, the results indicate that the structural information of the tweets' data series is the feature best suited to this task.
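The 4-fold set-up described above amounts to leave-one-device-out cross-validation; a schematic sketch follows (the train and predict calls are placeholders standing in for the CRF component, not a real API).

def leave_one_device_out(sequences_by_device, train, predict):
    """sequences_by_device maps a device name to its labelled tweet sequences,
    each given as (features, trend); train and predict stand in for the CRF
    training and prediction calls."""
    devices = list(sequences_by_device)
    correct = total = 0
    for held_out in devices:
        training = [seq for dev in devices if dev != held_out
                    for seq in sequences_by_device[dev]]
        model = train(training)
        for features, true_trend in sequences_by_device[held_out]:
            correct += predict(model, features) == true_trend
            total += 1
    return correct / total   # average accuracy over the four runs

# Hypothetical usage: folds over
# {"ipad": [...], "samsung galaxy": [...], "sony experia": [...], "kindle fire": [...]}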
Table 2: System evaluation using only CRF for the prediction of sales trends vs Naive Bayes: error threshold 5%, p-value: 0.06. Rows: correct class (increase, n = 4; decrease, n = 4); columns: predicted class (increase/decrease) under the CRF and the NB models; accuracy: CRF 88%, NB 63%.
In parallel, we aim to show that when our method integrates more linguistic information into the CRF component, it is more representative of the structure of the sentence and can therefore judge the polarity orientation of a sentence more accurately than a feature representation that does not take enough information from the sentential context into account.
5 Conclusions & Discussion This chapter presents a two-stage approach for the prediction of sales trends of products based on tweets' sequential data, taking into account the sentiment values expressed through these tweets. We provided empirical and experimental evidence for the appropriateness of the particular computational approach, which is based on structural models, compared to approaches that are based on bag-of-words representations. In particular, we showed that since CRF exploits structural information concerning a tweets' data time series, it can capture non-local dependencies of important sentiment-bearing tweets. In doing so, we verified the assumption that the representation of structural information exploited by the CRF simulates the semantic and sequential structure of the time series data and the sales trend that they represent.
References Asur, Sitaram & Bernardo A. Huberman. 2010. Predicting the future with social media. In Proceedings of the IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology, 492–499. Bollen, Johan, Huina Mao & Xiao-Jun Zeng. 2011. Twitter mood predicts the stock market. Journal of Computational Science 2(1). 1–8.
Choi, Yejin, Eric Breck & Claire Cardie. 2006. Joint extraction of entities and relations for opinion recognition. In Proceedings of the conference on empirical methods in natural language processing, 431–439. Association for Computational Linguistics. Cootner, Paul H. 1964. The random character of stock market prices. Cambridge (MA): MIT Press. Dodds, Peter & Christopher Danforth. 2010. Measuring the happiness of large-scale written expression: Songs, blogs, and presidents. Journal of Happiness Studies 11(4). 441–456. Eppen, Gary D. & Eugene F. Fama. 1969. Cash balance and simple dynamic portfolio problems with proportional costs. International Economic Review 10(2). 119–133. Fama, Eugene F. 1965. The behavior of stock-market prices. Journal of Business 38(1). 34–105. Fama, Eugene F. 1991. Efficient capital markets: II. Journal of Finance 46(5). 1575–1617. Gallagher, Liam A. & Mark P. Taylor. 2002. Permanent and temporary components of stock prices: Evidence from assessing macroeconomic shocks. Southern Economic Journal 69(2). 345–362. Gilbert, Eric & Karrie Karahalios. 2010. Widespread worry and the stock market. In Proceedings of the fourth international AAAI conference on weblogs and social media, 58–65. Gruhl, Daniel, Ramanathan Guha, Ravi Kumar, Jasmine Novak & Andrew Tomkins. 2005. The predictive power of online chatter. In Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining, 78–87. Gunawardana, Asela, Milind Mahajan, Alex Acero & John C. Platt. 2005. Hidden conditional random fields for phone classification. In Ninth European conference on speech communication and technology, 1117–1120. Kavussanos, Manolis & Everton Dockery. 2001. A multivariate test for stock market efficiency: the case of ASE. Applied Financial Economics 11(5). 573–579. Lafferty, John, Andrew McCallum & Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th international conference on machine learning, 282–289. Lerman, Kristina & Tad Hogg. 2010. Using a model of social dynamics to predict popularity of news. In Proceedings of the 19th international conference on World Wide Web, 621–630. Liu, Yang, Xiangji Huang, Aijun An & Xiaohui Yu. 2007. Arsa: A sentiment-aware model for predicting sales performance using blogs. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, 607–614. Mao, Yi & Guy Lebanon. 2007. Isotonic conditional random fields and local sentiment flow. In Advances in Neural Information Processing Systems 19. http://papers.nips.cc/paper/3152-isotonic-conditional-random-fields-and-local-sentiment-flow.pdf (accessed 1 December 2014). McDonald, Ryan, Kerry Hannan, Tyler Neylon, Mike Wells & Jeff Reynar. 2007. Structured models for fine-to-coarse sentiment analysis. In Proceedings of the 45th annual meeting of the Association of Computational Linguistics, 432–439. Mishne, Gilad & Natalie Glance. 2006. Predicting movie sales from blogger sentiment. In AAAI symposium on computational approaches to analysing weblogs, 155–158. Pak, Alexander & Patrick Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the 7th conference on international language resources and evaluation, 1320–1326. Qian, Bo & Khaled Rasheed. 2007. Stock market prediction with multiple classifiers. Applied Intelligence 26(1). 25–33.
Quattoni, Ariadna, Michael Collins & Trevor Darrell. 2004. Conditional random fields for object recognition. Proceedings of Neural Information Processing Systems, 1097–1104. Rentoumi, Vassiliki, George Giannakopoulos, Vangelis Karkaletsis & George Vouros. 2009. Sentiment analysis of figurative language using a word sense disambiguation approach. In International conference on recent advances in natural language processing (RANLP 2009), 370–375. Rentoumi, Vassiliki, George A. Vouros, Vangelis Karkaletsis & Amalia Moser. 2012. Investigating metaphorical language in sentiment analysis: A sense-to-sentiment perspective. ACM Transactions on Speech and Language Processing 9(3). Article no. 6. Romero, Daniel M., Wojciech Galuba, Sitaram Asur & Bernardo A. Huberman. 2011. Influence and passivity in social media. In Proceedings of the 20th international conference companion on World Wide Web, 113–114. Sadamitsu, Kugatsu, Satoshi Sekine & Mikio Yamamoto. 2008. Sentiment analysis based on probabilistic models using inter-sentence information. In Proceedings of the 6th international conference on language resources and evaluation, 2892–2896. Sharda, Ramesh & Dursun Delen. 2006. Predicting box-office success of motion pictures with neural networks. Expert Systems with Applications 30(2). 243–254. Szabo, Gabor & Bernardo A. Huberman. 2010. Predicting the popularity of online content. Communications of the ACM 53(8). 80–88. Yu, Sheng & Subhash Kak. 2012. A survey of prediction using social media. The Computing Research Repository (CoRR). Zhang, Wenbin & Steven Skiena. 2009. Improving movie gross prediction through news analysis. In Proceedings of the IEEE/WIC/ACM international joint conference on web intelligence and intelligent agent technology - volume 1, 301–304.
Andrij Rovenchak
Where Alice Meets Little Prince Another approach to study language relationships
1 Introduction Quantitative methods, including those developed in physics, have proved to be quite successful for studies in various domains, including biology (Ogasawara et al. 2003), social sciences (Gulden 2002), and linguistics (Fontanari and Perlovsky 2004, Ferrer i Cancho 2006, Čech et al. 2011). According to a recently suggested model (Rovenchak and Buk 2011a), which is based on the analogy between the word frequency structure of a text and the Bose distribution in physics, a set of parameters can be obtained to describe texts. These parameters are related to the grammar type or, more precisely, to the analyticity level of a language (Rovenchak and Buk 2011b). One of the newly defined parameters is an analog of the temperature in physics. One should not confuse this term with other definitions of the “temperature of text” introduced by, e.g., Mandelbrot (1953), de Campos (1982), Kosmidis et al. (2006), Miyazima and Yamamoto (2008), etc. In the present work, two famous novels are analyzed, namely Alice’s Adventures in Wonderland by Lewis Carroll (also known under the shortened title Alice in Wonderland) and The Little Prince by Antoine de Saint-Exupéry. These texts have been translated into numerous languages from different language families and are thus seen as suitable material for contrastive studies. The calculations are made for about forty translations of The Little Prince (LP) and some twenty translations of Alice in Wonderland (AW), with both texts available in thirteen languages. The following are the languages for LP: Arabic, Armenian, Azerbaijani, Bamana, Belarusian, Bulgarian, Catalan, Czech, German, Dutch, English, Esperanto*, Spanish, Estonian, Euskara (Basque), Farsi, French, Georgian, Greek, Hebrew, Hindi, Croatian, Hungarian, Italian, Lojban*, Korean, Latvian, Lithuanian, Mauritian Creole, Mongolian, Polish, Portuguese, Romanian, Russian (2 texts), Serbian, Turkish, Ukrainian, Vietnamese, Chinese, Thai. The following are the languages for AW:
Bulgarian, Cymraeg (Welsh), German, English, Esperanto*, French, Gaelic, Hawaiian, Hungarian, Italian, Lojban*, Latin, Lingua Franca Nova*, Polish, Romanian, Russian (3 texts), Swedish, Swahili, Ukrainian, Chinese. Therefore, both texts are available in the following thirteen languages: Bulgarian, German, English, Esperanto*, French, Hungarian, Italian, Lojban*, Polish, Romanian, Russian, Ukrainian, Chinese. In the above lists, asterisks mark artificial languages. For both texts, the available languages allow for some diversity in studying interlingual relations within groups and families, as well as a variety of grammar types, from highly synthetic to highly analytic languages. Note that some ideas on which this contribution is based were recently applied by the author to different language material (Rovenchak 2014).
2 Method description For a given text, the frequency list of words is compiled. In order to avoid issues of lemmatization for inflectional languages, words are understood as orthographical words – i.e., alphanumeric sequences between spaces and/or punctuation marks. Thus, for instance, ‘hand’ and ‘hands’ are treated as different words (different types). Orthographies without “western” word separation can be studied within this approach with some precautions. These will be discussed in Section 3 with respect to texts in the Chinese, Japanese, and Thai languages. Let the types with the absolute frequency equal to j occupy the jth “energy level”. The number of such types is the occupation number Nj. A somewhat special role is assigned to hapax legomena (types or words occurring only once in a given text); their number is N1. Evidently, there is no natural ordering of words within the level they occupy, which is seen as an analog of the indistinguishability of particles in quantum physics, and thus a quantum distribution can be applied to model the frequency behavior of words. Since the number of types with a given frequency can be very large, it seems that the Bose-distribution (Isihara 1971: 82; Huang 1987: 183) is relevant to this problem. The occupation number for the jth level in the Bose-distribution equals:

Nj = 1 / (z⁻¹ e^(εj/T) − 1),   (1)

where T is temperature, z is called activity or fugacity, and εj is the energy of the jth level (excitation spectrum). As was shown previously (Rovenchak and Buk 2011a; 2011b), the following power dependence of εj on the level number j leads to a proper description of the observed data for small j:

εj = (j − 1)^α,   (2)

where the values of α appear to be within the domain between 1 and 2. The unity is subtracted to ensure that the lowest energy ε1 = 0. Therefore, one can easily obtain the relation between the number of hapaxes N1 and the fugacity analog z:

z = N1 / (N1 + 1).   (3)

The parameters T and α are defined from observable frequency data for every text or its part. The procedure is as follows. First, the fugacity analog z is calculated from the number of hapaxes using Eq. (3). Then, observed values of Nj with j = 2 ÷ jmax are fitted to

Nj = 1 / (z⁻¹ e^((j−1)^α / T) − 1).   (4)

The upper limit jmax for fitting can be defined empirically. Due to the nature of word frequency distributions, the occupations of high levels rapidly decrease, or, being more precise, Nj is either 1 or 0 for j large enough, as this is the number of words with distinct large absolute frequencies. To define the value of jmax one can use some naturally occurring separators linked to frequency distributions. In this work, the k-point is applied. The definition of the k-point is similar to that of the h-point rh in the rank-frequency distribution, the latter being the solution of the equation f(rh) = rh, where f(r) is the absolute frequency of a word with rank r. The k-point corresponds to the so-called cumulative distribution (Popescu and Altmann 2006); thus it is the solution of the equation Nj = j. An extension of these definitions must be applied if the respective equations do not have integer roots. In practice, however, some average values were defined based on different translations of each novel. Such an approach simplifies the automation of the calculation process and, according to preliminary verifications, does not influence the values of the fitting parameters significantly due to the rapid exponential decay of the fitting function (4) for large j. The use of the k-point can be justified by the observation that the low-frequency vocabulary is composed mostly of autosemantic (full-meaning) words (Popescu et al. 2009: 37).
Note that the parameters of the model of the frequency distribution (T, z, and α) are not given any in-depth interpretation based on the analogy with the respective physical quantities. Possibility of such a parameter attribution is yet to be studied.
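The fitting procedure of this section can be illustrated with a short numerical sketch (our own illustration, not the author's code), using SciPy to fit T in Eq. (4) with α held fixed and taking z from Eq. (3); the k-point computation indicates one way of choosing jmax.

import numpy as np
from scipy.optimize import curve_fit

def k_point(N):
    """Integer stand-in for the k-point: the largest j with N_j >= j."""
    return max(j for j, n in enumerate(N, start=1) if n >= j)

def fit_temperature(N, alpha, j_max):
    """N[j-1] is the number of types occurring exactly j times.
    Returns (z, T): z from Eq. (3), T fitted to Eq. (4) for j = 2..j_max,
    with alpha kept fixed."""
    z = N[0] / (N[0] + 1.0)                                              # Eq. (3)
    j = np.arange(2, j_max + 1)
    bose = lambda j, T: 1.0 / (np.exp((j - 1) ** alpha / T) / z - 1.0)   # Eq. (4)
    T0 = 1.0 / np.log(z * (1.0 + 1.0 / N[1]))                            # rough guess from N_2
    (T,), _ = curve_fit(bose, j, N[1:j_max], p0=[T0])
    return z, T

# Occupation numbers N_1..N_10 for the English Little Prince (Table 1 below)
N_eng = np.array([1032, 359, 206, 115, 65, 39, 37, 27, 22, 19], float)
print(fit_temperature(N_eng, alpha=1.40, j_max=10))
# k_point would be applied to the full list of occupation numbers of a text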
3 Analysis of full texts Full texts were analyzed first. Several comments are required regarding the translations into Chinese, Japanese, and Thai. None of these languages uses a writing system with a word separation typical of western orthographies. Therefore, the frequency of characters instead of that of words was studied in the Chinese translations. For the Japanese translations, special software was used to obtain word frequencies. The Thai translation of The Little Prince is demonstrated for comparison: in the Thai script, spaces separate phrases but not single words. Several frequency lists are shown in Tables 1 and 2 to demonstrate how the k-point is used to estimate jmax.

Table 1: Observed values of Nj for different translations of The Little Prince. The nearest js to the k-points are marked in bold italic and grey-shaded.

j    Chinese  Lojban  English  French  Polish  Russian  Ukrainian
1    420      582     1032     1578    2025    1916     2207
2    213      221     359      399     574     571      513
3    140      124     206      176     261     271      216
4    92       84      115      107     122     146      115
5    78       69      65       71      95      96       81
6    43       45      39       44      52      39       54
7    51       53      37       40      37      38       27
8    41       25      27       28      33      43       28
9    20       21      22       18      24      24       18
10   20       23      19       22      15      18       19
11   23       27      20       10      8       11       12
12   17       17      18       12      15      11       11
13   17       16      12       14      10      7        10
14   18       19      13       3       15      8        14
15   15       6       7        12      3       6        5
16   11       10      7        7       9       4        4
17   10       7       6        4       6       7        6
18   16       5       6        9       4       6        –
19   9        9       6        5       6       4        3
20   8        6       4        4       2       6        4
As Table 1 demonstrates, one can use the following value for The Little Prince: jmax = 10, which appears to be close to the mean of the k-points for the whole set of languages.

Table 2: Observed values of Nj for different translations of Alice in Wonderland. The nearest js to the k-points are marked in bold italic and grey-shaded.

j    Chinese  Lojban  English  French  Polish  Russian  Ukrainian
1    460      939     1456     2070    3141    2709     3328
2    241      294     558      615     1000    1051     1142
3    146      172     312      305     468     474      524
4    86       118     187      183     225     292      243
5    97       91      117      110     127     171      140
6    72       60      71       84      98      110      77
7    57       57      55       57      65      59       67
8    32       37      59       52      59      52       40
9    30       29      29       32      40      27       32
10   37       36      46       34      27      30       34
11   32       31      27       20      20      26       29
12   36       25      33       23      17      17       18
13   22       23      17       22      18      18       6
14   25       22      18       20      14      16       11
15   19       19      16       17      11      6        12
16   11       16      9        20      11      7        6
17   18       7       12       8       11      9        5
18   16       15      9        3       6       11       8
19   15       12      8        8       10      4        7
20   12       12      8        7       5       3        9
From Table 2 the estimate for Alice can be taken as jmax = 15 on the same grounds as above. After the fitting procedure is applied, a pair of T and α parameters is obtained for each text. It was established previously (Rovenchak and Buk 2011a; 2011b) that the parameter τ = ln T / ln N has a weak dependence on the text length N and thus can be used to compare texts in different languages, as they can contain different numbers of tokens. This is related to the parameter scaling discussed in Section 4. Each text is thus represented by a point on the α–τ plane or, more precisely, by a domain determined through the uncertainty of the fitting. Results are shown in Figures 1 and 2.
Fig. 1: Positions of different translations of Alice in Wonderland on the α–τ plane (a). An enlarged view is given on the bottom panel (b). Language codes are explained in the Appendix. For clarity, not all the analyzed languages are shown.
Fig. 2: Positions of different translations of The Little Prince on the α–τ plane (a). An enlarged view is given on the bottom panel (b). Language codes are explained in the Appendix. For clarity, not all the analyzed languages are shown.
Interestingly, the texts are mostly located on a wide band stretching between small-α–small-τ and large-α–large-τ values. In the lower-left corner, languages of a highly analytical nature are placed, namely: Chinese, Vietnamese, Lojban, Hawaiian, Bamana, Mauritian Creole. The top-right corner is, on the contrary, occupied by highly synthetic languages (the Slavic group, Hungarian, Latin, etc.). Intermediate positions correspond to languages with a medium analyticity level.
4 Parameter scaling The main focus of this study is on the dependence of the defined parameters on text length, i.e. on the development of the parameters in the course of text production. The parameters appear to be quite stable (their behavior being easily predictable) for highly analytic languages (Chinese and Lojban, an artificial language). On the other hand, even if a difference is observed between the versions of the two novels in the same language, the divergence is not significant between different translations. This means that translator strategies do not influence the parameter values much; rather, the text type (genre) is more important. To study the scaling of the temperature parameter, the languages were grouped into several classes depending on the coarsened mean value of the α exponent: α = 1.15: Lojban; α = 1.20: Chinese; α = 1.40: English; α = 1.45: French; α = 1.50: Bulgarian, German, Esperanto, Italian, Romanian; α = 1.60: Hungarian, Polish, Russian, Ukrainian. With the value of α fixed as above, the temperature T was calculated consecutively for the first 1000, 2000, 3000, etc. words of each text. The results for T demonstrate a monotonously increasing dependence on the number of words N. The simplest model for the temperature scaling was tested, namely:

T = t N^β.   (5)

Obviously, the new parameters, t and β, are not independent of the previously defined τ = ln T / ln N:

β = τ − ln t / ln N,   (6)
(6)
Results of the fitting are demonstrated in Figure 3; the parameters of the temperature scaling are given in Table 3. Note that the range of N for the fitting was set to N = 10000 ÷ 40000.
Fig. 3: “Temperature” T (vertical axis) versus text length N (horizontal axis) for several languages. Longer series of data correspond to Alice in Wonderland. The solid line is a fit for Alice in Wonderland, the dashed line is a fit for The Little Prince.
According to Eq. (6), the scaling exponent β is related to language analyticity, as is the parameter τ, which is confirmed by Table 3. The abnormally large difference between both parameters for some translations of the two texts under consideration (e.g., Italian or Hungarian, cf. also Figures 1 and 2) requires additional detailed analysis.

Table 3: Scaling parameters for temperature T. The results of fitting are within intervals t ± Δt and β ± Δβ. Language codes are explained in the Appendix.

Code  Text  t     Δt    β     Δβ
blg   LP    0.19  0.06  0.87  0.04
blg   AW    0.20  0.04  0.88  0.02
deu   LP    1.19  0.24  0.64  0.02
deu   AW    0.62  0.05  0.72  0.01
eng   LP    0.49  0.07  0.73  0.02
eng   AW    0.57  0.05  0.72  0.01
epo   LP    0.23  0.05  0.82  0.02
epo   AW    0.61  0.04  0.73  0.01
fra   LP    0.85  0.18  0.68  0.02
fra   AW    0.53  0.05  0.73  0.01
hun   LP    0.28  0.06  0.84  0.02
hun   AW    0.15  0.02  0.93  0.01
ita   LP    0.52  0.12  0.74  0.03
ita   AW    0.21  0.02  0.89  0.01
jbo   LP    2.58  0.51  0.51  0.02
jbo   AW    4.20  0.46  0.45  0.01
pol   LP    0.15  0.04  0.93  0.03
pol   AW    0.23  0.03  0.88  0.01
ron   LP    0.46  0.12  0.75  0.03
ron   AW    0.47  0.05  0.76  0.01
rus   LP1   0.10  0.03  0.96  0.04
rus   LP2   0.13  0.02  0.92  0.01
rus   AW1   0.17  0.02  0.92  0.01
rus   AW2   0.14  0.01  0.94  0.01
rus   AW3   0.14  0.02  0.94  0.01
ukr   LP    0.33  0.12  0.82  0.04
ukr   AW    0.16  0.01  0.93  0.01
zho   LP    5.61  1.09  0.44  0.02
zho   AW    9.87  1.37  0.38  0.01
5 Text trajectories Text trajectories are obtained for each translation. The values of the two parameters correspond to a point, and their change as the text grows is a trajectory on the respective plane. First of all, one can observe that stability in the parameter values is not achieved immediately, for a small text length. Generally, a number of tokens greater than several thousand is preferable, and the translations of The Little Prince are close to the lower edge with respect to text length (cf. Rovenchak and Buk 2011b).
Fig. 4: Trajectories of Russian translations of The Little Prince (+ and × symbols) and Alice in Wonderland (□, ■, and ✴). As with the scaling analysis, a step of 1000 was used; thus, the first point corresponds to the first 1000 tokens, the second one to the first 2000 tokens, etc. Error bars are not shown for clarity.
Results for text trajectories are demonstrated in Figure 4 for Russian, with two translations of The Little Prince and three translations of Alice in Wonderland. There is a qualitative similarity in the shapes of these trajectories, while numerically different results are obtained for different texts, which, as was already mentioned in Section 4, suggests the influence of genre on the parameter values (even within a single language).
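A trajectory of this kind can be obtained by repeating the fit of Section 2 on growing prefixes of the token sequence; the sketch below is our own illustration, with the temperature-fitting routine passed in as a placeholder.

from collections import Counter
import math

def trajectory(tokens, fit_parameters, j_max=15, step=1000):
    """(alpha, tau) points for growing prefixes of a text.
    tokens: list of word tokens; fit_parameters: placeholder for a routine
    that fits Eq. (4) to a list of occupation numbers and returns (T, alpha)."""
    points = []
    for n in range(step, len(tokens) + 1, step):
        freqs = Counter(tokens[:n])             # word frequency list of the prefix
        occupation = Counter(freqs.values())    # N_j: number of types with frequency j
        N = [occupation.get(j, 0) for j in range(1, j_max + 1)]
        T, alpha = fit_parameters(N)
        points.append((alpha, math.log(T) / math.log(n)))   # tau = ln T / ln n
    return points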
6 Discussion A method to analyze texts using an approach inspired by a statistical-mechanical analogy has been described and applied to studying translations of two novels, Alice in Wonderland and The Little Prince. A result obtained earlier is confirmed: there exists a correlation between the level of language analyticity and the values of the parameters calculated in this approach. The behavior of the parameters in the course of text production has been studied. Within the same language, the dependence on the translator of a given text appears much weaker than the dependence on the text genre. So far, it is not yet possible to provide an exact attribution of a language with respect to the parameter values, since the influence of genre has not been studied in detail. Further prospects of the presented approach are seen in analyzing texts of different genres which have been translated into a number of languages, preferably from different families. These could be religious texts (though generally very specific in language style) or some popular fiction works (e.g., Pinocchio by Carlo Collodi, Twenty Thousand Leagues under the Sea by Jules Verne, Winnie-the-Pooh by Alan A. Milne, The Alchemist by Paulo Coelho, etc.). With texts translated into now-extinct languages (e.g., Latin or Old Church Slavonic), some hints about language evolution can be obtained as well (Rovenchak 2014).
Acknowledgements I am grateful to Michael Everson for providing electronic versions of Alice in Wonderland translated into Cymraeg (Welsh), Gaelic, Hawaiian, Latin, and Swedish.
References de Campos, Haroldo. 1982. The informational temperature of the text. Poetics Today 3(3). 177– 187. Čech, Radek, Ján Mačutek & Zdeněk Žabokrtský. 2011. The role of syntax in complex networks: Local and global importance of verbs in a syntactic dependency network. Physica A 390(20). 3614–3623. Ferrer i Cancho, Ramon. 2006. Why do syntactic links not cross? Europhysics Letters 76(6). 1228–1234. Fontanari, José F. & Leonid I. Perlovsky. 2004. Solvable null model for the distribution of word frequencies. Physical Review E 70(4). 042901. Gulden, Timothy R. 2002. Spatial and temporal patterns of civil violence in Guatemala, 1977– 1986. Politics and the Life Sciences 21(1). 26–36. Huang, Kerson. 1987. Statistical mechanics. 2nd edn. New York: Wiley. Isihara, Akira. 1971. Statistical physics. New York & London: Academic Press. Kosmidis, Kosmas, Alkiviadis Kalampokis & Panos Argyrakis. 2006. Statistical mechanical approach to human language. Physica A 366. 495–502. Mandelbrot, Benoit. 1953. An informational theory of the statistical structure of language. In Willis Jackson (ed.), Communication theory, 486–504. London: Butterworths. Miyazima, Sasuke & Keizo Yamamoto. 2008. Measuring the temperature of texts. Fractals 16(1). 25–32. Ogasawara, Osamu, Shoko Kawamoto & Kousaku Okubo. 2003. Zipf's law and human transcriptomes: An explanation with an evolutionary model. Comptes rendus biologies 326(10–11). 1097–1101. Popescu, Ioan-Iovitz & Gabriel Altmann. 2006. Some aspects of word frequencies. Glottometrics 13. 23–46. Popescu, Ioan-Iovitz, Gabriel Altmann, Peter Grzybek, Bijapur D. Jayaram , Reinhard Köhler, Viktor Krupa, Ján Mačutek, Regina Pustet, Ludmila Uhlířová & Matummal N. Vidya. 2009. Word frequency studies (Quantitative Linguistics 64). Berlin & New York: Mouton de Gruyter. Rovenchak, Andrij. 2014. Trends in language evolution found from the frequency structure of texts mapped against the Bose-distribution. Journal of Quantitative Linguistics 21(3). 281– 294. Rovenchak, Andrij & Solomija Buk. 2011a. Application of a quantum ensemble model to linguistic analysis. Physica A 390(7). 1326–1331. Rovenchak, Andrij & Solomija Buk. 2011b. Defining thermodynamic parameters for texts from word rank–frequency distributions. Journal of Physical Studies 15(1). 1005.
Appendix: Language Codes

ara  Arabic              hin  Hindi
arm  Armenian            hrv  Croatian
aze  Azerbaijani         hun  Hungarian
bam  Bamana              ita  Italian
bel  Belarusian          jbo  Lojban
blg  Bulgarian           jpn  Japanese
cat  Catalan             kor  Korean
cym  Cymraeg (Welsh)     lat  Latin
cze  Czech               lfn  Lingua Franca Nova
deu  German              lit  Lithuanian
dut  Dutch               mcr  Mauritian Creole
eng  English             mon  Mongolian
epo  Esperanto           pol  Polish
eps  Spanish             ron  Romanian
est  Estonian            rus  Russian
eus  Euskara (Basque)    ser  Serbian
far  Farsi               sve  Swedish
fra  French              swa  Swahili
gai  Gaelic              tur  Turkish
geo  Georgian            ukr  Ukrainian
gre  Greek               vie  Vietnamese
haw  Hawaiian            zho  Chinese
Peter Zörnig
A Probabilistic Model for the Arc Length in Quantitative Linguistics 1 Introduction Texts are written or spoken in the form of linear sequences of some entities. From the qualitative point of view, they are sequences of phonic, lexical, morphological, semantic and syntactical units. Qualitative properties are secondary for quantitative linguistics; hence they are simply omitted from the research. Thus a text may be presented as a sequence (x1,…,xn) whose elements represent phonic, lexical, grammatical, morphological, syntactic or semantic entities. Of course, the positions of entities depend, among others, on the grammar, when a sentence is constructed, or on a historical succession of events, when a story is written, and both aspects are known to the speaker. What is unknown and happens without conscious application are the effects of some background laws which cannot be learned or simply applied. The speaker abides by them without knowing them. A number of such laws – i.e. derived and sufficiently positively tested statements – can be discovered using the sequential form of the text. If such a sequence is left in its qualitative form, distances between equal entities appear, and every text may be characterized using some function of the given distances. If the text is transformed into a quantitative sequence (e.g. in terms of word lengths, sentence lengths, word frequencies, semantic complexities etc.), then we have a time series, which is the object of an extensive statistical discipline. Even if one does not attribute positions to the entities, one may find Markov chains, transition probabilities, autocorrelations, etc. If one replaces the entities by quantities, a number of other methods become available which may reveal hidden structures of language that would not be accessible in any other way. The original complete text shows only deliberate facts such as meanings and grammatical rules, but rewriting it e.g. in terms of measured morphological complexities may reveal some background mechanisms. Their knowledge may serve as a basis for future research. The text in its quantitative form may change to a fractal, it may display regular or irregular oscillation, the distances between neighbors can be measured, etc. Here we restrict ourselves to the study of the arc length expressing the sum
of Euclidean distances between the measured properties of neighbors. The arc length of a sequence (x1,…,xn) is defined by
L := Σ_{i=1}^{n−1} √((xi − xi+1)² + 1).   (1)
It is illustrated geometrically in Fig. 1 for the sequence (x1,…,x6) = (4,1,2,5,2,3), where L = √10 + √2 + √10 + √10 + √2 = 12.32.
Fig. 1: Arc length
The arc length is an alternative to the usual statistical measures of variation. It is frequently used in linguistics (see Popescu et al. (2009, 2010, 2011) and Wälchli (2011, p. 7)). In the aforementioned literature, (x1,…,xn) is assumed to be a rank-frequency sequence. In the present article the elements of the sequence may represent any linguistic entities.
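Definition (1) translates directly into code; the following minimal sketch (our own illustration) reproduces the value from Fig. 1.

import math

def arc_length(x):
    """Arc length of a sequence according to Eq. (1)."""
    return sum(math.sqrt((a - b) ** 2 + 1) for a, b in zip(x, x[1:]))

print(round(arc_length([4, 1, 2, 5, 2, 3]), 2))   # 12.32, as in Fig. 1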
2 The probability model We model the arc length as a discrete random variable defined by

L = Σ_{i=1}^{n−1} √((Xi − Xi+1)² + 1),   (2)
where X1,...,Xn are independently and identically distributed random variables assuming values from the set of integers M = {1,...,m}. The number m represents the inventory size of the abstract text (X1,...,Xn) of length n, whose elements are assumed to be integers that count observed linguistic entities. We are interested in the characteristics of the distribution of L. To this end we first introduce the following two auxiliary variables Z and B related to an element of the sum in (2). Let X and Y be independently and identically distributed variables taking values over M, such that

P(X = i) = P(Y = i) = pi for i = 1,...,m (pi ≥ 0 for all i and p1 + ... + pm = 1).   (3)

We consider the variables

Z := |X − Y|   (4)

and

B := √(Z² + 1) = √((X − Y)² + 1).   (5)

Example 1: For m = 5 the possible values of Z are illustrated in the following table:

         X
Y        1   2   3   4   5
1        0   1   2   3   4
2        1   0   1   2   3
3        2   1   0   1   2
4        3   2   1   0   1
5        4   3   2   1   0

a)
It follows immediately that Z assumes the values 0, 1,…,4 with probabilities P(Z = 0) = p12+…+ p52, P(Z = 1) = 2 (p1p2 + p2p3 + p3p4 + p4p5), P(Z = 2) = 2 (p1p3 + p2p4 + p3p5), P(Z = 3) = 2 (p1p4 + p2p5), P(Z = 4) = 2 p1p5. For example, Z takes on the value 3 in the four cases: X = 4 and Y = 1; X = 5 and Y = 2; X = 1 and Y = 4; X = 2 and Y = 5, occurring with probabilities p1p4, p2p5, p1p4, p2p5, respectively. b) The variable B has the value √𝑖𝑖 2 + 1 if and only if Z has the value i, since B is a monotone increasing function of Z (see (4), (5)). Therefore 𝑃𝑃𝑃𝑃𝑃 = √𝑖𝑖 2 + 1� = 𝑃𝑃(𝑍𝑍 = 𝑖𝑖) for i=0,...,4, and from part (a) we obtain: P(B = 1) = p12+…+ p52, P(B = √2) = 2 (p1p2 + p2p3 + p3p4 + p4p5), P(B =√5 ) = 2 (p1p3 + p2p4 + p3p5), P(B =√10 ) = 2 (p1p4 + p2p5),
P(B = √17) = 2p_1p_5.

Generalizing the example yields the following result.

Theorem 1: The distributions of Z and B are given by
a) P(Z = 0) = p_1^2 + … + p_m^2,  P(Z = i) = 2(p_1p_{1+i} + p_2p_{2+i} + ... + p_{m-i}p_m) for i = 1,...,m-1,
b) P(B = 1) = p_1^2 + … + p_m^2,  P(B = √(i^2 + 1)) = 2(p_1p_{1+i} + p_2p_{2+i} + ... + p_{m-i}p_m) for i = 1,...,m-1.

To simplify the presentation below we introduce the following notation for the probabilities occurring in Theorem 1:

q_0 := p_1^2 + … + p_m^2,
q_i := 2(p_1p_{1+i} + p_2p_{2+i} + ... + p_{m-i}p_m) for i = 1,...,m-1.

The next statement holds trivially.

Theorem 2: The expectations of Z and B are given by

a) E(Z) = \sum_{i=1}^{m-1} i q_i,
b) E(B) = \sum_{i=0}^{m-1} \sqrt{i^2 + 1}\, q_i.
Example 2: (a) Assuming that the elements of M are chosen with equal probabilities, i.e. p_1 = … = p_m = 1/m, the preceding theorems yield in particular

P(Z = 0) = q_0 = 1/m;  P(Z = i) = q_i = 2(m - i)/m^2 for i = 1,…,m-1,
P(B = 1) = 1/m;  P(B = √(i^2 + 1)) = 2(m - i)/m^2 for i = 1,...,m-1,

E(Z) = \sum_{i=1}^{m-1} i q_i = \sum_{i=1}^{m-1} i \cdot \frac{2(m-i)}{m^2} = \frac{m^2 - 1}{3m},

E(B) = \sum_{i=0}^{m-1} \sqrt{i^2 + 1}\, q_i = \frac{1}{m} + \frac{2}{m^2} \sum_{i=1}^{m-1} \sqrt{i^2 + 1}\,(m - i).   (6)

This expectation can be approximated by linear terms as

E(B) ≈ 0.31m + 0.46 for 4 ≤ m ≤ 20,
E(B) ≈ m/3 for m > 20.   (7)
It can be easily verified that the relative error satisfies

\frac{|E(B) - 0.31m - 0.46|}{E(B)} ≤ 0.023 for 4 ≤ m ≤ 20,

and for m > 20 the sequence of relative errors is monotonically decreasing, starting with the value 0.021. The underlying idea for obtaining the approximation (7) in the case m > 20 is as follows. The sum in (6) is approximately equal to

\sum_{i=1}^{m-1} i(m - i) = \frac{m^3 - m}{6}.

Thus

E(B) ≈ \frac{1}{m} + \frac{2}{m^2} \cdot \frac{m^3 - m}{6} = \frac{1}{m} + \frac{m}{3} - \frac{1}{3m} ≈ \frac{m}{3}.
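The quality of approximation (7) can be checked numerically. The following small sketch, assuming equal probabilities p_i = 1/m, compares the exact value of (6) with the two linear terms; the function name is ours.

```python
import math

def expected_B_uniform(m):
    """Exact E(B) from Eq. (6) under equal probabilities p_i = 1/m."""
    return 1 / m + (2 / m**2) * sum(math.sqrt(i * i + 1) * (m - i) for i in range(1, m))

for m in (4, 10, 20, 50, 100):
    exact = expected_B_uniform(m)
    approx = 0.31 * m + 0.46 if m <= 20 else m / 3
    print(m, round(exact, 4), round(approx, 4), round(abs(exact - approx) / exact, 3))
```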
(b) For m = 4, p_1 = 1/2, p_2 = 1/4, p_3 = 1/6, p_4 = 1/12 we obtain

q_0 = 25/72, q_1 = 13/36, q_2 = 5/24, q_3 = 1/12,

E(Z) = \sum_{i=1}^{3} i q_i = \frac{13}{36} + 2 \cdot \frac{5}{24} + 3 \cdot \frac{1}{12} = \frac{37}{36} = 1.0278,

E(B) = \sum_{i=0}^{3} \sqrt{i^2 + 1}\, q_i = \frac{25}{72} + \sqrt{2} \cdot \frac{13}{36} + \sqrt{5} \cdot \frac{5}{24} + \sqrt{10} \cdot \frac{1}{12} = 1.5873.
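A small Python sketch of Theorems 1 and 2 (the function names are ours) reproduces the values of Example 2(b):

```python
import math

def q_probs(p):
    """q_0, ..., q_{m-1} of Theorem 1 for a probability vector p = (p_1, ..., p_m)."""
    m = len(p)
    q = [sum(pi * pi for pi in p)]                    # q_0 = p_1^2 + ... + p_m^2
    q += [2 * sum(p[j] * p[j + i] for j in range(m - i)) for i in range(1, m)]
    return q

def expect_Z_B(p):
    """Expectations of Theorem 2: E(Z) = sum i*q_i, E(B) = sum sqrt(i^2+1)*q_i."""
    q = q_probs(p)
    EZ = sum(i * qi for i, qi in enumerate(q))
    EB = sum(math.sqrt(i * i + 1) * qi for i, qi in enumerate(q))
    return EZ, EB

# Example 2(b): m = 4, p = (1/2, 1/4, 1/6, 1/12) gives E(Z) ≈ 1.0278, E(B) ≈ 1.5873.
print(expect_Z_B([1/2, 1/4, 1/6, 1/12]))
```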
By using the simple relation V(X) = E(X^2) - (E(X))^2 we obtain the following formulas for the variances:

Theorem 3:
a) V(Z) = \sum_{i=1}^{m-1} i^2 q_i - \left(\sum_{i=1}^{m-1} i q_i\right)^2,
b) V(B) = \sum_{i=0}^{m-1} (i^2 + 1) q_i - \left(\sum_{i=0}^{m-1} \sqrt{i^2 + 1}\, q_i\right)^2.
Example 3: (a) For the special case of equal probabilities p_i the theorem yields

V(Z) = \sum_{i=1}^{m-1} i^2 \cdot \frac{2(m-i)}{m^2} - \left(\frac{m^2 - 1}{3m}\right)^2 = \frac{m^2 - 1}{6} - \left(\frac{m^2 - 1}{3m}\right)^2 = \frac{m^4 + m^2 - 2}{18m^2},

V(B) = \frac{1}{m} + \frac{2}{m^2} \sum_{i=1}^{m-1} (i^2 + 1)(m - i) - \left(\frac{1}{m} + \frac{2}{m^2} \sum_{i=1}^{m-1} \sqrt{i^2 + 1}\,(m - i)\right)^2
     = \frac{m^2 + 5}{6} - \left(\frac{1}{m} + \frac{2}{m^2} \sum_{i=1}^{m-1} \sqrt{i^2 + 1}\,(m - i)\right)^2.
(b) For the probabilities in Example 2(b) we obtain

V(Z) = q_1 + 4q_2 + 9q_3 - (E(Z))^2 = \frac{13}{36} + 4 \cdot \frac{5}{24} + 9 \cdot \frac{1}{12} - \left(\frac{37}{36}\right)^2 = 1.3395,

V(B) = q_0 + 2q_1 + 5q_2 + 10q_3 - (E(B))^2 = \frac{25}{72} + 2 \cdot \frac{13}{36} + 5 \cdot \frac{5}{24} + 10 \cdot \frac{1}{12} - 1.5873^2 = 0.4249.
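Theorem 3(b) can likewise be evaluated numerically. The following sketch (with a function name of our choosing) reproduces V(B) ≈ 0.4249 for the probabilities of Example 2(b).

```python
import math

def var_B(p):
    """V(B) of Theorem 3(b), computed from the q_i of Theorem 1."""
    m = len(p)
    q = [sum(x * x for x in p)] + [2 * sum(p[j] * p[j + i] for j in range(m - i))
                                   for i in range(1, m)]
    EB = sum(math.sqrt(i * i + 1) * qi for i, qi in enumerate(q))
    EB2 = sum((i * i + 1) * qi for i, qi in enumerate(q))
    return EB2 - EB ** 2

# For the probabilities of Example 2(b) this yields approximately 0.4249.
print(round(var_B([1/2, 1/4, 1/6, 1/12]), 4))
```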
In order to obtain the variance of the arc length (2) we need the following two theorems. Let B_i = \sqrt{(X_i - X_{i+1})^2 + 1} denote the summands in (2) (see also (5)). Since the X_i are assumed to be independent, two of the variables B_i are also independent if they are not directly consecutive. This means e.g. that B_1 and B_j are independent for j ≥ 3, but the variables B_1 = \sqrt{(X_1 - X_2)^2 + 1} and B_2 = \sqrt{(X_2 - X_3)^2 + 1} are not independent. This can be made clear as follows. Assume for example that m = 5, i.e. the values of the X_i are chosen from M = {1,…,5}. Then the maximal length of a line segment of the arc is \sqrt{(1 - 5)^2 + 1} = √17 (see Fig. 1). But if e.g. the second element of the sequence (x_1,...,x_n) is x_2 = 3 (the end height of the first segment), then the length of the second segment is at most \max_{i \in M} \sqrt{(3 - i)^2 + 1} = √5. Hence the first line segment of the arc prevents the second one from attaining the maximal length, i.e. there is a dependency between these segments.
Theorem 4: Consider the random variables B_1 and B_2 above, representing any two consecutive summands in (2). Then the expectation of the product of B_1 and B_2 is

E(B_1 B_2) = \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{m} p_i p_j p_k \sqrt{(i - j)^2 + 1}\, \sqrt{(j - k)^2 + 1}.

The proof follows from the definition of the expectation, since any element of the triple sum represents a product of a value of the random variable B_1 B_2 and its probability (where the values of B_1 B_2 need not be distinct). The last formula cannot be simplified, not even for equal p_i. The triple sum consists of m^3 summands, and for not too large values of m it can be calculated by means of suitable software (a small computational sketch is given below).

Example 4: For m = 2, p_1 = 3/5 and p_2 = 2/5 we get

E(B_1 B_2) = p_1p_1p_1 \cdot 1 \cdot 1 + p_1p_1p_2 \cdot 1 \cdot \sqrt{2} + p_1p_2p_1 \cdot \sqrt{2} \cdot \sqrt{2} + p_1p_2p_2 \cdot \sqrt{2} \cdot 1 + p_2p_1p_1 \cdot \sqrt{2} \cdot 1 + p_2p_1p_2 \cdot \sqrt{2} \cdot \sqrt{2} + p_2p_2p_1 \cdot 1 \cdot \sqrt{2} + p_2p_2p_2 \cdot 1 \cdot 1
= \left(\frac{3}{5}\right)^3 + \left(\frac{3}{5}\right)^2 \frac{2}{5}\sqrt{2} + \left(\frac{3}{5}\right)^2 \frac{2}{5} \cdot 2 + \frac{3}{5}\left(\frac{2}{5}\right)^2 \sqrt{2} + \left(\frac{3}{5}\right)^2 \frac{2}{5}\sqrt{2} + \frac{3}{5}\left(\frac{2}{5}\right)^2 \cdot 2 + \frac{3}{5}\left(\frac{2}{5}\right)^2 \sqrt{2} + \left(\frac{2}{5}\right)^3 = 1.4388.
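Such a computation of the triple sum might look as follows in Python (the function name is ours); it reproduces the value of Example 4.

```python
import math

def expect_B1B2(p):
    """Triple sum of Theorem 4: E(B1*B2) for two consecutive summands of the arc length."""
    m = len(p)
    return sum(p[i] * p[j] * p[k]
               * math.sqrt((i - j) ** 2 + 1) * math.sqrt((j - k) ** 2 + 1)
               for i in range(m) for j in range(m) for k in range(m))

# Example 4: m = 2, p = (3/5, 2/5) gives E(B1*B2) ≈ 1.4388.
print(round(expect_B1B2([3/5, 2/5]), 4))
```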
By using Theorem 2(b) we now obtain the following statement.

Theorem 5: The covariance of B_1 and B_2 is

Cov(B_1, B_2) = E(B_1 B_2) - E(B_1)E(B_2)
= \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{m} p_i p_j p_k \sqrt{(i - j)^2 + 1}\, \sqrt{(j - k)^2 + 1} - \left(\sum_{i=0}^{m-1} \sqrt{i^2 + 1}\, q_i\right)^2.
Example 5: Let m = 3, p_1 = 2/10, p_2 = 3/10, p_3 = 5/10, hence q_0 = 19/50, q_1 = 21/50, q_2 = 1/5. As illustrated in the above examples we obtain E(B_1 B_2) ≈ 2.0468 and E(B_1) = E(B_2) ≈ 1.4212, thus Cov(B_1, B_2) ≈ 0.0270. Since the covariance is not zero, we have in particular verified that B_1 and B_2 are dependent variables.

We are now able to determine some characteristics of the arc length L in (2). From the additivity of the expectation it follows:

Theorem 6: The expectation of the arc length is

E(L) = (n - 1) E(B) = (n - 1) \sum_{i=0}^{m-1} \sqrt{i^2 + 1}\, q_i.
This means in particular that E(L) increases linearly with n and that the relative arc length E(L)/(n - 1) is an increasing function of the inventory size m. For equal probabilities p_i we obtain (see Example 2(a))

E(L) = (n - 1) E(B) = (n - 1) \left(\frac{1}{m} + \frac{2}{m^2} \sum_{i=1}^{m-1} \sqrt{i^2 + 1}\,(m - i)\right)   (8)

or the approximate relation

E(L) ≈ (n - 1)(0.31m + 0.46) for 4 ≤ m ≤ 20,
E(L) ≈ (n - 1) m/3 for m > 20.   (9)
An additivity law corresponding to Theorem 6 holds for the variance only when a sum of independent random variables is given. As we have seen above, this is not the case for the arc length. Therefore we make use of the following fact to determine V(L) (see e.g. Ross (2007, p. 54)).

Theorem 7: The variance of a sum of arbitrary random variables Y_1 + … + Y_n is given by

V(Y_1 + … + Y_n) = \sum_{i=1}^{n} V(Y_i) + 2 \sum_{1 \le i < j \le n} Cov(Y_i, Y_j).

From this we obtain:

Theorem 8: The variance of the arc length is

V(L) = (n - 1)V(B) + 2(n - 2)Cov(B_1, B_2),

where V(B) and Cov(B_1, B_2) are given by Theorems 3(b) and 5, respectively.

Proof: From Theorem 7 it follows that

V(L) = V(B_1 + … + B_{n-1}) = V(B_1) + … + V(B_{n-1}) + 2 \sum_{i=1}^{n-2} Cov(B_i, B_{i+1}),
because the other covariances are zero. Since the last sum has (n - 2) elements, the statement follows.

Theorem 9: The variance of the arc length can alternatively be computed as

V(L) = (n - 1)E(B^2) + 2(n - 2)E(B_1B_2) + (5 - 3n)E(B)^2
= (n - 1) \sum_{i=0}^{m-1} (i^2 + 1)q_i + 2(n - 2) \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{m} p_i p_j p_k \sqrt{(i - j)^2 + 1}\, \sqrt{(j - k)^2 + 1} + (5 - 3n) \left(\sum_{i=0}^{m-1} \sqrt{i^2 + 1}\, q_i\right)^2.
Proof: From Theorem 8 we obtain

V(L) = (n - 1)[E(B^2) - E(B)^2] + 2(n - 2)[E(B_1B_2) - E(B)^2] = (n - 1)E(B^2) + 2(n - 2)E(B_1B_2) + (5 - 3n)E(B)^2.

Calculating E(B^2), E(B_1B_2) and E(B) by means of Theorems 2-4 yields the statement.

We will now consider a concrete linguistic application.

Example 6: Consider the sequence of syllabic lengths of the verses in the poem Der Erlkönig by Goethe:

(x_1,…,x_32) = (8,7,8,8,9,6,6,6,7,7,7,6,8,5,6,6,7,6,6,8,9,5,8,7,8,9,9,6,6,7,7,7)   (10)

(see Popescu et al. (2013, p. 52)). The corresponding observed arc length is L ≈ 50.63. We consider the probabilistic model in which a random variable X_i assumes the values 1,…,9 with probabilities corresponding to the observed relative frequencies in (10), i.e. (p_1,…,p_9) = (0, 0, 0, 0, 2/32, 10/32, 9/32, 7/32, 4/32) and m = 9. From the definition of the numbers q_i we obtain (q_0,…,q_8) = (125/512, 201/512, 31/128, 27/256, 1/64, 0, 0, 0, 0). Hence

E(B) = \sum_{i=0}^{8} \sqrt{i^2 + 1}\, q_i ≈ 1.7388,  E(B^2) = \sum_{i=0}^{8} (i^2 + 1) q_i ≈ 3.5605,

E(B_1B_2) = \sum_{i=1}^{9} \sum_{j=1}^{9} \sum_{k=1}^{9} p_i p_j p_k \sqrt{(i - j)^2 + 1}\, \sqrt{(j - k)^2 + 1} ≈ 3.1115,

yielding

E(L) = (n - 1) E(B) = 31 \cdot 1.7388 ≈ 53.90,
V(L) = (n - 1)E(B^2) + 2(n - 2)E(B_1B_2) + (5 - 3n)E(B)^2 = 31 \cdot 3.5605 + 60 \cdot 3.1115 - 91 \cdot 1.7388^2 ≈ 21.93.

Thus in this example the observed arc length is smaller than expected under the probability model; however, the absolute deviation between the observed and the expected arc length is less than the standard deviation σ = √V(L) ≈ 4.68.
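The computations of Example 6 can be reproduced with a short, self-contained Python sketch; the variable names are illustrative only.

```python
import math

# Re-computing Example 6 (Der Erlkönig) under the model of Section 2.
verses = [8, 7, 8, 8, 9, 6, 6, 6, 7, 7, 7, 6, 8, 5, 6, 6, 7, 6, 6, 8,
          9, 5, 8, 7, 8, 9, 9, 6, 6, 7, 7, 7]
n, m = len(verses), 9
p = [verses.count(v) / n for v in range(1, m + 1)]        # observed relative frequencies

L_obs = sum(math.sqrt((a - b) ** 2 + 1) for a, b in zip(verses, verses[1:]))   # ≈ 50.63

q = [sum(x * x for x in p)] + [2 * sum(p[j] * p[j + i] for j in range(m - i))
                               for i in range(1, m)]
EB = sum(math.sqrt(i * i + 1) * qi for i, qi in enumerate(q))                  # ≈ 1.7388
EB2 = sum((i * i + 1) * qi for i, qi in enumerate(q))                          # ≈ 3.5605
EB1B2 = sum(p[i] * p[j] * p[k] * math.sqrt((i - j) ** 2 + 1) * math.sqrt((j - k) ** 2 + 1)
            for i in range(m) for j in range(m) for k in range(m))             # ≈ 3.1115

EL = (n - 1) * EB                                                              # ≈ 53.90
VL = (n - 1) * EB2 + 2 * (n - 2) * EB1B2 + (5 - 3 * n) * EB ** 2               # ≈ 21.93
print(round(L_obs, 2), round(EL, 2), round(VL, 2), round((L_obs - EL) / math.sqrt(VL), 2))
```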
3 Applications

We study the sequences (x_1,…,x_n) determined for 32 texts from 14 different languages (see Table 1), where x_i represents the length of the i-th word in the text. The texts are the same as in Zörnig (2013, Section 4, Tab. 1-9). The first column of Table 1 indicates the text, e.g. number 3a indicates the third text in Table 3 of the aforementioned article. For each sequence the observed arc length L was determined according to (1) (see column 2 of Table 1).

Table 1: Characteristics of the arc length for 32 texts
Text                                      L        Lmin     Lmax     E(L)     V(L)   (L-E(L))/√V(L)     n   L/(n-1)   m
1a) Bulgarian, N. Ostrovskij           1644.85   927.07  1994.13  1571.16   748.80       2.6930       926   1.7782   6
1b) Hungarian, A nominalizmus...       2841.26  1316.31  3679.39  2792.90  2621.41       0.9446      1314   2.1639   9
1c) Hungarian, Kunczekolbász           1016.71   460.31  1244.57   956.70   920.81       1.9775       458   2.2247   9
1d) Macedonian, N. Ostrovskij          2251.33  1124.07  2719.36  2101.74  1413.60       3.9787      1123   2.0065   6
2a) Romanian, O. Paler                 1681.01   892.49  2008.51  1591.80  1065.11       2.7337       891   1.8888   7
2b) Romanian, N. Steinhardt            2718.28  1512.49  3361.98  2685.09  1903.83       0.7606      1511   1.8002   7
2c) Russian, N. Ostrovskij             1319.67   793.49  1683.19  1339.49   726.42      -0.7354       792   1.6684   7
2d) Serbian, N. Ostrovskij             1703.14  1002.07  2075.82  1651.02   808.56       1.8328      1001   1.7031   6
3a) Slovak, Bachletová                 1445.03   876.23  1783.55  1448.23   733.63      -0.1182       873   1.6571   6
3b) Slovak, Bachletová                 1655.59   925.49  2139.51  1675.45   939.62      -0.6479       924   1.7937   7
3c) Slovenian, N. Ostrovskij           1556.70   978.07  1923.95  1566.72   694.73      -0.3798       977   1.5950   6
4a) Sundanese, Aki Satimi              2011.23  1283.66  2318.10  1937.87   652.49       2.8720      1183   1.5688   5
4b) Sundanese, Agustusan                664.27   417.07   761.00   637.63   251.58       1.6795       416   1.6007   6
4c) Indonesian, Pengurus                573.80   346.07   664.96   535.43   176.88       0.1784       345   1.6680   6
4d) Indonesian, Sekolah ditutup         456.16   281.07   551.39   446.27   189.24       0.7194       280   1.6350   6
5a) Bamana, Masadennin                 4057.15  2617.90  4559.65  4059.86  3270.44      -0.0474      2616   1.5515   8
5b) Bamana, Sonsanin                   3617.14  2394.49  4015.23  3585.09  2440.98       0.6488      2393   1.5122   7
5c) Bamana, Namakɔrɔba                 1893.39  1407.66  2123.92  1913.07   663.41      -0.7641      1407   1.3467   5
5d) Bamana, Bamak’ sigicoya            1739.97  1139.07  1966.32  1719.04   912.12       0.6930      1138   1.5303   6
6a) Vai, Mu ja vaa I                   4079.90  3140.66  4579.51  4083.40   772.52      -0.1259      3140   1.2997   5
6b) Vai, Sa’bu Mu’a’…                   631.40   495.24   719.79   632.45    95.25      -0.1081       495   1.2781   4
6c) Vai, Vande bɛ Wu’u                  571.29   426.24   612.39   552.79   105.29       1.8034       426   1.3442   4
7a) Vai, Siika                         2950.33  1663.07  3492.93  2817.18  1697.88       3.2314      1662   1.7762   6
7b) Tagalog, Rosales                   3794.27  1959.90  4309.66  3460.82  2393.52       6.8158      1958   1.9388   8
7c) Tagalog, Hernandez                 3238.09  1739.90  3740.27  3011.21  1946.99       5.1416      1738   1.8642   8
7d) Tagalog, Hernandez                 2838.63  1467.90  3255.29  2594.65  1735.92       5.8558      1466   1.9376   8
8a) Romanian, Popescu                  1658.32  1003.07  1947.99  1583.01   742.53       2.7640      1002   1.6567   6
8b) German, Assads Familiendiktatur    2587.27  1417.73  3228.75  2574.53  2515.87       0.2539      1415   1.8289  10
8c) German, ATT00012                   2153.84  1448.31  2652.87  2130.62  2228.24       0.4917      1146   1.8811   9
9a) German, Die Stadt des Schweigens   2871.06  1569.73  3502.33  2817.12  2938.33       0.9952      1567   1.8334  10
9b) German, Terror in Ost-Timor        2475.40  1400.31  2972.81  2418.87  2074.72       1.2412      1398   1.7719   9
9c) German, Unter Hackern...           2558.37  1365.31  3147.71  2529.07  2613.70       0.5731      1363   1.8784   9
Columns 3 and 4 contain the minimal and the approximate maximal arc length Lmin and Lmax that could be achieved by permuting the respective sequence (see Popescu et al. (2013, Section 2)). Columns 5 and 6 contain the expectation and the variance of the arc length when it is considered as a random variable as in Section 2. In this context the probabilities are set equal to the observed relative frequencies in the abstract text. For example, the sequence (x_1,...,x_n) of the first text in Table 1 has length n = 926 and consists of m = 6 different elements chosen from M = {1,...,6}. The frequency of occurrence of the element i in the sequence is k_i, where (k_1,...,k_6) = (336, 269, 213, 78, 27, 3), see Zörnig (2013, p. 57). Thus the probabilities in the probabilistic model are assumed to be p_i = k_i/n for i = 1,…,6 (see also Example 6). Column 7 contains the values of the standardized arc length defined by

L_st = \frac{L - E(L)}{\sqrt{V(L)}},

which is calculated in order to make the observed arc lengths comparable (see e.g. Ross (2007, p. 38) for this concept). For example, for the fourth text we found L_st ≈ 3.98. This means that the observed arc length is equal to the expected one plus 3.98 times the standard deviation σ, i.e. the observed arc length is significantly larger than in the random model. Columns 8 to 10 of Table 1 contain the sequence length n, the relative arc length L_rel = L/(n - 1), and the inventory size m.
It can be observed that the standardized arc length varies between -0.7641 (text 5c: Bamana) and 6.8158 (text 7b: Tagalog). In particular, the latter text is characterized by an extremely high arc length, which corresponds to a very frequent change between different entities. The results, which should be validated for further texts, can be used e.g. for typological purposes. The resulting quantities are not isolated entities; their relations to other text properties must still be investigated in order to incorporate them into synergetic control cycles.

Comparing the observed arc length L with the sequence length n, one can easily detect the linear correlation illustrated graphically in Fig. 2. By means of a simple linear regression we found the relation L = 254.58 + 1.5023n, corresponding to the straight line. As a measure for the goodness of fit we calculate the coefficient of determination (see Zörnig (2013, p. 56)), defined by

R^2 = 1 - \frac{\sum (254.58 + 1.5023n - L(n))^2}{\sum (L(n) - L_{mean})^2},

where L(n) is the observed arc length corresponding to the sequence length n and L_mean is the mean of the 32 observed L-values. The summation runs over all 32 values of n in Table 1. The above formula yields R^2 = 0.9082. Since R^2 > 0.9, the fit can be considered good.

Finally, Fig. 3 illustrates graphically the relation between the inventory size m and the observed relative arc length L_rel = L/(n - 1). Though different values of L_rel can belong to the same inventory size m, there seems to be a slight tendency for L_rel to increase with m. This observation is in accordance with the relations following from the probability model (see the remark after Theorem 6). The simple linear regression yields L_rel = 1.0192 + 0.1012m, which corresponds to the straight line. The coefficient of determination is R^2 = 0.5267.
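A minimal sketch of such a regression and of the coefficient of determination is given below; the data lists and the example call are placeholders, not the original data handling.

```python
def fit_and_r2(ns, Ls):
    """Least-squares line L = a + b*n and the coefficient of determination R^2
    as defined in the text."""
    k = len(ns)
    n_mean, L_mean = sum(ns) / k, sum(Ls) / k
    b = (sum((x - n_mean) * (y - L_mean) for x, y in zip(ns, Ls))
         / sum((x - n_mean) ** 2 for x in ns))
    a = L_mean - b * n_mean
    ss_res = sum((a + b * x - y) ** 2 for x, y in zip(ns, Ls))
    ss_tot = sum((y - L_mean) ** 2 for y in Ls)
    return a, b, 1 - ss_res / ss_tot

# e.g. fit_and_r2([926, 1314, 458], [1644.85, 2841.26, 1016.71]) for the first three texts
```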
Fig. 2: Linear regression
Fig. 3: Relative arc length in dependence on the inventory size
4 Simulating the distribution of the arc length

In Section 2 we derived formulas for the two most important characteristics of the random variable arc length (2), namely expectation and variance. It would be desirable to also have a formula for the probability mass function (pmf) available, but this is virtually impossible in view of the long sequences that may arise in linguistics. A feasible approach to obtain more information about the pmf is simulation. The simple procedure, which can be performed with any statistical software, is as follows (a minimal sketch is given below): Generate k random sequences (x_1,…,x_n) with elements from a set M = {1,…,m}, such that i is chosen with probability p_i (p_1 + … + p_m = 1). For each sequence calculate the arc length L according to formula (2) and store the k values in a list. This list is an "approximation" of the probability distribution of L with the specified p_i.

For two of the texts in Table 1, the histograms of the lists of L-values are shown in Fig. 4 (as before, the probabilities have been set equal to the observed relative frequencies in the respective texts). Each of the lists consisted of k = 5000 values, and it can be observed that the distributions are similar to normal distributions, indicated by continuous lines in the histograms. For most texts of Table 1, the statistical hypothesis that the list of L-values originates from a normal distribution cannot be rejected by means of the Shapiro-Wilk normality test.
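The following Python sketch illustrates this simulation procedure; the function name, the number of replications and the random seed are illustrative assumptions only.

```python
import math
import random

def simulate_arc_lengths(p, n, k=5000, seed=0):
    """Simulate k realizations of the arc length (2) for i.i.d. X_i with P(X = i) = p[i-1]."""
    rng = random.Random(seed)
    values = list(range(1, len(p) + 1))
    sims = []
    for _ in range(k):
        xs = rng.choices(values, weights=p, k=n)
        sims.append(sum(math.sqrt((a - b) ** 2 + 1) for a, b in zip(xs, xs[1:])))
    return sims

# The empirical mean and variance of simulate_arc_lengths(p, n) can be compared with
# E(L) and V(L) from Theorems 6 and 8; a Shapiro-Wilk test (e.g. scipy.stats.shapiro)
# can be applied to the simulated values.
```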
Fig. 4: Histograms for the distribution of the arc length.
5 Concluding remarks

It is possible to consider generalizations or variations of the concept of "arc length". By using the Minkowski distance we obtain a more general concept by defining

L = \sum_{i=1}^{n-1} \left(|X_i - X_{i+1}|^p + 1\right)^{1/p} for p ≥ 1,   (11)

which reduces to (2) for p = 2. One could also develop a continuous model for the arc length by assuming that the variables X_i in (2) are distributed according to a given probability density.
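For illustration, the generalized arc length (11) can be computed as follows; the function name is ours, and p = 2 recovers the Euclidean value of Eq. (1).

```python
# A minimal sketch of the Minkowski-type generalization (11).
def arc_length_minkowski(xs, p=2.0):
    return sum((abs(a - b) ** p + 1) ** (1 / p) for a, b in zip(xs, xs[1:]))

print(round(arc_length_minkowski([4, 1, 2, 5, 2, 3], p=2), 2))  # ≈ 12.32 (cf. Fig. 1)
print(round(arc_length_minkowski([4, 1, 2, 5, 2, 3], p=1), 2))  # city-block variant: 16.0
```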
References

Popescu, Ioan-Iovitz, Ján Mačutek & Gabriel Altmann. 2009. Aspects of word frequencies. Lüdenscheid: RAM-Verlag.
Popescu, Ioan-Iovitz, Emmerich Kelih, Ján Mačutek, Radek Čech, Karl-Heinz Best & Gabriel Altmann. 2010. Vectors and codes of text. Lüdenscheid: RAM-Verlag.
Popescu, Ioan-Iovitz, Radek Čech & Gabriel Altmann. 2011. The lambda-structure of texts. Lüdenscheid: RAM-Verlag.
Popescu, Ioan-Iovitz, Peter Zörnig, Peter Grzybek, Sven Naumann & Gabriel Altmann. 2013. Some statistics for sequential text properties. Glottometrics 26. 50–99.
Ross, Sheldon M. 2007. Introduction to probability models, 9th edn. Amsterdam: Elsevier.
Wälchli, Bernhard. 2011. Quantifying inner forms: A study in morphosemantics. Working paper 46. Bern: University of Bern, Institute for Linguistics. http://www.isw.unibe.ch (accessed December 2013).
Zörnig, Peter. 2013. A continuous model for the distances between coextensive words in a text. Glottometrics 25. 54–68.
Subject Index accuracy, 188, 202, 206, 212 adjectives, 4, 17, 160 adverbs, 17, 42 affix, 2 ANOVA, 109, 110 arc, 20 – length, 4, 84, 231, 232, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245 associative symmetric sums, 10 author, 77, 78, 94, 149, 152, 157, 165, 205 – conscious control, 77 – subconscious control, 77 authorship attribution, 28, 72, 110, 130 autocorrelation, 4, 35, 43, 231 – function. See function – index, 38, 39, 40, 41, 42, 45, 46, 51, 52 – negative, 39, 44 – positive, 41, 44, 48 – semantic, 50 – textual, 35, 38, 39, 46, 47, 48, 49, 54 automated similarity judgment program (ASJP), 172, 176, 177, 178, 179, 180, 183, 189, 193 autoregressive integrated moving average (ARIMA) model, 152, 161 autoregressive–moving-average (ARMA) model, 152, 169 axioms, 9, 10, 14 – commutativity, 11 Bayes classifier, 203, 204, 212 binary coding, 148, 162 Bonferroni correction, 80, 187 borderline conditions, 9 boundary conditions, 2, 5, 72 centroid, 44 chapters, 1, 8, 26, 27, 28, 29, 32, 75, 76, 147, 149, 154, 155, 157, 158, 159, 161, 163, 164, 165 – length, 72, 73, 74, 76, 148, 150 characteristics, 8, 9, 11, 13, 14, 17, 18, 21, 22, 23, 28, 32 – consistency, 8, 12, 13, 14, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 28, 29, 32
– generalized, 9, 10, 11, 12, 13, 14, 21, 22, 26, 29, 32 – part of speech, 16, 17, 20, 25, 26, 27, 28 – partial, 9, 10, 11, 12, 13, 14, 18, 21, 23, 32 – rhymed, 17, 20, 23, 25, 30 – unrhymed, 18, 23, 24, 25, 28 – weighted sum, 11 – weights, 11, 12, 13, 18, 19, 21, 28 characters, 36, 39, 41, 91, 178, 202 – frequency, 220 chi-square, 40, 41, 42, 43, 45, 46, 54, 94, 100 Chomsky, Noam, 67 clauses, 58, 59, 60, 95 – length, 58, 59, 60, 61 coefficient of variation, 157 cognacy judgments, 171 cognates, 171, 172 – automatic identification, 173, 174, 175 comparative method, 171 conditional random fields (CRF) model, 205, 206, 207, 208, 210, 211, 212, 213 confidence intervals, 61, 65, 136, 139, 153 confusion matrix, 212 control cycle, 2, 3, 4 corpus, 49, 112, 126, 176, 208 – Brown, 50 – Potsdam, 98 – reference, 50 correlation, 13, 30, 148, 161, 186, 209 – coefficient, 13, 17 – coefficient of determination, 61, 62, 63, 64, 65, 127, 242 – negative, 161, 163 – Pearson’s r, 17, 185 – point biserial, 185 – Spearman’s ρ, 186 correspondence analysis, 46 cross entropy, 184 cross validation, 212 dependent variable, 204 dictionary, 2, 90, 91 dimensionality, 11 dispersion, 157, 158
248 Subject Index distribution – Bose, 217, 218 – cumulative, 219 – distances, 1 – empirical, 94 – frequency, 95, 130, 134, 144 – hyper-binomial, 98 – hyper-Pascal, 96 – hyper-Poisson, 96, 97 – mixed negative binomial, 99 – motif length, 130 – motifs, 96, 98, 133, 134, 137, 138, 139 – negative binomial, 99 – normal, 39 – nouns, 46 – part of speech, 28 – probability, 35, 89, 94, 95, 111 – rank-frequency, 92, 93, 100, 219 – right truncated Zipf-Alexeev, 135 – stationary, 35, 36, 41, 44, 47 – verbs, 46 – verse length, 148 – word frequency, 219 – word length, 96, 128, 130, 133, 134, 135, 144 – Zipf, 182 – Zipf-Mandelbrot, 92, 94, 95, 100, 133 document models – bag-of-words, 35, 37, 38, 54, 204, 207 – random, 57, 58, 59, 65 – sequential, 35, 54 document weights, 44, 45, 47 Dryer, Matthew S., 191 Durbin-Watson statistic, 42 edges, 20, 22 Ethnologue, 176, 177, 179, 180, 181, 184, 185, 186 Euclidean – dissimilarities, 38, 39, 40, 51, 54 – distances, 51, 71 expectation–maximization (EM) algorithm, 173 false discovery rate (FDR), 187 Flesh formula, 110 Fourier series, 4 fractal, 4, 57, 231 function, 46
– autocorrelation (ACF), 153, 161, 162, 163, 164, 168, 170 – autocovariance. See time series – continuous, 9 – cumulative distribution, 153 – geometric, 96 – logistic growth, 114 – partial autocorrelation (PACF), 152, 161, 170 – periodical, 152 – power, 127, 130 – recurrence, 1 – symmetric, 9 – trend, 153 fuzzy graph, 20, 21, 22, 23 fuzzy quantifiers, 12 gender, 7, 42, 43, 44 generalized estimation, 9, 11, 12, 13, 16, 26 goodness of fit, 61, 62, 63, 242 hapax legomena, 218, 219 Haspelmath, Martin, 191 hidden conditional random fields (HCRF), 206 hidden Markov models (HMM), 206, 207 Hurst’s coefficient, 4 hyperlinks, 46, 47, 49, 54 hyponymy, 50 interjections, 42 inter-language distance, 172, 173 intuition, 2, 203 KL-divergence, 184 lambda indicator, 71, 72, 73, 74, 75, 76, 77, 78, 79, 83, 84, 85 language change, 171 language family, 171, 178, 179, 184, 185 – Dravidian, 176, 180, 181 – Indo-Aryan, 176 – Indo-European, 3, 171 – Mayan, 176 – Salishan, 174 languages – Arabic, 134, 141 – artificial, 178, 218 – Chinese, 91 – Czech (modern), 77, 78, 79, 80, 82, 84, 134, 141, 144, 163 – Czech (old), 147, 149, 161, 163 – English, 77, 78, 79, 81, 82, 83, 85, 110, 152, 171, 178, 205
Subject Index 249 – French, 49, 148 – German, 171 – Greek (ancient), 150 – Greek (modern), 126 – Italian, 93, 99, 100 – Latin, 148, 149, 150, 152 – Polish, 148, 152 – Russian, 148 letters, 90, 155, 157 Levenshtein distance, 172, 175, 176 lexical series, 152 lexicostatistics, 171, 172 linear regression, 155, 242 lines, 42, 44, 157, 159, 161, 163 – endings, 160 – length, 155, 157, 165 linguistic laws, 57, 58, 77, 92, 111, 150, 231 – Menzerath-Altmann, 57, 58, 61, 62, 63, 65, 95, 125, 126, 127, 128, 130 – Piotrowski-Altmann, 112 – Zipf, 59 – Zipf-Mandelbrot, 125 linguistics – computational, 111 – computational historical, 171 – corpus, 111 – historical, 171, 173, 176 – quantitative, 231 – synergetic, 4, 6, 96, 150 majority voting, 208, 211 Mandelbrot, Benoit, 67 Markov chain, 37, 39, 47, 49, 231 Markov transition matrix, 35, 36 Markov transition probability, 41, 231 Menzerath, Paul, 67 Miller, George A., 67 Minkowski distance, 245 Mitzenmacher, Michael, 67 Moran's I index of spatial statistics, 39 morphemes, 2, 125, 160 motifs, 4, 89, 90, 91, 95, 97, 107, 109, 125, 129, 139 – categorical, 98 – D-motif, 98, 107 – F-motif, 90, 144 – F-motiv, 89 – frequency, 125, 136, 137
– granularity, 91 – L-motif, 90, 92, 93, 96, 97, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145 – P-motif, 90 – rank-frequency relation (RFR), 133, 135 – R-motif, 97, 98, 99, 100, 107 – T-motif, 90 – word length, 125, 126, 127, 129, 130 moving average model, 152 multidimensional scaling (MDS), 51 multivariate estimation, 8 navigation – free, 37, 38, 40, 44, 45 – hypertext, 41, 46, 47, 48, 49, 54 – linear periodic, 40, 41 – textual, 36, 44 neighbor-joining (NJ) algorithm, 172, 174, 176 neighbourhoodness, 35, 37, 54 n-grams, 181, 182, 183 – bigrams, 182, 183 – character, 176, 182, 183 – frequency, 151 – relative frequency, 183 – trigrams, 39, 41, 182 nodes, 20, 21, 22, 23, 203 nouns, 16, 17, 46, 50, 51, 160 null hypothesis, 39, 41, 80, 134, 138, 144, 187, 188 oscillations, 1, 4, 152, 231 PageRank algorithm, 47 parts-of-speech (POS) tags, 42, 44 passages, 42, 149, 153, 157, 158, 159 permutations, 41 – test, 41, 45, 48, 49 phonemes, 2, 175, 183 polysemy, 2 polytextuality, 2, 91, 96 positional weights, 35 predicate, 59 probabilistic latent semantic analysis (PLSA), 203 probability, 35, 44, 47, 50, 51, 55, 207 – conditional, 208 process – autoregressive, 169 – moving average, 169
250 Subject Index – random, 169 pronouns, 17, 160 proto-language, 171 pseudotext, 134, 135, 136, 137, 138 punctuation marks, 218 quantity (syllabic), 154, 162, 163, 164, 165 random variable, 152, 153, 168, 232, 237, 239, 241, 244 rhyme, 28, 29, 158, 159, 165 – frequency, 159, 160 – homogeneity, 158, 159 – inventory, 158, 159 – pairs, 148, 158 – repertory, 158 rhythm, 4, 29, 154, 161, 165 runs test, 148, 154 sales trends, 201, 204, 209, 213 screeplot, 52, 53 semantic similarities, 50 sentences, 1, 2, 3, 57, 58, 59, 60, 61, 67, 68, 90, 96, 206 – bag-of-features representation, 206 – length, 59, 95, 110, 231 – polarity, 205, 206, 207, 208, 209, 211, 213 – polarity reversal, 206 sentiment analysis, 201, 203, 205, 206, 209 sequences, 1, 2, 3, 92, 99, 151, 152, 153 – aphanumeric, 218 – chapters, 149, 155, 156, 157, 158, 160, 161, 163 – distances, 4 – graphemes, 57 – historical data, 204, 210 – letters, 59 – linear, 1, 109 – lines, 161, 162 – mathematical, 1 – motifs, 94 – multidimensional, 4 – musical, 4 – numerical, 4 – part of speech, 4, 99 – phonemes, 183 – rank-frequency, 232 – relative errors, 235 – sentences, 206, 208 – syllable lengths, 239
– syllables, 148, 161, 163, 164 – symbolic, 4 – symbols, 4 – texts, 112 – tweets, 211 – word length, 133 – words, 59 Shapiro-Wilk normality test, 244 similarity models, 8, 9, 12, 13, 14, 16, 20, 22, 23, 24, 25, 26, 27, 28, 32 social media, 201, 202, 203, 204, 209 – blogs, 202, 203, 209 – Twitter, 201, 202, 209, 210, 211 social networks – analysis, 201 – characteristics, 203, 204 spatial analysis, 35 standard deviation, 155, 157, 184, 239, 242 stress, 154, 162, 163, 164, 165, 178 string similarity, 171, 173, 175, 177, 181, 184, 185, 186, 187 style development, 29 subgraphs, 20, 21 support vector machines (SVM), 204, 207 syllables, 2, 4, 90, 91, 92, 125, 126, 134, 154, 155, 156, 157, 161, 165 – coding, 155 – long, 154 – short, 154 – stressed, 110, 154, 163 – unstressed, 154, 163 synonyms, 2 synsets, 50 syntactic complexity, 91 syntagmatic relations, 133, 145 syntagmatic time, 151 syntax, 29 term-document matrix, 44, 45, 46, 54 terms, 3, 46, 49 – frequency, 152 – magnification factor, 48 – weights, 45 text – categorization, 182 – classification, 7, 32, 91, 92 – entities, 4 – frequency structure, 71, 77, 78, 84, 217
Subject Index 251 – genre, 71, 72, 79, 130, 163, 224, 228 – length, 71, 72, 73, 77, 78, 79, 83, 84, 85, 110, 111, 221, 224, 225, 227 – production, 224 – randomization, 59, 128, 129, 135, 136, 137, 138 – topic, 203, 208, 209 – trajectories, 227 textual neighbourhoods, 35 textual series, 151 time series, 35, 109, 148, 151, 152, 154, 161, 168, 169, 203, 213, 231 – autocovariance, 168, 169 – variance, 168 tokens, 49, 50, 109, 134, 138, 227 translations, 8, 13, 26, 27, 28, 29, 30, 32, 217, 219, 220, 221, 223, 224, 226, 227, 228 translator, 224, 228 tree inference algorithm, 173, 186 trend analysis, 148, 154 tweets, 201, 202, 204, 208, 209, 212 types, 35, 36, 39, 40, 41, 45, 49, 54, 84, 109, 133, 218 type-token ratio, 71, 72, 109, 110, 158 undirected graph, 207 – fuzzy, 20, 23
unweighted pair group method with arithmetic mean (UPGMA), 172 u-test, 80 verbs, 4, 16, 17, 42, 44, 46, 50, 51, 53, 54, 160 verse, 28, 29, 147, 149, 153, 157, 159, 165, 239 – length, 148, 150, 153, 154, 155, 156, 161, 165 vertices, 20 vocabulary "richness", 71, 109 word length, 1, 90, 91, 110, 125, 126, 127, 129, 133, 231 word lists, 171, 172, 173, 175, 176, 178, 179, 188, 191 WordNet, 50, 51, 53 words, 36, 155, 157 – autosemantic, 219 – content, 114, 116, 117, 118 – frequency, 112, 113, 125, 218, 220, 231 – function, 114 – ordering, 205 – orthographical, 218 – repetition, 3, 77 – senses, 205, 207 – unrhymed, 28, 33 – valence reversers, 206 – zero syllabic, 134 world atlas of language structures (WALS), 177, 179, 180, 184, 185
Authors Index Albert, Réka, 59, 66 Altmann, Gabriel, 1, 5, 6, 28, 29, 33, 57, 66, 71, 72, 86, 87, 95, 96, 108, 110, 113, 122, 123, 125, 130, 131, 219, 229 Andreev, Sergey, 7, 28, 33 Andres, Jan, 57, 61, 66 Anselin, Luc, 35, 55 Antić, Gordana, 134, 145 Argamon, Shlomo, 28, 33 Asur, Sitaram, 202, 203, 213, 215 Atkinson, Quentin D., 171, 190 Baixeries, Jaume, 57, 66 Bakker, Dik, 172, 190, 191, 193 Barabási, Albert-László, 59, 66 Bavaud, François, 35, 39, 40, 41, 51, 55 Beliankou, Andrei, 97, 98, 108 Benešová, Martina, 57, 66, 128, 130 Benjamini, Yoav, 187, 190 Bergsland, Knut, 171, 190 Best, Karl-Heinz, 96, 108, 114 Bollen, Johan, 202, 213 Borisov, Vadim V., 7, 9, 26, 33 Boroda, Mojsej, 4, 6, 89, 108 Bouchard-Côté, Alexandre, 173, 190 Box, George E., 152, 166 Brew, Chris, 182, 190 Brown, Cecil H., 178, 190, 191, 193 Buk, Solomija, 217, 219, 221, 227, 229 Bunge, Mario, 111, 122 Bychkov, Igor A., 9, 33 Carroll, John B., 2, 6 Carroll, Lewis, 217 Cavnar, William B., 182, 190 Čech, Radek, 29, 33, 71, 72, 86, 128, 130, 217, 229 Choi, Yejin, 206, 214 Chomsky, Noam, 59, 67 Chrétien, C. Douglas, 171, 192 Christiansen, Chris, 176, 190 Cleveland, William S., 155, 166 Cliff, Andrew D., 39, 55 Cohen, Avner, 59, 66
Coleridge, Samuel T., 13, 16, 17, 18, 20, 26, 27, 29, 30, 31 Cootner, Paul H., 202, 214 Covington, Michael A., 71, 86, 110, 122 Cramer, Irene M., 61, 66, 110, 123, 125, 130 Cressie, Noel A.C, 35, 55 Cryer, Jonathan, 152, 166 Danforth, Christopher, 202, 214 Daňhelka Jiří, 149, 166 de Campos, Haroldo, 217, 229 de Saint-Exupéry, Antoine, 217 Delen, Dursun, 204, 215 Dementyev, Andrey V., 9, 33 Dickens, Charles, 72, 74, 76 Dockery, Everton, 202, 214 Dodds, Peter, 202, 214 Dryer, Matthew S., 179, 190, 191 Dubois, Didier, 9, 10, 11, 12, 33 Dunning, Ted E., 183, 190 Durie, Mark, 171, 190, 192 Dyen, Isidore, 174, 175, 190 Eder, Maciej, 147, 148, 150, 152, 166, 167 Ellegård, Alvar, 171, 190 Ellison, T. Mark, 173, 191 Elvevåg, Brita, 59, 66 Embleton, Sheila M., 171, 191 Eppen, Gary D., 202, 214 Fama, Eugene F., 202, 214 Fedulov, Alexander S., 9, 26, 33 Felsenstein, Joseph, 172, 174, 191 Ferrer i Cancho, Ramon, 59, 66, 217, 229 Fontanari, José F., 217, 229 Foulds, Leslie R., 176, 192 Francis, Nelson W., 50, 55 Gallagher, Liam A., 202, 214 Gilbert, Eric, 202, 214 Glance, Natalie, 202, 214 Glass, Gene V., 152, 166 Goethe, 239 Goodman, Leo A., 184, 191 Gottman, John M., 152, 166 Gray, Russell D., 171, 173, 190, 191 Greenacre, Michael, 46, 55
254 Authors Index Greenhill, Simon J., 173, 176, 191 Gregson, Robert A., 152, 166 Grinstead, Charles M., 47, 55 Gruhl, Daniel, 202, 214 Gulden, Timothy R., 217, 229 Gumilev, Nikolay, 13, 26, 27, 28, 29, 30, 31 Gunawardana, Asela, 206, 214 Hammarström, Harald, 172, 189, 192, 193 Hašek, Jaroslav, 72, 73, 75 Haspelmath, Martin, 177, 179, 190, 191 Hauer, Bradley, 173, 191 Havel, Václav, 60, 61 Hay, Richard A., 152, 166 Herdan, Gustav, 151, 166 Hess, Carla W., 71, 86 Hochberg, Yosef, 187, 190 Hogg, Tad, 204, 214 Holman, Eric W., 172, 178, 189, 190, 191, 193 Hoover, David L., 7, 33 Hrabák, Josef, 161, 166 Hřebíček, Luděk, 57, 66 Hu, Fengguo, 128, 131 Huang, Kerson, 218, 229 Huberman, Bernardo A., 202, 203, 204, 213, 215 Huff, Paul, 176, 186, 191 Huffman, Stephen M., 176, 191 Inkpen, Diana, 175, 191 Isihara, Akira, 218, 229 Jäger, Gerhard, 176, 191 Jenkins, Gwilym M., 152, 166 Jireček, Josef, 149, 166 Juola, Patrick, 7, 33 Kak, Subhash, 201, 203, 215 Karahalios, Karrie, 202, 214 Kavussanos, Manolis, 202, 214 Keeney, Ralph L., 12, 33 Kelih, Emmerich, 58, 66, 127, 131 Kirby, Simon, 173, 191 Köhler, Reinhard, 4, 6, 29, 33, 89, 91, 94, 95, 96, 97, 98, 108, 109, 110, 123, 125, 130, 131, 133, 134, 145, 150, 151, 166, 167 Kohlhase, Jörg, 114, 122 Kondrak, Grzegorz, 173, 174, 191, 192 Koppel, Moshe, 28, 33 Kosmidis, Kosmas, 217, 229 Krajewski, Marek, 148, 150, 152, 167
Kroeber, Alfred L., 171, 192 Kruskal, William H., 184, 190, 191 Kubát, Miroslav, 110, 123 Kučera, Henry, 50, 55 Lafferty, John, 205, 214 Le Roux, Brigitte, 40, 55 Lebanon, Guy, 206, 214 Lebart, Ludovic, 39, 55 Lee, Kwang H., 18, 33 Lerman Kristina, 204, 214 Levenshtein, Vladimir I., 172, 192 Levik, Wilhelm, 13, 26, 27, 28, 29, 30, 31 Lewis, Paul M., 176, 192 Li, Wientian, 59, 67 Lin, Dekang, 182, 192 List, Johann-Mattis, 173, 192, 193 Liu, Haitao, 128, 131 Liu, Yang, 202, 203, 214 Lonsdale, Deryle, 176, 186, 191 Mačutek, Ján, 60, 66, 71, 86, 95, 108, 125, 127, 130, 131, 133, 135, 145 Mandelbrot, Benoit, 59, 67, 217, 229 Mao, Yi, 206, 213, 214 Mardia, Kanti V., 51, 55 Martynenko, Gregory, 71, 86 Matesis, Pavlos, 126, 127, 128, 129 McCleary, Richard, 152, 166 McDonald, Ryan, 206, 214 McFall, Joe D., 71, 86, 110, 122 McKelvie, David, 182, 190 Melamed, Dan I., 182, 192 Menzerath, Paul, 57, 67 Michailidis, George, 126, 127, 128 Michener, Charles D., 172, 193 Mikros, George K., 7, 33 Milička, Jiří, 110, 123, 125, 128, 131 Miller, George A., 50, 55, 59, 67 Miller, Rupert G., 80, 86 Milliex-Gritsi, Tatiana, 126, 127, 128 Mishne, Gilad, 202, 214 Mitzenmacher, Michael, 59, 67 Miyazima, Sasuke, 217, 229 Molière, 42 Moran, Patrick, 39, 55 Müller, Dieter, 71, 86 Naumann, Sven, 28, 33, 89, 91, 94, 95, 96, 97, 98, 108, 125, 131, 133, 145
Authors Index 255 Nei, Masatoshi, 172, 192 Nichols, Johanna, 173, 192 Nordhoff, Sebastian, 172, 192 Ogasawara, Osamu, 217, 229 Ord, John K., 39, 55 Ourednik, André, 49, 55 Page, Lawrence, 47, 55 Pak, Alexander, 202, 214 Paroubek, Patrick, 202, 214 Pawłowski, Adam, 147, 148, 150, 151, 152, 166, 167, 170 Pedersen, Ted, 51, 55 Perlovsky, Leonid I., 217, 229 Petroni, Filippo, 175, 192 Poe, Edgar Allan, 51 Pompei, Simone, 176, 186, 192 Popescu, Ioan-Iovitz, 28, 33, 71, 72, 77, 79, 84, 85, 86, 219, 229, 232, 239, 241, 245 Prade, Henri, 9, 10, 11, 12, 33 Qian, Bo, 202, 214 Quattoni, Ariadna, 206, 215 Raiffa, Howard, 12, 33 Rama, Taraka, 171, 173, 176, 192, 193 Rasheed, Khaled, 202, 214 Ratkowsky, David A., 71, 86 Rentoumi, Vassiliki, 201, 206, 215 Resnik, Philip, 50, 51, 55 Richards, Brian, 71, 86 Robinson, David F., 176, 192 Romero, Daniel M., 204, 215 Ross, Malcolm, 171, 190, 192 Ross, Sheldon M., 238, 241, 245 Rouanet, Henry, 40, 55 Rovenchak, Andrij, 127, 131, 217, 218, 219, 221, 227, 228, 229 Rudman, Joseph, 7, 33 Saaty, Thomas L., 18, 26, 33 Sadamitsu, Kugatsu, 206, 215 Saitou, Naruya, 172, 192 Sanada, Haruko, 125, 131, 133, 145 Sankoff, David, 171, 192 Schmid, Helmut, 42, 51, 55 Serva, Maurizio, 175, 192
Sharda, Ramesh, 204, 215 Sherif, Tarek, 174, 192 Sheskin, David J., 187, 192 Shimoni, Anat R., 28, 33 Singh, Anil K., 176, 183, 192 Skiena, Steven, 203, 215 Snell, Laurie J., 47, 55 Sokal, Robert R., 172, 193 Solé, Ricard V., 66 Solovyev, Alexander P., 9, 33 Stede, Manfred, 98, 108 Surana, Harshit, 176, 192 Swadesh, Morris, 171, 172, 174, 175, 178, 189, 193 Szabo, Gabor, 204, 215 Tate, Robert F., 184, 193 Taylor, Mark P., 202, 214 Tešitelová, Marie, 71, 86 Torgeson, Warren S., 51, 55 Trenkle, John M., 182, 190 Trevisani, Matilda, 111, 112, 123 Tuldava, Juhan, 71, 87 Tuzzi, Arjuna, 109, 111, 112, 123 Vogt, Hans, 171, 190 Wälchli, Bernhard, 232, 245 Weizman, Michael, 71, 87 Wichmann, Søren, 173, 175, 178, 184, 188, 189, 190, 191, 193 Wieling, Martijn, 176, 193 Wilson, Victor L., 152, 166 Wimmer, Gejza, 5, 6, 71, 87, 96, 108, 110, 123, 125, 127, 131, 135, 145 Woronczak, Jerzy, 147, 148, 150, 153, 165, 167 Xanthos, Aris, 35, 40, 55 Xanthoulis, Giannis, 126, 127, 128 Yamamoto, Keizo, 217, 229 Yangarber, Roman, 189 Yu, Sheng, 201, 203, 214, 215 Zernov, Mikhail M., 21, 34 Zhang, Wenbin, 203, 215 Zimmermann, Hans-Jürgen, 11, 34 Zörnig, Peter, 231, 239, 241, 242, 245 Zysno, Peter, 11, 34
Authors’ addresses Altmann, Gabriel Stüttinghauser Ringstrasse 44, DE-58515 Lüdenscheid, Germany email: [email protected] Andreev, Sergey N. Department of Foreign Languages Smolensk State University Przhevalskogo 4, RU-214000 Smolensk, Russia email: [email protected] Bavaud, François Department of Language and Information Sciences University of Lausanne CH-1015 Lausanne, Switzerland email: [email protected] Benešová, Martina Department of General Linguistics Faculty of Arts Palacky University Křížkovského 14, CZ-77147 Olomouc, Czech Republic email: [email protected] Borin, Lars Department of Swedish University of Göteborg Box 200, SE- 40530 Göteborg, Sweden email: [email protected] Borisov, Vadim V. Department of Computer Engineering Smolensk Branch of National Research University “Moscow Power Engineering Institute” Energeticheskiy proezd 1, RU-214013 Smolensk, Russia email: [email protected]
258 Authors’ addresses Čech, Radek Department of Czech Language University of Ostrava Reální 5, CZ-77147 Ostrava, Czech Republic email: [email protected] Cocco, Christelle Department of Language and Information Sciences University of Lausanne CH-1015 Lausanne, Switzerland email: [email protected] Eder, Maciej Institute of Polish Studies Pedagogical University of Kraków ul. Podchorążych 2, PL-30084 Kraków, Poland and Institute of Polish Language Polish Academy of Sciences al. Mickiewicza 31, PL-31120 Kraków, Poland email: [email protected] Köhler, Reinhard Computerlinguistik und Digital Humanities University of Trier FB II / Computerlinguistik, DE-54286 Trier, Germany email: [email protected] Krithara, Anastasia Institute of Informatics and Telecommunications National Center for Scientific Research (NCSR) ‘Demokritos’ Terma Patriarchou Grigoriou, Aghia Paraskevi, GR-15310, Athens, Greece email: [email protected] Mačutek, Ján Department of Applied Mathematics and Statistics Comenius University Mlynská dolina, SK-84248 Bratislava, Slovakia email: [email protected]
Authors’ addresses 259
Mikros, George K. Department of Italian Language and Literature School of Philosophy National and Kapodistrian University of Athens Panepistimioupoli Zografou, GR-15784 Athens, Greece email: [email protected] and Department of Applied Linguistics College of Liberal Arts University of Massachusetts Boston 100 Morrissey Boulevard, US-02125 Boston, MA, United States of America email: [email protected] Milička, Jiří Institute of Comparative Linguistics Faculty of Arts Charles University Nám. Jana Palacha 2, CZ-11638 Praha 1 email: [email protected] and Department of General Linguistics Faculty of Arts Palacky University Křížkovského 14, CZ-77117 Olomouc, Czech Republic Pawłowski, Adam University of Wrocław Institute of Library and Information Science pl. Uniwersytecki 9/13, 50-137 Wrocław, Poland email: [email protected] Rentoumi, Vassiliki SENTImedia Ltd. 86 Hazlewell road, Putney, London, SW156UR, UK email: [email protected]
260 Authors’ addresses Rama K., Taraka Department of Swedish University of Göteborg Box 200, SE- 40530 Göteborg, Sweden email: [email protected] Rovenchak, Andrij Department for Theoretical Physics Ivan Franko National University of Lviv 12 Drahomanov St., UA-79005 Lviv, Ukraine email: [email protected] Tuzzi, Arjuna Department of Philosophy, Sociology, Education and Applied Psychology University of Padova via Cesarotti 10/12, IT-35123 Padova, Italy email: [email protected] Tzanos, Nikos SENTImedia Ltd. 86 Hazlewell road, Putney, London, SW156UR, UK email: [email protected] Xanthos, Aris Department of Language and Information Sciences University of Lausanne CH-1015 Lausanne, Switzerland email: [email protected] Zörnig, Peter Department of Statistics, Institute of Exact Sciences, University of Brasília, 70910-900, Brasília-DF, Brazil email: [email protected]