Quantitative Linguistics
Editors: Reinhard Köhler, Gabriel Altmann, Peter Grzybek
Volume 68
Gabriel Altmann, Reinhard Köhler
Forms and Degrees of Repetition in Texts
Detection and Analysis
DE GRUYTER MOUTON
ISBN 978-3-11-041179-9
e-ISBN (PDF) 978-3-11-041194-2
e-ISBN (EPUB) 978-3-11-041202-4
ISSN 0179-3616

Library of Congress Cataloging-in-Publication Data
A CIP catalog record for this book has been applied for at the Library of Congress.

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

© 2015 Walter de Gruyter GmbH, Berlin/Munich/Boston
Printing and binding: CPI books GmbH, Leck
♾ Printed on acid-free paper
Printed in Germany
www.degruyter.com
Foreword

The present volume is not a textbook of general quantitative text analysis; it focuses on only a single, yet important, aspect of textual phenomena: repetition. The main purpose of the book is, obviously, the introduction of quantitative models and methods for the analysis of repetition types and their individual characteristics. Another significant aim is to show that the creation of a text is only to a certain extent controlled by rules. It goes without saying, of course, that an author is free to shape his/her text within the boundaries set by grammatical and other linguistic rules and that these rules can even be violated. What is known to a much lesser extent, and what may become clearer to the reader of the following chapters, is that there are text laws which govern the properties of every text beyond the conscious control of the author and beyond any grammatical description.

The lawfulness of textual phenomena on the one hand and the freedom of the author on the other are only seemingly contradictory. Only against the background of universal laws is it possible to recognise the individual characteristics of an author or a text. Individuality can be considered a specific set of values of the parameters of universal laws.

Repetition is just one of the textual phenomena which abide by laws. By now, we know only a few corresponding hypotheses. The existence of these laws must be postulated not only a priori, as required for any kind of research, but is also a consequence of the fact that texts are made within a linguistic community and have to meet certain needs. Some of these needs co-operate with each other; others compete. The dynamic system consisting of co-operating and competing processes produces linguistic mechanisms which are observable only through their consequences, in the form of laws. A text is the product of self-regulating and self-organising cognitive and social mechanisms. New hypotheses must be set up to investigate these mechanisms, and a multitude of texts and their properties must be studied. The empirical results are at the same time the data by means of which the hypotheses are tested, and they provide heuristic stimulation for further hypotheses.

Text linguistics as a discipline has brought valuable conceptual achievements and progress in the description of text structure. Currently, however, the epistemological status of this discipline does not go beyond the domain of definitions and empirical generalisations and is, thus, still far from an explanatory state. Maybe the time has not yet come to pursue a higher research goal, i.e. the formulation of a text theory. Nevertheless, this book wants to give corresponding impulses and to show some methodological approaches on the basis of no more mathematics than is absolutely needed. It combines the categorical analysis of repetition into nine forms as described by Altmann (1988) and his methodological approach with recent research results.

G.A. and R.K.
Contents

1 Introduction | 1
1.1 The universe of discourse | 1
1.2 Forms of repetitions | 4
1.3 Prolegomena for a future text theory | 6

2 Shapeless Repetition | 11
2.1 Phonic effects | 12
2.2 Two variables | 17
2.2.1 The activity index | 17
2.2.2 Excursus | 21
2.2.3 Comparing activity indices | 25
2.3 Global indicators | 31
2.3.1 Entropy | 33
2.3.2 Comparison of two entropy values | 37
2.3.3 The repeat rate | 38
2.3.4 Moments of distributions | 40
2.4 Modelling using probability distributions | 49
2.5 Some text laws | 63
2.5.1 The Zipf-Mandelbrot Law | 63
2.5.2 The Simon-Herdan Model | 73
2.5.3 Hřebíček's Reference Law | 77
2.5.4 Type-Token Models | 80
2.5.5 Perspectives | 89

3 Positional repetition | 91
3.1 The rhyme in "Erlkönig" | 91
3.2 Open rhymes | 94
3.3 The gradual climax | 97
3.3.1 The linear climax | 97
3.3.2 Reduced climax | 99
3.3.3 The exponential climax | 101
3.4 Other positional repetitions | 106

4 Associative repetition | 111
4.1 Associative repetition of two words | 113
4.1.1 Short texts | 114
4.1.2 Long texts | 117
4.2 Presentation | 120
4.3 The minimal (acyclic) graph | 121
4.4 Vistas | 123

5 Iterative repetition | 127
5.1 Binary sequences | 128
5.2 Large samples | 133
5.3 Comparison of runs in two texts | 133
5.4 Runs of more than two kinds of elements | 135

6 Aggregative repetition | 137
6.1 Random distances: binary data | 137
6.2 Models of aggregation trends | 141
6.3 Brainerd's Markov-chain model | 146
6.4 Non-binary data | 155
6.5 Similarity aggregation | 158
6.6 Vistas | 161

7 Repetition in blocks | 163
7.1 The Frumkina law | 164
7.2 Testing the Frumkina law | 166
7.3 Vistas | 174

8 Parallel repetitions | 177
8.1 Cochran's Q-test | 178
8.2 Analysis of variance | 179
8.3 The chi-square test | 182

9 Cyclic repetition | 187
9.1 Fourier analysis | 189

10 Summary | 193

References | 195
Index | 203
1 Introduction

1.1 The universe of discourse

Repetition of textual elements is more than a superficial phenomenon. Repetition may even be considered as constitutive for units and relations in a text: on a primary level when no other way exists to establish a unit – as in a musical composition (a motif can be recognised as such only after at least one repetition) – and on a secondary, artistic level, where repetition is a consequence of the transfer of the equivalence principle from the paradigmatic axis to the syntagmatic one (cf. e.g., Jakobson 1971).

For our purposes, we do not presuppose any specific definition of the term text. Any meaningful spoken or written sequence of linguistic units in a natural language can be analysed by the methods presented in this volume. We do not even presuppose fundamental properties such as cohesion and coherence; on the contrary: there are objective methods which can be used to find such properties. In our examples, we will analyse only written texts, which makes the procedures easy to follow and to reproduce by the reader.

We will use the term textual unit to denote any phenomenon in a text which can be defined in an operational way, i.e. a phenomenon which can be identified unambiguously on the basis of a set of criteria and whose properties can be measured. Here is a list of some commonly considered textual units:

Character, Grapheme, Phoneme, Syllable, Morph/Morpheme, Word-form, Phrase, Bar, Clause, Sentence, Paragraph, Sememe, Metrical foot, Lemma, Lexeme, Syntactic construction, Hreb, Motif
Numerous textual units can be found in works on text linguistics such as Koch (1969, 1971), Daneš / Viehweger (1977), Gottman / Parkhurst (1980), Dressler / de Beaugrande (1981), Hřebíček (1997, 2000). All such units possess (or better: can be ascribed) an unlimited number of properties, with respect to which the researcher can classify the units. Some example properties of the word HOUSE are the following:
Property                                                             Value
Length in characters                                                 5
Length in syllables                                                  1
Length in morph(eme)s                                                1
Part-of-speech                                                       Noun
Number of meanings (polysemy)                                        n
Number of metaphorical meanings                                      k
Polytextuality (no. of texts in a corpus in which the unit occurs)   t
Number of compounds (productivity)                                   m
Number of derivations (productivity)                                 d
Inflectional paradigm                                                regular
Relative frequency                                                   p
Origin                                                               Germanic
Number of contexts (polytextuality)                                  r
Number of synonyms                                                   x
Number of variants in a dialect atlas                                y
Every unit can also be considered from the point of view of its function in the discourse and with respect to other relations. Some of them are listed here:

Grammatical function, Reference, Co-reference, Anaphora, Cataphora, Poetic figure, Speech act, Metaphor, Argumentation
Each of them can be subdivided into individual kinds or categories: the argumentation relations, for instance, into {background, circumstance, non-volitional cause, volitional cause, contrast, condition, concession, evaluation, elaboration, …}, or speech acts (after Searle 1969) into {assertives, directives, commissives, expressives, declarations}. Thus, not only the units themselves can be studied but also their properties, i.e. classes of units sharing one or more properties, and their functions.

Generally, the term repetition will be used in the sense of occurring several times in a text. There are various reasons why a textual unit may occur repeatedly in a text:

Limitation of inventory. If an inventory is small, the corresponding units must occur with a frequency greater than one. In an English sentence with, say, 50 characters, some characters will occur repeatedly since the character inventory is smaller than 50. On the other hand, sentences need not be repeated for reasons of inventory size because there is no limited number of sentences in a natural language. The repetition of units from a finite but not too small inventory, such as the morpheme inventory of a language, may be used as an indicator of creativeness.
Grammar. Certain kinds of repetitions are due to the rules of the grammar of a language. Function words occur much more often than content words; they serve grammatical functions, e.g. as determiners, prepositions, conjunctions etc. Specific repetitive patterns and distributions concerning function words are commonly used as characteristics of authors for the purposes of stylometrics and forensic linguistics.

Thematic bond. The words which belong to the semantic field of the topic presented in a text occur more often than others. Repetitions of this kind can be used to differentiate text sorts. Technical texts will contain, as a rule, more thematically caused repetitions than novels.

Discourse functions, such as emphasis.

Stylistic, aesthetic factors. Some of the repetitions which can be observed in texts may be due to special textual functions. The author may want to underline some elements or to produce stylistic or poetic effects such as rhythm and euphony.

Perseveration. The repetition of textual units can also have very individual causes. It has been observed that repetitions can be due to mental diseases and self-stimulation. This is why repetition patterns can be used for psychiatric diagnoses, cf. Mittenecker (1953), Breidt (1973), Möller, Laux, Deister (2009).

Information flow. The author of a text wants, while he/she is writing, to make sure that the addressee will be able to understand it. Among other factors, the amount of information conveyed per text passage must not exceed a threshold, which varies with the type of reader. A consequence with respect to the text vocabulary is that a constant flow of new words would overstress the reader. Words must be repeated to avoid this stress and also to make it possible to explicitly discuss difficult concepts.
It is not always clear whether a repetition was employed intentionally or unintentionally. The only person who might be able to answer this question is the author himself (cf. Jakobson 1971). However, we are not so much interested in the intentionality of an observed property but rather in the question whether a given phenomenon, such as a repetition pattern, can be considered significant – which means that it is a potential indicator of some meaningful text or author characteristic. Therefore, we will first and foremost show ways (1) to construct and apply useful objective measures of text phenomena and (2) to test whether the results of measurements are significant or random. Randomness in this sense means that a result is not guaranteed to be an indicator of a text characteristic. We conduct investigations of repetitions for the following reasons:
1. Characterisation of texts by means of parameters (measures, indicators), either taken from established mathematical statistics or specifically constructed in individual cases.

2. Comparison of texts on the basis of their quantitative characteristics and classification of the texts by the results.

3. Research into the laws of text which control the mechanisms connected to text generation – with the construction of a theory of text, consisting of a system of text laws, as a remote aim.

The ultimate aim of every possible quantitative text analysis is the construction of a text theory; we must, however, admit that we are still far from reaching this aim. We have been successful in finding and formulating a number of text laws, and this success contributes to our confidence that methods such as those presented in this volume will help to achieve the aims of our demanding program (cf. Altmann 2009). The first two reasons serve philological purposes; they can be applied in forensic linguistics for author attribution and various other aims, and they are involved in applied psychology and psychiatry for diagnostic purposes and for the documentation of changes in mental states. The third one is, of course, a matter of pure science but – as experience in other fields shows – is very likely to form the theoretical background for advanced applications.
1.2 Forms of repetitions

The most familiar form of repetition is the rhyme. This form is a deterministic one; nevertheless, probabilistic models and statistical methods can be applied because the individual properties of the rhymed elements can possess stochastic features. In the following section, we will consider mainly variants of repetition which are tendencies, or deviations from total determination or total randomness. First, we will present a short overview of the most salient forms of repetition; then these forms will be discussed in more or less detail. The following forms exist:

Absolute repetition. This formless variant of repetition corresponds to the pure occurrence of a unit in a text; it is the simple frequency of a given unit in the text. A significant frequency is assumed if the observed frequency is significantly higher than the theoretically expected one. The observed number of occurrences can, of course, also be lower than expected. In both cases, the deviant frequency may
have a specific function for the text. Absolute frequency always abides by a quantitative text law, the parameters of which are characteristic of the unit or the text or both.

Positional repetition. A unit may have an unexpected (higher or lower) frequency at a given position, e.g. at the beginning or end of a verse, word-initially or word-finally etc. Examples in poetry are alliteration, preferences for vowels in certain positions, rhyme, assonance etc.

Associative (or configurative) repetition. A unit coincides more often than expected with another one in a given frame, e.g. in a sentence, a verse, or a text within a corpus. This repetition may be an indicator of word connotations, of unconscious associations of the text author, of terminology, and of many other things.

Iterative repetition. A unit forms non-random runs, i.e. uninterrupted sequences. This is found in the first place with formal units such as kinds of metrical feet, metrical patterns, and word length sequences, and – depending on the text sort – also with sentence types and others.

Aggregative repetition. A unit is concentrated in some passages of a text, i.e. it forms clusters. This may be a consequence of Skinner's “formal enhancement”, a kind of self-stimulation. It can be recognised by a large number of small distances, and only rare large ones, between occurrences in a text.

Aggregative repetition of similar units. The same tendency can be observed not only with identical units but also with similar ones. It is known that neighbouring verses in folk poetry are phonetically more similar to each other than distant ones.

Repetition in blocks. A unit shows a lawful distribution over text blocks. When a text is segmented into passages of identical length and the frequency of a unit in the individual blocks is counted, the distribution of the number of blocks in which the unit occurs x times behaves according to a text law. The parameters of the distribution found for a text and a selected unit are characteristic of the text and the unit.
Parallel repetition. Identical or similar units occur at places which are parallel with respect to a given frame, e.g. a verse pair. The most familiar form is the rhyme. In folk poetry, phonetic associations and semantic parallels can also be found.

Cyclic repetition. The repetition of units can be regular to the extent that the units form cycles. This phenomenon has been intensively investigated with respect to stress positions in verses.

The above list does not claim to be complete. No description, not even a theory, is complete at a given moment in time; new aspects, new properties, and new phenomena are detected all the time. Our study covers those aspects of repetition which have been found so far – even if only conceptually – and is intended to serve as a handbook of methods for the investigation of repetitions. It is by no means already a theory; following the results of the philosophy of science, a theory is a system of laws of the object under study. We present some laws in this volume, but we are still on the laborious path towards deriving properties of texts theoretically, in a deductive way. The methods shown in this book shall serve as inspiration to conduct further investigations and to inductively capture the wealth of the behaviour of text properties. Most of the methods are useful for testing the existence of general or text-specific tendencies. The following chapters will not deal with singularities, i.e. with individual cases such as the sporadic occurrence of some poetic figures, as is the preferred activity in qualitative stylistics (cf. e.g. Gonda 1959; Mason 1961; Austerlitz 1961); we are interested in forms of repetition which abide by laws, which form an objective tendency, and whose latent existence can be shown by means of objective methods.
1.3 Prolegomena for a future text theory

It is more common nowadays to speak of text linguistics than of text theory. The development of flawless conceptual systems which cover as many aspects of text as possible is doubtless theoretical research. Among the aspects that have to be covered are the definition of the concepts and their operationalisation, the compilation of criteria for text segmentation and hierarchical structuring, description, and classification. All these aspects are necessary but not sufficient conditions for the construction of a theory. A theory consists of
– concepts,
– conventions (definitions, criteria, operations),
– a system of hypotheses, of which at least some must be laws.
It is common to add the ceteris-paribus conditions to this list; there is no law without them. Concepts such as those known in text linguistics can answer questions as to what exists in texts. Conceptual systems of this kind define ontologies based on certain views on texts. Theory construction begins beyond the definition of concepts; it is based on questions as to how text structures arise and why they are as we observe them. In other words, theory construction begins with the creation of hypotheses.

We will illustrate this with a simple example. Many papers and books have been written about the kinds of reference in texts. In the literature, one can find detailed descriptions, classifications, and criteria to identify these entities and to find them in texts. Thus, the preconditions for the construction of a text theory are given, but the theory does not yet exist. We consider the work by Luděk Hřebíček (cf. 1986) an important step in the first phase of the development of a text theory. He gave a theoretical justification for the increase of the number of references in dependence on the current text position and on the text vocabulary. He partly tested his hypothesis empirically, connected it with hypotheses about the lexical richness of a text, and achieved in this way the first law of references.

It is by no means sufficient for the construction of a theory to collect hypotheses. These can have very different epistemological status. Bunge (1969: 256 f.) differentiates four levels of hypotheses depending on the kind of link they have to the theoretical and empirical bodies of a science. These four levels or kinds of hypotheses are:

Guesses, which can neither be confirmed nor supported by empirical data, nor deduced from theoretical considerations.

Inductive hypotheses (or empirical generalisations), which are based on observations or data; they may be well supported by empirical findings. They are the initial form of hypotheses in every science, each of them isolated and without theoretical systematisation. Linguistic universals and other typological hypotheses are of this type, as are the statements in grammars and text linguistics.

Plausible (or deductive) hypotheses, which are deduced from a theory but have not yet been tested on empirical material, although they are testable.

Confirmed hypotheses, which are both theoretically justified and empirically supported, i.e. deduced and tested hypotheses. They are called laws if they are universal in their scope and embedded in a theoretical system.
Laws are exactly the kind of hypotheses which are needed to form a theory. It may be cumbersome to find and formulate a law; this aim can hardly be achieved without mathematical means. The testing of a hypothesis by processing and systematically evaluating texts is not possible without appropriate methods of measurement and counting procedures; it is not possible to determine the degree of confirmation a hypothesis enjoys without statistics. The reason for this is that language and text do not behave in a deterministic way but display tendencies, exceptions, and trends rather than rigid patterns. We are not going to say that empirical generalisations, as common in text linguistics and in linguistics in general, are useless or of little value. On the contrary, they indicate research programs, inspire the quest for laws, give empirical support, and are part of any empirical science anyway. The following steps form the road to theory construction:

1. Defining concepts is the fundamental first step, without which elements, properties, and aspects of text cannot be detected.

2. Setting up hypotheses in the form of empirical generalisations on the basis of findings in texts. They may concern structures, dependences, developments, repetition patterns etc.

3. Deducing statements which correspond to assumptions about text phenomena. This is the step where theory construction proper begins. Before that, questions as to how and why cannot be tackled, let alone answered. Detailed descriptions of text phenomena and, most importantly, explanations of the mechanisms of text generation become possible only with a system of laws. We expect that co-operation with neighbouring scientific disciplines such as psychology will be necessary in later phases, but the kernel of a text theory should be constructed in a linguistic framework.

4. Testing the theoretical conclusions on data from authentic texts, using statistical methods such as fitting mathematical models to the empirical data and conducting goodness-of-fit (significance) tests.

5. Systematising the hypotheses, i.e. integrating them into a larger framework by interconnecting them with other hypotheses and laws, and building a system of text laws which are universally valid for all kinds of texts in every natural human language.

As long as the term theory denotes what is called a theory in the philosophy of science (which is not always the case in linguistics), there is no other way to set up a text theory. The history of all factual sciences agrees with this statement, and we cannot see any argument why linguistics and text linguistics should form an exception.
The statement that language is a system is as old as structuralism. At that time, this denomination had almost no consequences; it was just a methodological foundation for the models under development. Under the hegemony of formal linguistics in the following decades, linguists ignored systems-theoretical concepts although they were developed at the same time as structuralism (von Bertalanffy 1949, 1950a,b). In other disciplines, systems theory became the most powerful methodology of the century. Its success was shared, at least to some extent, even by linguistics. After Zipf (1949), many linguists were at pains to introduce the conceptualisations of systems theory into linguistics (cf. Koch 1974; Nöth 1974, 1975, 1977, 1978, 1983; Oomen 1971; Schweizer 1979; Wildgen 1985 etc.; Köhler, Altmann 1983). More recently, models with a pronounced systems-theoretical orientation (cf. Strauss 1980; Köhler 1986) have been presented. Still, the status of text has not yet been clarified.

It is, of course, worthwhile and appropriate at this point to discuss the status and position of the entity text within a more or less developed theoretical linguistic framework. The most advanced and (with respect to the philosophy of science) most reflected approaches in linguistics are the Unified Theory (Wimmer, Altmann 2005) and Synergetic Linguistics (Köhler 1986, 2005). Both of them have been developed within Quantitative Linguistics against the background of systems theory (von Bertalanffy 1949, 1950a,b, …) and its modern advancements and specialisations, in particular synergetics (H. Haken 1978).

In this context, we consider texts as structured objects which are produced by 'speakers' and perceived and processed by 'hearers', serving as means of communication. Besides hearers, speakers, and texts, 'language' is an element of the communication system, which also contains other means of communication such as body language, traffic signs etc., in short all semiotic systems. Each semiotic system can be characterised as an abstraction from the actual linguistic or other communicative behaviour, hence as a model, and at the same time as a code system which provides the means to produce and process texts of the respective type. On the other hand, linguistic behaviour, called 'parole' by de Saussure, changes the respective semiotic system (because it is and continues to be an abstraction from the changing communicative behaviour, and thus from the totality of the texts which are produced and perceived within a given period of time). The influence of a given individual text on the change of a language depends on a multitude of parameters, among them the number of hearers/readers of the text and the number of times the text is processed, the prestige of the text producer/author or of the text, the degree to which the given text differs from established usage, the attractiveness (usefulness in various respects) of the innovations, and the degree of 'linguistic immunity' of the audience.
In synergetic linguistics, speakers and hearers, together with all their physiological and cognitive properties which are relevant to language use, are part of the environment of the language system and interact with it in a similar way as other parts of the environment, such as the physical world (acoustic and optical channels). Among the complicated interrelations between the elements of all these systems and sub-systems, there are effects which have been identified and investigated with respect to certain properties of language and text. Some of them have been called 'forces', e.g. Zipf's 'forces' of unification and diversification. Some processes co-operate with each other in one respect while they may compete in another. We are not going to go into detail here; the reader may consult the cited literature. Only where we refer to such forces, processes, and elements of the self-organising and self-regulating communication system and its sub-systems in order to formulate theoretically based models of this or that text phenomenon will limited aspects of synergetic linguistics come into play. We would nonetheless underline that a theoretical background such as the one sketched here will be indispensable for the construction of a text theory proper, i.e. a system of laws which can explain the observed structures and properties of text. Although this volume concentrates on a single aspect of text structure, repetition patterns, i.e. only one of the infinitely many aspects which can be investigated, we are confident that it may serve as a good example of how quantitative methods are used to unveil and model text properties, how hypotheses are tested, and how candidates for text laws can be found.
2 Shapeless Repetition

The measure of shapeless repetition, i.e. of the multiple occurrence of a unit in a text without any boundary conditions, is known as the frequency of the unit in the given text. The observed frequency of a unit is the result of a large number of effects and influences, and the frequencies of the individual text units naturally differ. We will be concerned with methods to evaluate the frequencies and to decide whether a given frequency shows a particular, unexpected, statistically significant value. We differentiate four cases:

1. The frequency of a unit as compared to an expected value.

2. The frequency of a unit with respect to the frequency of another one. This case can easily be generalised to the case of several other units, which opens up a wealth of variants for study.

3. The frequency of all the units of a given class, which cover the complete text.

4. The search for text laws by which the empirical regularities of the frequency structure abide.

In the following sections, the central task is to find, on the basis of observed frequencies, the probability of the phenomenon under study. The knowledge of the probability of an event – e.g., of the occurrence of a unit or of a number of occurrences – is an indispensable prerequisite for reliable evaluations and predictions. A difference between two texts with respect to some criterion can be expected or surprising; the only objective guideline for deciding questions of this kind is the probability of the given difference. Sometimes, a mathematical estimation of a probability is used, viz. when the exact probability is not available due to a lack of theoretical knowledge. In many cases, however, the probability can be calculated simply on the basis of the number of possible configurations and the number of those configurations which are in the focus of the given study. Often, well-known probability distributions are available for a problem; in other cases, these distributions cannot be applied directly but only after certain transformations. We will introduce each method presented and used in this book with illustrations from real examples.
2.1 Phonic effects

Let us consider the occurrence of an isolated unit, say a sound, in a text or in a part of a text. Let us now pose the question whether its frequency (without taking account of its position) can cause a (eu)phonic effect. If this is so, then the frequency of the unit will be somehow unexpected. There are three methods to answer this question:

1. Asking the author of the text to tell the researcher whether a euphonic effect was intended. This method has some limits; e.g. Homer would not be able to tell us. More important is the problem that the effect need not be caused consciously.

2. Asking a sufficiently large number of text recipients who must be native speakers of the author's language. This method suffers from many limitations, too. It may even turn out to be impossible. In any case, the result of such a query will depend on the subjects; differences in age, education, social status and other factors may influence the individual impression.

3. Abstaining from subjective assessments and applying objective statistical methods. This does not yield a direct answer to the question as to the euphonic function of the sound; instead, a number, a probability, will have to be interpreted. This method entails advantages over subjective procedures: (a) The result can be reproduced by other researchers; it can be checked and compared with the results of colleagues. If the method is applied correctly, the results should be the same (within a minimal tolerance). (b) The conclusions drawn on the basis of a study always bear the risk of a mistake. This risk can be calculated exactly if a statistical investigation has been conducted. (c) From such a conclusion and its interpretation in the light of the initial hypothesis, not only the fact that the phonic effect is indeed (or is not) existent but also the source of the effect and how this source must be shaped to cause the effect can be discovered.

A poetic text does not contain any units (e.g. sounds, words, or their combinations) which could not appear in a prose text, too; phonic effects can arise only from their specific positions, deviating from their regular positions in a neutral text, or from an unexpected frequency value. In this chapter, we will discuss only the second variant.
The question whether a unit has an unexpected frequency can be answered by showing that the observed frequency, or an even larger frequency value, is very unlikely. The only step in the procedure where a subjective decision has to be made is the choice of the threshold at which the probability is evaluated as “very small”. This threshold is set by convention or determined on the basis of the needs of the current problem. Thus, quantitative linguistics proceeds in the same way as the natural sciences, where the value of the threshold may be a matter of life or death.

We subdivide the problem into individual steps (cf. Altmann 1973) and illustrate this with an example. We select this example from an ‘exotic’ language, viz. Indonesian, which may help to avoid a content-based bias. The first stanza of the poem “Bunda dan anak” by Roestam Effendi reads (in phonological transcription) as follows:

Masaq jambaq buah sěbuah
dipěram alam diujuŋ dahan
merah darah běruris-uris
běndera masaq bagi sělera.

Our philological question is: Does the phoneme /a/ evoke a (eu)phonic effect in this first stanza? Another formulation, somewhat closer to a statistical question: Is the occurrence of the unit /a/ unexpected with respect to the number of times it appears? Other ways to ask this question are possible; e.g., a formulation in the language of statistics is: What is the probability of the unit /a/ appearing at each vowel position? Another one: Does the observed frequency of /a/ differ significantly from the expected (theoretical) frequency? To prepare the answer to such questions, hypotheses are set up in a specific way. In an informal formulation, our present question would be:

Null hypothesis (H0): The observed frequency of /a/ in the first stanza does not differ from the expected one.
Alternative hypothesis (H1): The observed frequency of /a/ in the first stanza is greater than expected.
On the basis of the hypotheses, a mathematical model is set up which simulates the statistical behaviour of the unit under study and produces the numerical result. In our case, the following procedure can be applied:
Let p be the probability of /a/ in a population. We can estimate the value of p in several ways (the “true” value is inaccessible), depending on how we define the corresponding population. Again, there are several options. We could define our population as

(a) the complete poem;
(b) all poems by Roestam Effendi;
(c) the totality of Indonesian poetry;
(d) the totality of Indonesian texts.
Each of these options is connected with complications, as Orlov (1982) showed. If the complete poem is /a/-“biased”, no result would be obtained with option (a). Option (b) is the only acceptable one if we follow Orlov's findings. However, the totality of the poetic texts written by an author could be /a/-“biased”, too. Therefore, in case of doubt, all the alternatives must be considered. Here, we want to give just a demonstration of the procedure; therefore we will use the only material we have at our disposal, viz. a count of 23,000 phonemes from Indonesian prose, as the basis for an estimation. From these data, the relative frequency p̂ₐ = 0.2227 is obtained for /a/ (the symbol ^ is used to indicate that a number is the result of an estimation).

Let us assume that we have four positions for vowels per verse, and let the probability of /a/ be p and the probability of any other vowel /b/ be q (q = 1 − p). Then the probability for /a/ to occur exactly once in a verse is composed of the following events:

a b b b,   b a b b,   b b a b,   b b b a

and the corresponding probabilities are pqqq, qpqq, qqpq, qqqp, or, put together, 4pq³. The complete statement about the probability for /a/ to occur once in a verse is

P(X = 1) = 4pq³,

where P denotes the probability of the event (or the result of an experiment) in the parentheses. Similarly, using the fact that there are 6 permutations with 2 p and 2 q, the probability of /a/ occurring 2 times in a verse is easily determined:

P(X = 2) = 6p²q²

etc. The general expression is

P(X = x) = C(4, x) pˣ q⁴⁻ˣ,

where the so-called binomial coefficients can be written as

C(n, x) = n! / [x!(n − x)!],

with k! = k(k − 1)(k − 2)…3·2·1 called the factorial. A formula for the case where the number of vowels per verse is not yet fixed is

(2.1.1)  P(X = x) = C(n, x) pˣ qⁿ⁻ˣ,

corresponding to the probability of x favourable and (n − x) unfavourable events out of n events. We have implicitly assumed here that the occurrences of vowels are independent of each other, an assumption which is in conflict with reality. Therefore, the model presented here is only an approximation. Now we will consider the question as to the probability of /a/ occurring at all four vowel positions of the verse. The calculation yields

P(X = 4) = C(4, 4) p⁴ q⁰ = 0.2227⁴ = 0.0025.
Next, we have to state the threshold below which we decide that the result is significant. This number is usually denoted by the symbol α. We will follow a convention common in statistics and set the threshold to 0.05. The next step is the translation of this statistical result into the language of linguistics. As 0.0025 < 0.05, we assume that we have, in our case, an unexpected phenomenon, which may be interpreted as a euphonic effect. Two circumstances should be mentioned: (1) The euphonic pattern we found is just one of the many potential aspects of euphony. (2) Formula (2.1.1) is not the final formula which could be used in every case. Consider e.g. the case that in our example only 3 occurrences of /a/ are found (as is in fact the case in verse 4). In this case, the probability of finding 3 or more /a/ in the verse must be calculated, i.e.

P(X ≥ 3) = C(4, 3) p³q + C(4, 4) p⁴
         = 4(0.2227)³(0.7773) + (0.2227)⁴
         = 0.0343 + 0.0025 = 0.0368,

a result which is still significant. In the case of shapeless repetitions, euphonic effects can only be assumed when a sound or phoneme is repeated at least once, i.e. when its frequency is at least two. Therefore, we can ask whether /m/ and /q/ show a tendency as well. For /m/ we have p̂ₘ = 0.0415; in the verse there are 7 consonants, among them 2 /m/. Therefore

P(X ≥ 2) = Σₓ₌₂⁷ C(7, x) (0.0415)ˣ (0.9585)⁷⁻ˣ.
We can facilitate the calculation by writing

P(X ≥ 2) = 1 − P(X < 2) = 1 − P(X = 0) − P(X = 1)
         = 1 − C(7, 0)(0.0415)⁰(0.9585)⁷ − C(7, 1)(0.0415)(0.9585)⁶
         = 1 − 0.7433 − 0.2253 = 0.0314,

which gives a significant result. We should, however, take into consideration that Indonesian /q/ cannot occur in all 7 places but only in 2, viz. word-finally. Hence,

P(X ≥ 2) = C(2, 2) p̂_q² = p̂_q².
The formula which we can use to calculate the euphonic significance of a sound is now given as

(2.1.2)  P(X ≥ x) = Σᵢ₌ₓⁿ C(n, i) pⁱ qⁿ⁻ⁱ = 1 − Σᵢ₌₀ˣ⁻¹ C(n, i) pⁱ qⁿ⁻ⁱ,

where p = the probability of the sound in a population, q = 1 − p, and n = the number of places in which the sound can occur. We will conclude that a sound is euphonically significant if this probability is smaller than 0.05. Calculations of this type can be performed not only for individual units; it is also possible to explore the euphonic effect of, e.g., the morpheme /buah/ or the syllable /rah/. On the basis of our results, indicators are easy to form, but we will refrain from presenting this kind of characterisation here. It would not be wrong to ask whether a euphonic repetition of semes can exist, e.g. with gradation. Rhyme, too, can be thought of as a phenomenon with a “eulexical” effect because in this case parts of words can be repeated which are not separable with respect to morphological or semantic criteria. We can also ask which effects can, in principle, be caused by the free repetition of a unit in a text.
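Formula (2.1.2) is easy to evaluate by machine. The following minimal Python sketch – our own illustration, with tail_prob as an assumed function name – reproduces the three calculations above:

```python
from math import comb

def tail_prob(x, n, p):
    """P(X >= x) for a binomial(n, p) variable, i.e. formula (2.1.2)."""
    q = 1.0 - p
    return sum(comb(n, i) * p**i * q**(n - i) for i in range(x, n + 1))

# /a/ at all 4 vowel positions of a verse (p estimated from prose)
print(round(tail_prob(4, 4, 0.2227), 4))   # 0.0025 -> significant
# 3 or more /a/ among 4 vowel positions (verse 4)
print(round(tail_prob(3, 4, 0.2227), 4))   # 0.0368 -> still significant
# 2 or more /m/ among the 7 consonant places
print(round(tail_prob(2, 7, 0.0415), 4))   # ~0.0315 -> significant
```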
2.2 Two variables

In the last section, we studied the shapeless repetition of an individual unit. Our method allows the construction of numerous indicators, as is done routinely in text analysis, and does not cause any particular problems. If, however, more than one unit is evaluated at the same time, problems arise with the formation of the indicators as well as with their application.
2.2.1 The activity index

In this section, we confine ourselves to showing how to work with two variables, and we illustrate the problem using the well-known activity index (“Aktionsquotient”, cf. Busemann 1925; Boder 1949; Schlismann 1948; Antosch 1953; Fischer 1969; Altmann 1978). Usually, this index is defined as

(2.2.1)  Q = v/a,

where v is the number of “active” words (verbs) in the text and a the number of adjectives. It is also possible to define the index as Q = a/v, as some authors do. When more verbs than adjectives are found in a text, it is classified as “active”; when the contrary is the case, the text is considered “descriptive”.

Let us first have a look at the linguistic interpretation of this indicator. The first open problem is the question as to the criteria for an “active” verb. Shall verbs such as “to sleep”, “to rest”, “to suffer” be counted as ones which express activity or vividness? Shall adverbs, which indicate qualities of verbs, be subsumed under “descriptive” means? In Goethe's “Erlkönig”, e.g., there are more descriptive adverbs than adjectives.
A purely quantitative inspection of the indicator shows that Q = 1 whenever the number of verbs equals the number of adjectives – regardless of the size of their sum. Therefore, this indicator reflects a kind of equilibrium between verbs and adjectives, not an isolated property. A text can, of course, be very active and very descriptive at the same time. Furthermore, the index expresses only one aspect of the dimension “active – descriptive”; these properties can also be characterised in other ways. The range of the indicator is ⟨0, ∞).

Most of the texts in Table 2.1 display activity in terms of the normalised indicator Q′ = v/(v + a) (values > 0.5). We will test whether this result is due to random fluctuations using criterion (2.2.21). Thus, we obtain for text 1

X² = (81 − 22)²/(81 + 22) = 3481/103 = 33.80,

which indicates high activity. The results for the individual texts are given in Table 2.2 (column 3). Column 4 gives the probability with which the calculated or an even more extreme X² value would be expected. As can easily be seen, the evaluation of an activity indicator does not depend only on its absolute value (according to which the texts are arranged in Table 2.2) but also on the size of the sample. Each text is characterised by a significant activity indicator except text 8, which shows significant descriptiveness, and text 9, which shows a balanced relation. At the same time it can be seen that the most significant result was found not for text 3, which has the largest Q′, but for text 1. This does not mean that text 1 should be attested a higher activity than text 3; rather, the decision about its activity was made with a much smaller risk of a wrong decision than for text 3.
Table 2.1: The activity indicator of some German texts after H. Fischer (1969)

Text                                                   Verbs   Adjectives   Q      Q′
1. G. Schwab: Des Odysseus Heimkehr nach Ithaka          81       22        3.68   0.79
2. W. von Ebner-Eschenbach: Die Nachbarn                 81       32        2.53   0.72
3. J.G. Herder: Die drei Freunde                         37        4        9.25   0.90
4. M. Pestalozzi: Der Abend vor einem Festtage im
   Hause einer rechtschaffenen Mutter                    54       12        4.50   0.82
5. J. Gotthelf: Jakobs Lehrjahre                         93       30        3.10   0.76
6. W. Raabe: Bekenntnis einer alten Mutter               43       11        3.91   0.80
7. J.P. Hebel: Kannitverstan                             88       48        1.83   0.65
8. M. Waser: Auf der See                                 49       82        0.60   0.37
9. A. Stifter: Die Lawine                                33       28        1.18   0.54
10. G. Keller: Karl Hedigers Schützenfestrunde           70       40        1.75   0.64
Table 2.2: Testing the activity indicator (each z value refers to the difference between the text in the given line and the one in the line above)

Text   Q′      X²      P         z for differences   P one-sided
8      0.37     8.31   0.0039
9      0.54     0.41   0.52      2.19                0.014
10     0.64     8.18   0.0042    1.22                0.111
7      0.65    11.76   0.0006    0.17                0.443
2      0.72    21.25   4·10⁻⁶    1.18                0.119
5      0.76    32.27   10⁻⁸      0.69                0.245
1      0.79    33.80   6·10⁻⁹    0.54                0.295
6      0.80    18.96   10⁻⁵      0.15                0.440
4      0.82    26.73   2·10⁻⁷    0.30                0.382
3      0.90    26.57   3·10⁻⁷    1.20                0.115
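The columns of Tables 2.1 and 2.2 can be reproduced with a few lines of Python. This is our own sketch; the quadratic form X² = (v − a)²/(v + a) for criterion (2.2.21) is inferred from the worked example for text 1, and the function name activity is ours:

```python
from math import erf, sqrt

def activity(v, a):
    """Q = v/a, Q' = v/(v+a), criterion X^2 = (v-a)^2/(v+a) (1 DF),
    and the tail probability of X^2 (square of a standard normal)."""
    x2 = (v - a) ** 2 / (v + a)
    p = 1.0 - erf(sqrt(x2 / 2.0))   # P(chi^2 with 1 DF >= x2)
    return v / a, v / (v + a), x2, p

for text, (v, a) in {1: (81, 22), 8: (49, 82)}.items():
    q, q_prime, x2, p = activity(v, a)
    print(text, round(q, 2), round(q_prime, 2), round(x2, 2), p)
# text 1: 3.68 0.79 33.80 P ~ 6e-9;  text 8: 0.60 0.37 8.31 P ~ 0.0039
```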
Although the texts have been ordered according to their activity values, no conclusion can be drawn on this basis as to whether the differences are significant or insignificant (random). We can apply criterion (2.2.23) or (2.2.29) to evaluate the differences between two adjacent Q′ or Q values in each case. According to (2.2.23a) we calculate for texts 8 and 9 for Q′

p̂ = (v₈ + v₉)/(n₈ + n₉) = (49 + 33)/(131 + 61) = 82/192 = 0.4271,
q̂ = 1 − p̂ = 0.5729,
V(p̂₈ − p̂₉) = p̂q̂(1/n₈ + 1/n₉) = 0.4271 · 0.5729 (1/131 + 1/61) = 0.0059

and

z = (0.5410 − 0.3740)/√0.0059 = 2.19.

The results of the tests for adjacent texts according to (2.2.29) can be found in Table 2.2, column 5. The only significant difference is the one between texts 8 and 9; all the others are insignificant. Table 2.3 contains all the z values for the differences between the texts 1 to 10. A cluster analysis of the texts could be performed by means of an appropriate taxonomic method (cf. Bock 1979).
Table 2.3: Tests for the differences between the texts with respect to their activity (first line = z, second line = P)

Text     2       3       4       5       6       7        8           9          10
1        1.18    1.65    0.50    0.54    0.15    2.38     6.89        3.41       2.44
         0.12    0.05    0.31    0.295   0.44    0.009    2.8·10⁻¹²   0.0003     0.007
2                2.45    1.53    0.69    1.10    1.18     5.67        2.36       1.29
                 0.007   0.063   0.245   0.14    0.12     7·10⁻⁹      0.009      0.10
3                        1.20    2.03    1.42    3.24     6.60        4.17       3.31
                         0.115   0.02    0.08    0.0006   2.1·10⁻¹¹   0.00002    0.0005
4                                0.98    0.30    0.30     6.47        3.52       2.61
                                 0.16    0.38    0.38     4.9·10⁻¹¹   0.0002     0.005
5                                        0.58    1.93     6.62        3.03       2.01
                                         0.28    0.03     1.8·10⁻¹¹   0.001      0.02
6                                                2.03     5.64        2.99       2.11
                                                 0.02     8·10⁻⁹      0.001      0.02
7                                                         4.61        1.41       0.17
                                                          2·10⁻⁶      0.08       0.43
8                                                                     2.19       4.18
                                                                      0.014      10⁻⁵
9                                                                                1.22
                                                                                 0.11
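Under the same assumptions, Table 2.3 can be regenerated from the verb and adjective counts of Table 2.1 with the z_diff() function sketched above; the dictionary below merely repeats those counts:

```python
counts = {1: (81, 22), 2: (81, 32), 3: (37, 4), 4: (54, 12), 5: (93, 30),
          6: (43, 11), 7: (88, 48), 8: (49, 82), 9: (33, 28), 10: (70, 40)}

for i in range(1, 10):
    row = [f"{z_diff(*counts[i], *counts[j]):5.2f}" for j in range(i + 1, 11)]
    print(i, "  ".join(row))
# e.g. z(1,2) = 1.18, z(7,10) = 0.17, z(8,9) = 2.18 -- matching the
# printed values up to rounding
```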
According to (2.2.29) we obtain for Q (again for texts 8 and 9)

z = (1.18 − 0.60) / √[0.60(1 + 0.60)²/131 + 1.18(1 + 1.18)²/61] = 0.58/0.32 = 1.81.
2.3 Global indicators

Section 2.2 was devoted to one of the possible ways to treat two variables which are somehow related to each other. In later sections, other kinds of models will be shown, in particular functions which represent two (or even more) variables, of which one depends on the others. There are always several kinds of models which could be used in a given situation; the choice depends, among other things, on the way in which the variables are related to each other. The selection of a statistical model should follow a thorough analysis of the research problem and the quantities under study. It is also very advisable to know the linguistic interpretation of these quantities, their numerical domains (intervals of possible values), their theoretical probability distribution and, where appropriate, necessary transformations.

Some earlier quantitative text analyses proposed quite complex indicators and applied them without such previous preparation. One example is F. Schmidt (1972) with his philologically plausible graduation of predicates in the sentence – (a) main, (b) secondary, (c) additional predicates – and a corresponding indicator α = ac/b², where the variables a, b, and c represent the numbers of the three predicate types. The index n/Σᵢαᵢ, i.e. the number of all predicates divided by the sum of the values αᵢ over all sentences, can be called the predication coefficient (Schmidt calls it 'style ratio'). While it is a good idea to quantitatively analyse a text with respect to its predicate types, Schmidt's indicator with its three independent variables does not meet any of the conditions for a useful index: the three independent variables can vary from 0 to infinity, so that the values of the index cannot be interpreted (would 100 be considered a large or a small value?), comparisons cannot be performed, and conclusions cannot be drawn.

Still more variables are involved, e.g., in Birkhoff's indicator of the 'musical quality' of a poem: M = (aa + 2r + 2m − 2ae − 2ce)/C, where the individual components of the formula are some numbers, e.g. “ce = number of consonants minus 2 times the number of vowels”, or “ae = number of sounds which are directly linked with more than two preceding main sounds multiplied by the number of sounds of a syllable following directly an identical syllable if it does not belong to the same word, and multiplied by the number of sounds belonging to one period if to this sequence two preceding sounds belong” (quoted from Gunzenhäuser 1969: 302-303). The sampling distribution of this indicator would be so complicated that its meaningfulness may be doubted. We shall not analyse indicators of this kind here because it is possible to set up simpler, interpretable indicators.

In the following, we consider shapeless repetition of all elements of a text, i.e. those that cover the whole text. In quantitative text analysis, two kinds of statistically well tractable indicators have been established, viz. global indicators such as the average, entropy, the repeat rate, “Yule's indicator”, Ord's criteria, Popescu's indicators, etc., and probability distributions which capture the distribution of a (usually discrete) property in text. We shall consider merely a selection because most of these indicators and their usage can be found elsewhere in the literature.
2.3.1 Entropy

Since the establishment of information theory, entropy has always been considered a property of a language, of a text, or even of other artistic products. Entropy has been measured for different text units and given such a huge number of possible interpretations that even the pure enumeration of such studies would take some dozens of pages. These interpretations are of poetic, stylistic, aesthetic, communication-theoretical etc. character and represent a secondary, object-related re-interpretation of the fundamental meaning of entropy, viz. the non-uniformity of the distribution of the relative frequencies of a set of textual units. Usually, Shannon's version of entropy, defined as

H = − Σᵢ pᵢ ld pᵢ,

is applied, where pᵢ is the probability of the i-th textual unit in an examined set of n text units. The logarithm in this version has base 2 (ld = log₂). The pᵢ values are usually estimated by their relative frequencies, i.e. as p̂ᵢ = fᵢ/N, where fᵢ = the absolute frequency of unit i and N = the sum of all frequencies in the text, i.e. N = Σfᵢ. This is, of course, not the only indicator of entropy (cf. Esteban, Morales 1995). Instead of the dual logarithm we shall use here the natural logarithm in order to save ourselves recalculations and simplify transformations, that is, we define

(2.3.1)  H₁ = − Σᵢ pᵢ ln pᵢ.

Writing the relative frequencies as above, we obtain from this

(2.3.2)  H₁ = ln N − (1/N) Σᵢ fᵢ ln fᵢ.
The measure of entropy can be applied both to qualitative and quantitative data.

Examples. (1) Grotjahn (1979: 175) determined the number of words with a given length (measured in terms of the number of syllables) in Goethe's “Erlkönig” and obtained the results presented in Table 2.4.

Table 2.4: Distribution of word lengths (in syllable numbers) in Goethe's “Erlkönig” (according to Grotjahn 1979)

Number i of syllables   No. of words with length i (= fᵢ)
1                       152
2                        55
3                         6
4                         2
Σ                       215
Using formula (2.3.2) we obtain

H₁ = ln 215 − (152 ln 152 + 55 ln 55 + 6 ln 6 + 2 ln 2)/215.

The result can be computed with a pocket calculator or a computer, or looked up in tables of x ln x, cf. Kullback, Kupperman, Ku (1962). We obtain

H₁ = 5.3706 − (763.6298 + 220.4033 + 10.7506 + 1.3863)/215 = 0.7373.

This number can be interpreted here as the diversity of word lengths. The number itself does not tell us whether it is great or small. Hence it is reasonable to normalise H₁. What is the variation interval of H₁? Let us assume that we have n classes (there were n = 4 classes in the “Erlkönig”) and each class has the same probability pᵢ = 1/n. In that case (2.3.1) yields

(2.3.3)  H₁ = − Σᵢ₌₁ⁿ (1/n) ln(1/n) = −n(1/n)(−ln n) = ln n.

But if one class has the probability p = 1 and all the others have pᵢ = 0, we obtain

H₁ = −1 ln 1 − 0 ln 0 − … − 0 ln 0 = 0,

where we set 0 ln 0 = 0. Hence H₁ varies in the interval ⟨0, ln n⟩. Usually, ln n is considered as H₀ or as Hmax. Thus if we set

H_rel = H₁/H₀,

we obtain the relative entropy. In our case it is

H_rel = 0.7373/ln 4 = 0.7373/1.3863 = 0.5318.
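The computation of H₁ and H_rel is easily scripted; the following minimal Python sketch (the function name entropy is ours) reproduces the “Erlkönig” values:

```python
from math import log

def entropy(freqs):
    """Natural-log entropy H1 by (2.3.2) and relative entropy H1/ln(n)."""
    n, N = len(freqs), sum(freqs)
    h1 = log(N) - sum(f * log(f) for f in freqs if f > 0) / N
    return h1, h1 / log(n)

h1, h_rel = entropy([152, 55, 6, 2])   # word lengths in "Erlkönig"
print(round(h1, 4), round(h_rel, 4))   # 0.7373 0.5318
```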
For the sake of comparison, we show some entropy values of word length distributions computed by Grotjahn (1979: 174-177) (cf. Table 2.5, columns 2 and 3). For Var(H), see Section 2.3.2.

(2) Drobisch (1866) counted the frequency of individual verse types in Vergil's texts and obtained the results presented in Table 2.6, with Grotjahn's correction (1979: 207). Here S = spondee, D = dactyl, and the last two feet are omitted. The entropy is

H₁ = ln 1760 − (118 ln 118 + … + 38 ln 38)/1760 = 2.5958

and

H₁,rel = 2.5958/2.7726 = 0.9362.
Table 2.5: Entropy values of word length distributions (according to Grotjahn 1979)

Text                                 H₁       H₁,rel   Var(H)
Goethe: Erlkönig                     0.7373   0.5318   0.00268760
Der Totentanz                        0.8693   0.6270   0.00015440
Letter No. 589                       1.0561   0.6562   0.00150811
Letter No. 591                       0.9730   0.7019   0.00005172
Letter No. 596                       1.1580   0.7195   0.00167903
Letter No. 605                       0.9716   0.5423   0.00086099
Letter No. 612                       1.1118   0.6908   0.00149714
Letter No. 641                       1.0027   0.6230   0.00120948
Letter No. 644                       1.2021   0.6177   0.00088160
Letter No. 647                       1.0872   0.6068   0.00155308
Letter No. 659                       0.9599   0.5940   0.00269397
Letter No. 667                       1.2652   0.7861   0.00270826
Schiller: Die Kraniche des Ibycus    0.9990   0.6207   0.00052240
Lucretius: De rerum natura           1.3521   0.7546   0.00004468
Horatius: De arte poetica            1.3454   0.8359   0.00009340
Caesar: De Bello Gallico             1.5964   0.8204   0.00033559
Sallustius: Bellum Iugurthinum       1.4431   0.8054   0.00039049
Table 2.6: Verse types with Vergil (Drobisch 1866; Grotjahn 1979)

Verse type   Number
SSSS         118
SSSD          51
SSDS          89
SDSS         178
DSSS         263
SSDD          32
SDSD          66
SDDS          99
DSSD         110
DSDS         199
DDSS         208
SDDD          36
DSDD          65
DDSD          80
DDDS         128
DDDD          38
The greater the entropy, the more uniformly the frequencies are distributed. We see that in the first example the Latin writers have higher entropy values than the German ones, but one cannot draw conclusions from a mere look at the variation of the values. In the hexameters by Vergil, H₁,rel = 0.9362, i.e. a value which may tempt one to intuitively conclude uniformity. But below we shall show that this is not the case. Here another, secondary interpretation of entropy may come to one's mind: if the value is very small, then most frequencies are assembled in one unique class; the other ones are rare or do not occur at all. If, for example, all hexameters had the pattern DDDD, then the poem would be rhythmically monotonous. If the patterns alternate, monotony disappears. Hence, entropy is an indicator of monotony or stereotypy: the smaller the entropy, the greater the monotony; the greater the entropy, the more heterogeneous is the text and the more the classes tend towards a uniform distribution.

The question whether uniformity may be accepted or not can be tested by means of the chi-square test for homogeneity:

(2.3.5)  X² = Σᵢ₌₁ⁿ (fᵢ − Eᵢ)²/Eᵢ,

where the fᵢ are the observed and the Eᵢ the expected frequencies of the classes. Since we test for uniformity, each Eᵢ = N/n; hence (2.3.5) can be written as

(2.3.6)  X² = (n/N) Σᵢ₌₁ⁿ fᵢ² − N.

Computing this value for Vergil's verse types, we obtain

X² = [16/1760](118² + 51² + … + 128² + 38²) − 1760 = 650.85.

In this case, the chi-square test is conducted with 15 degrees of freedom; therefore, we can safely conclude that a uniform distribution is not given. If H₁ is computed instead of H₁,rel, (2.3.5) need not be computed because (2.3.7) can be used as an approximation:

(2.3.7)  X² = 2N(H₀ − H₁)

or, the other way round, if one has already computed (2.3.5), H₁ can be obtained as

(2.3.8)  H₁ = H₀ − X²/(2N)

(cf. Altmann, Lehfeldt 1980: 176-178). With Vergil we obtain X² from (2.3.7) as

X² = 2(1760)(2.7726 − 2.5958) = 622.34,

and in the reverse direction we obtain, using (2.3.8), H₁ = 2.5877.
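A quick machine check of formula (2.3.6) on the Vergil data of Table 2.6 (our own sketch, with chi2_uniform as an assumed name):

```python
def chi2_uniform(freqs):
    """X^2 test for uniformity, formula (2.3.6): (n/N) * sum(f_i^2) - N."""
    n, N = len(freqs), sum(freqs)
    return n / N * sum(f * f for f in freqs) - N

vergil = [118, 51, 89, 178, 263, 32, 66, 99, 110, 199,
          208, 36, 65, 80, 128, 38]
print(round(chi2_uniform(vergil), 2))   # 650.85, 15 DF -> clearly non-uniform
```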
2.3.2 Comparison of two entropy values

Two texts can be compared with respect to their entropy values by means of the t-test as proposed by Hutcheson (1970). The t criterion for this test is computed as

(2.3.9) $t = \frac{H_1 - H_2}{\sqrt{Var(H_1) + Var(H_2)}}$,
where H1 and H2 are the respective entropies of two texts. Var(H) can be computed according to the formula (cf. Miller, Madow 1954/1963; Basharin 1959; Bowman, Hutcheson, Odum, Shenton 1969; Hutcheson 1970)
(2.3.10) $Var(H) = \frac{\sum_i p_i \ln^2 p_i - H^2}{N} + O\!\left(\frac{1}{N^2}\right)$.
The elements of order O(1/N²) in the expansion of Var(H) are usually omitted, especially if N is large. The number of degrees of freedom can be computed as
(2.3.11) $DF = \frac{(Var(H_1) + Var(H_2))^2}{\dfrac{Var(H_1)^2}{N_1} + \dfrac{Var(H_2)^2}{N_2}}$.
Example. Let us consider the entropy values of the word lengths in “Totentanz” and “Erlkönig” by Goethe. From Table 2.5 we take

$t = \frac{0.8693 - 0.7373}{\sqrt{0.00015440 + 0.00268760}} = 2.48$.

The degrees of freedom follow from formula (2.3.11) with N1 = 342 and N2 = 225:

$DF = \frac{(0.00015440 + 0.00268760)^2}{\dfrac{0.00015440^2}{342} + \dfrac{0.00268760^2}{225}} \approx 251$.
As can easily be seen in the tables of the t-distribution, the difference is significant. We can interpret the result as a difference in the global diversity of the word length distributions in the two texts. The word lengths in “Totentanz” vary more strongly than in “Erlkönig”. ■■■ One can also show that the Latin authors, except for Lucretius and Horace, differ significantly. The expression Σpi ln² pi can be computed if one sets pi = fi/N. In that case we obtain
$\sum_i p_i \ln^2 p_i = \frac{1}{N}\sum_i f_i \ln^2 f_i - \frac{2\ln N}{N}\sum_i f_i \ln f_i + \ln^2 N$.
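As a hedged illustration (our sketch, not the book's program), the whole comparison can be wrapped into two small functions; the O(1/N²) terms of (2.3.10) are omitted, as in the text.

    import math

    def entropy_var(freqs):
        """Entropy H and Var(H) by (2.3.10), higher-order terms omitted."""
        N = sum(freqs)
        p = [f / N for f in freqs if f > 0]
        H = -sum(pi * math.log(pi) for pi in p)
        var = (sum(pi * math.log(pi) ** 2 for pi in p) - H ** 2) / N
        return H, var, N

    def entropy_t_test(freqs1, freqs2):
        """Hutcheson's t (2.3.9) and degrees of freedom (2.3.11)."""
        H1, v1, N1 = entropy_var(freqs1)
        H2, v2, N2 = entropy_var(freqs2)
        t = (H1 - H2) / math.sqrt(v1 + v2)
        df = (v1 + v2) ** 2 / (v1 ** 2 / N1 + v2 ** 2 / N2)
        return t, df

    # e.g., Goethe's "Totentanz" vs. Schiller (word length data from Table 2.7)
    print(entropy_t_test([218, 99, 21, 4], [580, 296, 97, 24, 1]))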
2.3.3 The repeat rate

A further, commonly applied indicator of dispersion is Simpson's index (Simpson 1949) or Herfindahl's concentration measure, which is applied in various disciplines. It was introduced into linguistics by Herdan (e.g. 1962: 36-40; 1966: 271-273) and is used today in many textological studies. It is defined as
(2.3.12) $R = \sum_{i=1}^{n} p_i^2$,
where the probabilities are estimated from the relative frequencies pi = fi/N. Above, we defined the chi-square criterion as

(2.3.13) $X^2 = \frac{n}{N}\sum_{i=1}^{n} f_i^2 - N$.
Thus we see at once that

(2.3.14) $R = \frac{X^2 + N}{nN}$
or conversely

(2.3.15) $X^2 = N(nR - 1)$.
Replacing X² in (2.3.14) by its counterpart in entropy (2.3.7) yields

(2.3.16) $R = \frac{2(H_0 - H_1) + 1}{n}$.
These correspondences show that it is sufficient to use one of these indicators (for other ones and their interrelations see Altmann, Lehfeldt 1980: 181; Bowman, Hutcheson, Odum, Shenton 1971). Example. For the data in Table 2.6 we obtained X² = 650.85. Since the relation (2.3.14) is exact, we obtain from it R = (650.85 + 1760)/[16(1760)] = 0.0856, while the approximate relation (2.3.16) yields R = [2(2.7726 − 2.5958) + 1]/16 = 0.0846. ■■■ The repeat rate is interpreted in different ways. Besides the above meanings it shows the uniformity of the distribution of units: the more similar the frequencies are, the smaller is R, and vice versa; hence it also measures the stereotypy of the text. If the pi are taken as the coordinates of a text, then R is the squared distance of the text from the origin. R varies within the interval [1/n, 1]: if all probabilities are equal, we obtain
R = Σ(1/n)² = n(1/n)² = 1/n, and if one unit has probability 1 and the other ones have 0, then R = 1. Hence one can relativise (normalise) R in the form
(2.3.17) $R_{rel} = \frac{1 - R}{1 - \frac{1}{n}}$
or, according to McIntosh (1967),

(2.3.18) $R_{rel,Mc} = \frac{1 - \sqrt{R}}{1 - \frac{1}{\sqrt{n}}}$.
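A minimal sketch (ours) of these indicators, again using the Vergil verse-type data of Table 2.6:

    import math

    f = [118, 51, 89, 178, 263, 32, 66, 99, 110, 199,
         208, 36, 65, 80, 128, 38]
    N, n = sum(f), len(f)

    R = sum((x / N) ** 2 for x in f)                          # (2.3.12)
    R_rel = (1 - R) / (1 - 1 / n)                             # (2.3.17)
    R_mc = (1 - math.sqrt(R)) / (1 - 1 / math.sqrt(n))        # (2.3.18)
    print(round(R, 4), round(R_rel, 4), round(R_mc, 4))       # R = 0.0856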
2.3.4 Moments of distributions

If the examined text unit is a quantitative variable taking integer or real values, then it is usually possible to ascertain the frequencies of its individual values in texts. In linguistics and text analysis, even rank variables (ordinal variables, as known e.g. from the rank-frequency form of the Zipf-Mandelbrot law) are studied in this way. Frequency distributions allow us to set up further global indicators in order to characterise and to compare texts. We shall present some of them in the sequel. For the sake of illustration we use the distributions stated by Grotjahn (1979) and presented in Table 2.7.

Table 2.7: Distributions of word length (measured in terms of syllable numbers; according to Grotjahn 1979: 177)

Number of syllables   Number of words with x syllables
in the word x         Caesar:            Sallust:             Goethe:          Schiller:
                      De Bello Gallico   Bellum Iugurthinum   Der Totentanz    Die Kraniche des Ibycus
1                     184                122                  218              580
2                     204                249                   99              296
3                     194                196                   21               97
4                     125                110                    4               24
5                      54                 24                    –                1
6                      13                  1                    –                –
7                       1                  –                    –                –
Σ                     775                702                  342              998
Mean

The mean of a discrete distribution is defined as

(2.3.19) $\bar{x} = \frac{1}{N}\sum_{x=1}^{n} x f_x$,
where N = Σfx and n is the number of classes. We obtain for the data from Goethe's Totentanz in Table 2.7
$\bar{x}$ = [1(218) + 2(99) + 3(21) + 4(4)]/342 = 1.4474. The mean is a measure of location stating the place of the distribution on the x-axis.
Variance

This indicator measures the dispersion of the distribution around the mean. It can be computed as

(2.3.20) $s^2 = \frac{1}{N}\sum_{x=1}^{n}(x - \bar{x})^2 f_x$.
For computational purposes one can expand the square and obtain

(2.3.21) $s^2 = \frac{1}{N}\left[\sum_{x=1}^{n} x^2 f_x - \frac{\left(\sum_{x=1}^{n} x f_x\right)^2}{N}\right]$.
Correspondingly, we obtain for Goethe in Table 2.7 Σxfx = 495 and Σx²fx = 1(218) + 2²(99) + 3²(21) + 4²(4) = 867, from which

$s^2 = \frac{1}{342}\left(867 - \frac{495^2}{342}\right) = 0.4402$
results. For purposes of statistical tests, the unbiased estimator of the variance is used, which differs from (2.3.21) only by the denominator, which is (N - 1) instead of N, i.e.
(2.3.22) $s^2 = \frac{1}{N-1}\left[\sum_{x=1}^{n} x^2 f_x - \frac{\left(\sum_{x=1}^{n} x f_x\right)^2}{N}\right]$.
This is also the form which most statistical software packages and embedded functions in other programs work with. For large N, the difference is minimal; in our example, s² = 0.4415. Instead of the variance, the so-called standard deviation, i.e. the square root of s², is preferred when the dispersion of a variable in a text has to be characterised. The difference between the two estimators is again minimal: √0.4402 = 0.6635 and √0.4415 = 0.6645.
Moments

Before we introduce other indicators, we define the moments of a distribution. The initial moments are defined as

(2.3.23) $m'_r = \frac{1}{N}\sum_{x=1}^{n} x^r f_x$
and the central moments as

(2.3.24) $m_r = \frac{1}{N}\sum_{x=1}^{n}(x - \bar{x})^r f_x$.
Evidently m'1 = x̄ and m2 = s². The central moments can be expressed by means of initial moments:

$m_2 = s^2 = \frac{1}{N}\sum_{x=1}^{n} x^2 f_x - \left(\frac{1}{N}\sum_{x=1}^{n} x f_x\right)^2 = m'_2 - m'^2_1$.
In the same way we obtain
(2.3.25) $m_3 = \frac{1}{N}\sum(x - \bar{x})^3 f_x = \frac{1}{N}\sum(x^3 - 3x^2\bar{x} + 3x\bar{x}^2 - \bar{x}^3)f_x = m'_3 - 3m'_2 m'_1 + 3m'^3_1 - m'^3_1 = m'_3 - 3m'_2 m'_1 + 2m'^3_1$,

as follows from the definition of the initial moments. In the same way one obtains
(2.3.26) $m_4 = m'_4 - 4m'_3 m'_1 + 6m'_2 m'^2_1 - 3m'^4_1$.
With the aid of the central moments, we define the skewness or asymmetry as

(2.3.27) $\gamma_1 = \frac{m_3}{m_2^{3/2}} = \frac{m_3}{s^3}$

and the excess as

(2.3.28) $\gamma_2 = \frac{m_4}{m_2^2} - 3 = \frac{m_4}{s^4} - 3$.
Example. We illustrate the computation on Goethe's data from Table 2.7. We obtain
Σxfx = 495 (as above)
Σx²fx = 867 (as above)
Σx³fx = 1(218) + 2³(99) + 3³(21) + 4³(4) = 1833
Σx⁴fx = 1(218) + 2⁴(99) + 3⁴(21) + 4⁴(4) = 4527.
From these numbers we obtain
m'1 = 495/342 = 1.4474 (as above)
m'2 = 867/342 = 2.5351 (as above)
m'3 = 1833/342 = 5.3596
m'4 = 4527/342 = 13.2368,
hence the central moments are as follows:
m2 = 0.4402 (as above)
m3 = 5.3596 − 3(2.5351)(1.4474) + 2(1.4474)³ = 0.4162
m4 = 13.2368 − 4(5.3596)(1.4474) + 6(2.5351)(1.4474)² − 3(1.4474)⁴ = 0.9059.
Skewness and excess can now be computed as
$\gamma_1 = \frac{0.4162}{0.4402^{3/2}} = 1.4248$

$\gamma_2 = \frac{0.9059}{0.4402^2} - 3 = 1.6749$.
The resulting values are interpreted according to the following scheme:

        < 0               = 0         > 0
γ1      left asymmetric   symmetric   right asymmetric
γ2      flat              normal      steep
Hence “Totentanz” by Goethe is slightly right asymmetric and steep, a fact that can be seen directly from the frequencies. In Table 2.8 the four characteristic indicators of the texts from Table 2.7 can be found.

Table 2.8: Indicators of the texts from Table 2.7

Text        x̄        s²       γ1       γ2
Caesar      2.6181   1.6786   0.4923   -0.4546
Sallust     2.5271   1.1325   0.3619   -0.4972
Goethe      1.4474   0.4402   1.4248    1.6749
Schiller    1.5671   0.5962   1.2834    1.1127
The numbers in Table 2.8 seem to indicate that the distributions determined in the texts by the Latin authors are “different” from those of the German ones. Such a subjective impression can be made objective by using a statistical test. ■■■
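The four indicators can be obtained mechanically from any frequency distribution; the following sketch (ours, not the book's) reproduces the Goethe row of Table 2.8.

    def shape_indicators(fx):
        """fx maps a value x to its frequency; returns mean, s2, gamma1, gamma2."""
        N = sum(fx.values())
        m = [sum(x ** r * f for x, f in fx.items()) / N for r in (1, 2, 3, 4)]
        mean = m[0]
        m2 = m[1] - mean ** 2
        m3 = m[2] - 3 * m[1] * mean + 2 * mean ** 3                         # (2.3.25)
        m4 = m[3] - 4 * m[2] * mean + 6 * m[1] * mean ** 2 - 3 * mean ** 4  # (2.3.26)
        return mean, m2, m3 / m2 ** 1.5, m4 / m2 ** 2 - 3                   # (2.3.27), (2.3.28)

    goethe = {1: 218, 2: 99, 3: 21, 4: 4}        # "Der Totentanz", Table 2.7
    print([round(v, 4) for v in shape_indicators(goethe)])
    # [1.4474, 0.4402, 1.4248, 1.675]  (cf. Table 2.8)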
2.3.4.1 Comparison of two means

Two means can be compared in various ways, each of which focuses on a different aspect of the problem. It is not possible to proceed in every case in the same way, but if we have large sample sizes (as is common in text analysis) some of the tests can surely be used. (1) For a comparison of two texts which do not differ with respect to language, text sort, and author, one can suppose that the variances are equal (but unknown), and the test criterion

(2.3.29) $t = \frac{\bar{x}_1 - \bar{x}_2}{s\sqrt{\frac{1}{N_1} + \frac{1}{N_2}}}$

is appropriate, where

$s^2 = \frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2}$.
Here, t is the Student variable with N1 + N2 − 2 degrees of freedom. If the computed t is greater (in absolute value) than the theoretical one, we reject the null hypothesis of equality. Obviously, N can be used instead of N − 1 when the data sets are large. If we compare the works by Sallust and Caesar using the numbers from Tables 2.7 and 2.8, we obtain

$s^2 = \frac{774(1.6786) + 701(1.1325)}{1475} = 1.4191, \quad s = 1.1912$

and

$t = \frac{2.6181 - 2.5271}{1.1912\sqrt{\frac{1}{775} + \frac{1}{702}}} = 1.47$.
The number of degrees of freedom (1475) is here practically infinite, hence we can use the normal distribution. The resulting probability is P = 0.14, thus one can assume that the means are equal. (2) If one cannot assume the equality of variances, one computes
(2.3.30) $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}}$
and the number of degrees of freedom using the Welch method as
(2.3.31) $DF = \frac{\left(\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}\right)^2}{\frac{(s_1^2/N_1)^2}{N_1 - 1} + \frac{(s_2^2/N_2)^2}{N_2 - 1}}$.
Various other methods can be found in Sachs (1972). Using this method to compare Sallust and Caesar, we obtain

$t = \frac{2.6181 - 2.5271}{\sqrt{\frac{1.6786}{775} + \frac{1.1325}{702}}} = 1.48$.

The number of degrees of freedom follows from (2.3.31) as

$DF = \frac{\left(\frac{1.6786}{775} + \frac{1.1325}{702}\right)^2}{\frac{(1.6786/775)^2}{774} + \frac{(1.1325/702)^2}{701}} \approx 1461$.

Both results are almost identical. Testing the difference between Caesar and Goethe by the second method we obtain
$t = \frac{2.6181 - 1.4474}{\sqrt{\frac{1.6786}{775} + \frac{0.4402}{342}}} = 19.92, \quad DF \approx 1092$.

Also here one can consider the number of degrees of freedom as infinite; the difference between the means is clearly significant. As shown by Grotjahn (1982), it is, from the linguistic point of view, not very reasonable to test the difference between the other indicators. In general, the amount of data obtained from text analyses is very large, such that the classical F-test fails and always indicates significant differences between variances. A transformation to a normal variable does not help either if the dispersion in the data is too small, as was the case in our examples. It would, perhaps, be more appropriate to consider the above indicators as elements of two vectors and use other methods (cf. e.g. Popescu et al. 2010: 26ff.; Popescu, Mačutek, Altmann 2009).
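A sketch (ours) of the Welch variant (2.3.30)-(2.3.31); the inputs are the indicator values of Table 2.8 together with the sample sizes of Table 2.7.

    import math

    def welch_t(mean1, var1, n1, mean2, var2, n2):
        """t by (2.3.30) and Welch degrees of freedom by (2.3.31)."""
        a, b = var1 / n1, var2 / n2
        t = (mean1 - mean2) / math.sqrt(a + b)
        df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
        return t, df

    # Caesar vs. Sallust: t ~ 1.48, DF ~ 1461 -> not significant
    print(welch_t(2.6181, 1.6786, 775, 2.5271, 1.1325, 702))
    # Caesar vs. Goethe: t ~ 19.92 -> clearly significant
    print(welch_t(2.6181, 1.6786, 775, 1.4474, 0.4402, 342))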
2.3.4.2 Comparing two distributions

Instead of testing the difference of individual indicators, a global comparison of two distributions can be performed. Such a test checks whether the frequencies in both data sets are distributed over the individual classes in the same way; it is called a test for homogeneity. Here we shall present two equivalent tests for homogeneity and illustrate them on the data from Goethe and Schiller in Table 2.7. The data are presented in Table 2.9.

Table 2.9: Data for the test for homogeneity

i      Goethe fi1    Schiller fi2    fi.
1      218           580             798
2       99           296             395
3       21            97             118
4        4            24              28
5        –             1               1
       f.1 = 342     f.2 = 998       N = 1340
In the following, fij denotes the frequency in cell (i, j), where i stands for the row and j for the column (i = 1, …, 5; j = 1, 2). The marginal frequencies fi. are the sums of the rows and the f.j are the sums of the frequencies in the columns; N is the total sum of all frequencies. The chi-square test for homogeneity is given by the formula
(2.3.37) $X^2 = \sum_{j=1}^{2}\sum_{i=1}^{n}\frac{(f_{ij} - f_{i.}f_{.j}/N)^2}{f_{i.}f_{.j}/N}$.
For computational purposes the above formula can be written as

(2.3.38) $X^2 = \frac{N^2}{f_{.1}f_{.2}}\sum_{i=1}^{n}\frac{f_{i1}^2}{f_{i.}} - \frac{N f_{.1}}{f_{.2}}$,
and the resulting value is distributed as a chi-square variable with n - 1 degrees of freedom. Inserting the values from Table 2.9 we obtain
$X^2 = \frac{1340^2}{342(998)}\left(\frac{218^2}{798} + \frac{99^2}{395} + \frac{21^2}{118} + \frac{4^2}{28} + \frac{0^2}{1}\right) - \frac{1340(342)}{998} = 466.50 - 459.20 = 7.30$.

The chi-square characteristic with 4 degrees of freedom at α = 0.05 has the value 9.49. Since our computed value is smaller, we conclude that the two samples are homogeneous. For Sallust and Caesar, we obtain X² = 33.30 with 6 degrees of freedom, which signals a highly significant difference (P = 9·10⁻⁶). Hence works of the same text sort need not be homogeneous. Another way to perform this test is the information statistic
(2.3.39) $2I = 2\sum_i\sum_j f_{ij}\ln\frac{N f_{ij}}{f_{i.}f_{.j}}$,
which can be presented for computational purposes as

(2.3.40) $2I = 2\left[\sum_i\sum_j f_{ij}\ln f_{ij} + N\ln N - \sum_i f_{i.}\ln f_{i.} - \sum_j f_{.j}\ln f_{.j}\right]$.
From this formula, (2.3.37) can be obtained as an approximation by a Taylor expansion. For the data from Table 2.9 we obtain
2ΣΣ fij ln fij = 2(218 ln 218 + 99 ln 99 + 21 ln 21 + 4 ln 4 + 580 ln 580 + 296 ln 296 + 97 ln 97 + 24 ln 24 + 1 ln 1) = 15186.27979
2N ln N = 2(1340) ln 1340 = 19297.13871
2Σ fi. ln fi. = 2(798 ln 798 + 395 ln 395 + 118 ln 118 + 28 ln 28 + 1 ln 1) = 16700.45010
2Σ f.j ln f.j = 2(342 ln 342 + 998 ln 998) = 17774.89408.

Inserting these numbers in (2.3.40) we obtain 2I = 15186.27979 + 19297.13871 − 16700.45010 − 17774.89408 = 8.07. As can be seen, 2I differs only slightly from X². According to Ku's proposal (1963), one should subtract 1 for each empty cell. Since the class i = 5 in the Goethe data is empty, we correct 2I to 2Icorr = 8.07 − 1 = 7.07,
which is even closer to the chi-square value. Several other tests are available, which should be considered as alternatives when the chi-square fails to give reliable results, i.e. in particular with very large data sets. Some of them are the following ones: 1. Total Variation Distance
(2.3.41) $TVD(P, Q) = \frac{1}{2}\sum_{i=1}^{n}|P_i - Q_i|$,
which measures the similarity between two distributions P and Q, of which one can also be the distribution of the expected values while the other one is the observed distribution (cf. Adams, Clarkson 1933). 2. Kullback-Leibler Divergence
(2.3.42) $D_{KL}(P, Q) = \sum_{i=1}^{n} P_i \ln\frac{P_i}{Q_i}$
(cf. Kullback, Leibler 1951). 3. Hellinger distance
(2.3.43) $HD(P, Q) = \frac{1}{\sqrt{2}}\sqrt{\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2}$
(cf. e.g., Pollard 2002). More detailed information on these and other measures and their properties can be found in Csiszár, Shields (2004); Liese, Vajda (2006), and Mačutek, Wimmer (2014).
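The three measures are straightforward to implement. The following sketch (ours) assumes two probability vectors of equal length and uses the standard definitions, including the factor 1/2 in (2.3.41) and 1/√2 in (2.3.43).

    import math

    def tvd(P, Q):
        """Total variation distance (2.3.41)."""
        return 0.5 * sum(abs(p - q) for p, q in zip(P, Q))

    def kl_divergence(P, Q):
        """Kullback-Leibler divergence (2.3.42); terms with P_i = 0 contribute 0."""
        return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

    def hellinger(P, Q):
        """Hellinger distance (2.3.43)."""
        s = sum((math.sqrt(p) - math.sqrt(q)) ** 2 for p, q in zip(P, Q))
        return math.sqrt(s / 2)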
2.4 Modelling using probability distributions

Characterising texts by means of indicators and comparing two indicator values are inductive methods, which can describe the form of a text to a certain extent. In this way, it is possible to draw a phenomenological image of the text and often also to reconstruct the surface structure of the given text. If two or more different variables (textual units) are counted (measured), factor analysis (principal component analysis) or other multivariate methods can help to study the interrelations between these variables (cf. Carroll 1960; Sommers 1962), whose results may even inspire hypotheses which may one day turn out to be useful for the construction of a text theory. Even a successful investigation of the 'phenomenology' of texts and an extensive description of the properties of these texts and of their correlations are still far from the attainment of a theory. Empirical studies and descriptions do not provide any knowledge of the mechanisms which are responsible for the production of the observed configurations of properties. In other words: what is still missing is the possibility to explain the observed and described properties, patterns, and structures, i.e. the laws of text. The way which leads to finding and formulating laws is long and may be hard. More often than not, the quest begins with a 'phenomenological' collection and registration of regularities. Three options are available: (i) Empirical generalisation. This inductive strategy starts with some observations and a set of data. On this basis, a mathematical model as simple as possible is selected and its parameters are fitted to the data such that the (squared sum of) deviations between the results of the calculations and the empirical data become as small as possible. This strategy may be useful when merely a description of the observations is needed, or as a means for extrapolation (not to be confused with prediction) as applied in areas such as engineering, finance, medicine, and insurance when theoretically justified models are not available. The disadvantage of this method is that such a model has almost no chance to be integrated into a theory because it has no connection to the theoretical body of knowledge of the discipline. And this is also exactly the reason why the validity and reliability of such a model are untrustworthy. In rare cases, however, an a posteriori systematisation may succeed; one of these cases was a function proposed by Piotrowskaja and Piotrowski (1974) as an approximation to the temporal development of linguistic elements. This function could be theoretically justified in 1983 (cf. Altmann, v. Buttlar, Rott, Strauss). (ii) Analogy. A mathematical model is imported from another field of research or application because it is hoped to be general enough to cover also the given purpose. If that model can be interpreted from the point of view of the given task or field of research or application, it has a chance to prevail. The well-known Zipf-Mandelbrot law is a model of this kind; it was borrowed from linguistics and transferred to musicology and to the theory of fine arts (cf. Orlov, Boroda, Nadarejšvili 1982). (iii) The ideal case is a "representational" mapping (cf. Bunge 1967/1998) of the regularity under study. Here, one starts from a background body of knowledge, assumes a generating mechanism, and deduces a function from that
mechanism which fits well with the observed data. If such a model can be integrated into a system of statements of the same kind (plausible hypotheses with empirical support), one can call it a law (candidate) (cf. Bunge 1967/1998). This ideal way is not easy to go in text analysis, as only a few laws are known so far and a huge number of methods are available to study even a single phenomenon. We will discuss in this section one form of mathematical model of text phenomena, viz. a special kind of function: probability distributions. The main particularity of this kind of function is that it maps each value of the independent variable, the so-called 'random variable', onto a probability. Every kind of variable can play the role of a random variable, i.e. not only numerical ones but also ordinal and even categorical ones such as 'part-of-speech' or 'type of syntactic construction'. It is very doubtful whether a single function or probability distribution can represent every textual unit. There exist, of course, probability distributions and other functions with a rather general applicability (e.g. the Sichel distribution; cf. Sichel 1971, 1974, 1975). Such distributions/functions are, on the other hand, particularly hard to interpret. We assume, on the contrary, that even the statistical behaviour, e.g. the frequency distribution, of an individual textual element has to be modelled by means of more than one probability distribution, because additional effects (subsidiary conditions) such as the influence of genre, text sort and other extra-linguistic factors cause differences which cannot be captured by different parameter values alone. The following paragraphs will try to show an approach to modelling the behaviour of units in a text. Here we will start from a selected unit and one of its properties, namely the word and its length (measured in terms of the number of syllables). The approach will be based on linguistic considerations and thus have a theoretical foundation. It will be easy to modify for the application to other units and properties and to extend without too complicated mathematics (cf. e.g. Wimmer, Altmann 2005). The distribution of word lengths without any relation to other properties of the word was investigated by Fucks (1955) and Čebanov (1974). These researchers arrived at a displaced Poisson distribution as an appropriate model. Frequent disagreements with data provoked Grotjahn (1982) to consider the parameter of the Poisson distribution as a Gamma-distributed variable, an approach which yields the negative binomial distribution. Already Fucks had pondered this idea (1970); it gives better agreement of the model with the data. We will show here that another approach yields the same result, and we will generalise it even more. On several occasions, Orlov showed that the Zipf-Mandelbrot Law holds for complete texts but not for text fragments. Boroda arrived at the same insight
from his work with musical texts (cf. Orlov, Boroda, Nadarejšvili 1982). Orlov concluded that the author of a text has a concrete text length in mind when s/he begins writing and spreads the information over the course of the text. In this way, a regular rank-frequency distribution of the words arises, which abides by the Zipf-Mandelbrot Law. Orlov called this theoretically assumed intended (but not always realised) text length the 'Zipf Number'. It should rather be called 'Zipf-Orlov Number' or 'Zipf-Orlov Length' as an homage to Orlov's hypothesis. From this finding it follows that an author does not (unconsciously) control the word frequencies but rather the distances between the adjacent frequency ranks on the basis of the intended text length. When we try to set up a theoretically justified model of the statistical structure of texts, we can therefore base our considerations on the formation of the differences Px − Px−1 = ΔPx−1 in order to calculate the theoretical probabilities. Let Px be the probability that the random variable (in our case word length) takes the value x, and let Δ be the difference operator as defined above. Throughout the following, we will consider the relative difference D = ΔPx−1/Px−1 and find out how it can be formed. To this end, we assume that any text and any textual unit is subject to Zipf's forces, which have universal effects in language. These forces are called unification and diversification; they are, depending on the circumstances, attributed to the speaker/writer and hearer/reader (cf. Zipf 1949). The basic idea is that, e.g., the lexical-semantic diversification of words corresponds to the speaker's advantage in a high potential of meanings that can be expressed by uttering a word, whereas the lexical-semantic unification, i.e. the limitation of the words' polysemy, decreases the hearer's effort in decoding the message. Short words are advantageous from the point of view of the speaker because they save production effort, while the hearer prefers longer ones because they are easier to recognise. Thus, in all fields of language, equilibration processes between extreme poles of properties take place. An author has several means to exert influence on the quantity D, depending on the form in which the property under study (here word length) is given: there are, e.g., languages which prefer two-syllable words over one-syllable ones. Another factor is the result of special constraints of the individual text sort (e.g., metre), the intended aesthetic effect (e.g., euphony), the regard for the intended audience (e.g., the flow of information in a poem vs. in a scientific paper), i.e. influences which result from the respect for the hearer/reader. Let us illustrate this consideration on a simple example. We will denote the 'force', the 'influence', the 'effect', or the 'proportionality' of the speaker by the symbol S and that of the hearer by H. Both can be expressed by real numbers. If both forces act in the form of an addition, and if they are constant and have a negative effect on D, we obtain
(2.4.1) $\frac{\Delta P_{x-1}}{P_{x-1}} = -(H + S)$.
We write A instead of H + S and solve (2.4.1) for Px. We obtain Px − Px−1 = −A Px−1, i.e. Px = (1 − A)Px−1. The solution is Px = (1 − A)^x P0. If 0 < A < 1, we can write A = p, 1 − A = q, hence Px = P0 q^x. Since $\sum_x P_x = 1$ we obtain

$1 = P_0\sum_{x=0}^{\infty} q^x = \frac{P_0}{1 - q} = \frac{P_0}{p}$,

from which follows that P0 = p; hence we obtain the geometric distribution

(2.4.2) $P_x = pq^x, \quad x = 0, 1, \ldots$
If we do not normalise, we obtain a simple geometric sequence. Other approaches are D = −H/S, D = −S/H, D = H − S if S > H, etc. These have the same mathematical results as long as the right-hand side has a value smaller than 1; when it is larger, we obtain an oscillating function. If D is positive, x must be smaller than a finite number, otherwise a diverging series results. Subtracting the proportional effect of the hearer (or the speaker, resp.) from the constant A and dividing it by the same quantity, i.e.

(2.4.3) $D = \frac{A - Hx}{Sx}$ or $D = \frac{A - Sx}{Hx}$,
yields the Poisson distribution with the parameter A/S, i.e. the same solution which Fucks derived for the word length distribution in another way. The technique shown here offers good linguistic interpretability, furthermore a connection to Katz's and Ord's (1967) systems, whose distributions partly coincide with ours, and, moreover, the chance to derive new distributions and to modify established ones. We are convinced that it is impossible to model a property with respect to every text and every language by means of only a single distribution (or simple function). Our assumption is also confirmed by the fact that up to now already 25 models of the word length distribution in texts have been derived and tested (cf. Popescu et al. 2013). The only way to set up a model which could cover the behaviour of a linguistic property in every variety is to find an extremely general mathematical formula which would fit all kinds of data. Such a model is, however, immune against any attempt to falsify it and hence scientifically useless. Furthermore, experience shows that a distribution which turns out to be sufficiently suitable for short texts might fail when applied to longer ones. This effect could be caused by the fact that long texts are not produced in one go but with temporal interruptions, resulting in changing stylistic modes in the form of varying word length rhythms. Therefore, longer texts may require mixed or composed distributions. The same holds, of course, for texts which have been edited and corrected, texts by more than one author, texts with several variants, etc. In the following, we will apply the above-sketched approach to derive the distribution of word lengths in texts and also a model of the sentence length distribution. (1) We assume that the relative difference D obtained from word lengths consists of the sum of a linear function of the hearer's effect and a parameter for the text sort, multiplied by an inverse linear proportion of the speaker's effect (without any additional constant), i.e.
(2.4.4) $\frac{\Delta P_{x-1}}{P_{x-1}} = \frac{A - Hx}{Sx}$.
Solving for Px yields

(2.4.5) $P_x = \left(1 + \frac{A - Hx}{Sx}\right)P_{x-1} = \frac{Sx + A - Hx}{Sx}P_{x-1} = \frac{A + (S - H)x}{Sx}P_{x-1}$.
Isolating S − H and substituting K = A/(S − H) + 1 transforms (2.4.5) to

(2.4.6) $P_x = \frac{S - H}{S}\cdot\frac{K + x - 1}{x}P_{x-1} = q\,\frac{K + x - 1}{x}P_{x-1}$,
by rewriting (S - H) / S as q. We assume that 0 < q < 1 because S > H (cf. above). Equation (2.4.6) can be solved stepwise:
$P_1 = qKP_0$
$P_2 = q^2\frac{K(K+1)}{2}P_0$
$\ldots$
$P_x = q^x\frac{K(K+1)\cdots(K+x-1)}{x!}P_0$,

or simply

(2.4.7) $P_x = \binom{K + x - 1}{x}q^x P_0$.

From $\sum_x P_x = 1$ follows

$1 = P_0\sum_{x=0}^{\infty}\binom{K + x - 1}{x}q^x = P_0\sum_{x=0}^{\infty}\binom{-K}{x}(-q)^x = P_0(1 - q)^{-K}$.

Setting p = 1 − q yields P0 = p^K, from which we obtain

(2.4.8) $P_x = \binom{K + x - 1}{x}p^K q^x, \quad x = 0, 1, 2, \ldots$,
i.e. the negative binomial distribution, which Grotjahn (1982) obtained for the word length distribution in another way. The constants here are not only functions of the interaction between speaker and hearer but also text parameters. Determining their numerical values is possible only by means of a broad empirical study. As can be seen, q = (S − H)/S and p = 1 − q = 1 − (S − H)/S = H/S; parameter A from (2.4.4) represents one of the above-mentioned forms of speaker-hearer interaction, e.g. H/S, and K = A/(S − H) + 1. Possibly, A is composed of more than a single interpretable quantity. Table 2.10 and Figures 2.1-2.4 show the fitting of the negative binomial distribution to the data from Goethe's letters which were analysed by Grotjahn. As can be seen, all the results are very good. We cannot conclude, however, that the numerical values of the parameters K and p obtained are characteristic of Goethe's letters or of Goethe as an author of letters. They are just the result of an iterative fitting procedure without any certainty that the global minimum has been found. Other iterative methods may yield slightly different results.
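Such an iterative fit is easy to sketch with standard tools. The following fragment is our illustration, not the book's procedure; scipy is assumed to be available, and the start values are arbitrary. It minimises the chi-square of the 1-displaced negative binomial over K and p with the simplex method and, for letter No. 612, lands near the values reported in Table 2.10 (without the pooling used there, the optimum may differ slightly).

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import nbinom

    f = np.array([164, 105, 35, 15, 4])      # letter No. 612, x = 1..5
    N = f.sum()
    x = np.arange(len(f))                    # support after the 1-displacement

    def chisq(params):
        K, p = params
        if K <= 0 or not 0.0 < p < 1.0:
            return np.inf
        NP = N * nbinom.pmf(x, K, p)         # P(X=x) = C(K+x-1,x) p^K (1-p)^x
        return ((f - NP) ** 2 / NP).sum()

    res = minimize(chisq, x0=[2.0, 0.8], method="Nelder-Mead")
    print(res.x, res.fun)                    # K, p near 6.31, 0.90 (cf. Table 2.10)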
Table 2.10: Fitting the negative binomial distribution to some letters by Goethe

       No. 612           No. 647           No. 659           No. 667
x      fx      NPx       fx      NPx       fx      NPx       fx     NPx
1      164     162.61    259     259.16    151     150.91    77     76.15
2      105     104.38    132     125.65     68      65.64    51     53.37
3       35      38.81     37      46.65     16      19.81    26     24.56
4       15      10.94     19      15.55      7       5.10    10      9.33
5        4       3.26      6       4.89      1       1.54     4      4.59
6        –       –         1       2.10      –       –        –      –
k      6.3120            1.8818            2.5790            3.1954
p      0.8985            0.7424            0.8314            0.7807
X²     3.47              3.91              1.71              0.32
DF     2                 3                 2                 2
P      0.1764            0.2713            0.4253            0.8521

Fig. 2.1. Graph of the fit of the negative binomial distribution to the data of letter No. 612
Fig. 2.2. Graph of the fit of the negative binomial distribution to the data of letter No. 647
Fig. 2.3. Graph of the fit of the negative binomial distribution to the data of letter No. 659
Fig. 2.4. Graph of the fit of the negative binomial distribution to the data of letter No. 667
(2) The same approach can be employed to find a model of the sentence length distribution in a text. As to measuring sentence length, a warning is in order: counting words instead of clauses as a measure of sentence length causes an extra amount of variance. As an intermediate level (the clause level) is skipped, additional 'noise' will occur in the data. Therefore, this factor should be taken into account when the model is formulated. Let us denote this factor by B and set up the formula

(2.4.9) $\frac{\Delta P_{x-1}}{P_{x-1}} = \frac{A - Hx}{Sx + B}$.
Consequently,

$P_x = \left(1 + \frac{A - Hx}{Sx + B}\right)P_{x-1} = \frac{Sx + A + B - Hx}{Sx + B}P_{x-1} = \frac{A + B + (S - H)x}{B + Sx}P_{x-1}$.
Factoring out (S − H) in the numerator and S in the denominator yields

(2.4.10) $P_x = \frac{S - H}{S}\cdot\frac{\frac{A + B}{S - H} + x}{\frac{B}{S} + x}P_{x-1}$.
We denote

$q = \frac{S - H}{S}, \quad K - 1 = \frac{A + B}{S - H}, \quad R - 1 = \frac{B}{S}$

and rewrite (2.4.10) as

(2.4.11) $P_x = \frac{K + x - 1}{R + x - 1}\,qP_{x-1}$,
yielding recursively the solution

(2.4.12) $P_x = \frac{K(K + 1)\cdots(K + x - 1)}{R(R + 1)\cdots(R + x - 1)}\,q^x P_0$,
a result which can be written in various ways. P0 is again a consequence of the fact that $\sum_x P_x = 1$ and can be presented as

(2.4.13) $P_0 = \frac{1}{{}_2F_1(K, 1; R; q)}$,
where the denominator is formulated with the help of the hypergeometric function. The distribution (2.4.13) is a special case of Ord's system and is called the Hyperpascal distribution. Table 2.11 and Figures 2.5-2.8 show some results of fitting the Hyperpascal distribution to the sentence length data in texts by Herodot as determined by Morton (1965) and Morton, Levison (1966). We cannot exclude that even better fitting results can be achieved on the basis of other parameter values. For the sake of the present demonstration these results are rather satisfying. The technique presented so far is extendable almost without any limits (cf. Wimmer, Altmann 2005). The following rules should be observed for its application. (a) In every case, among the fitting models the simplest one should be selected, i.e. the one with as few parameters as possible. The simplicity criterion is opposed to a number of other scientific principles (cf. Bunge 1998), but a simple model is a better initial point than a complicated one. When a simple model which is a special case of a more complex one is sufficient for a given purpose, it should be retained. Word length distributions can be modelled using the geometric distribution if the frequencies decrease monotonically, for the geometric distribution is a special case of the negative binomial distribution with K = 1. When a monotonous decrease is not given, the Poisson distribution can be recommended, as it is a limiting case of the negative binomial distribution (if K → ∞ and q → 0 such that Kq → λ).

Table 2.11: Fitting the Hyperpascal d. to sentence lengths in Herodot (data from Morton 1965)
       No. 1           No. 2           No. 3           No. 4
x      fx     NPx      fx     NPx      fx     NPx      fx     NPx
0      16     16.08     6      4.75     7      6.87     8      8.07
1      35     36.19    33     30.81    48     45.44    38     37.05
2      47     43.05    34     38.99    37     40.54    49     45.05
3      39     37.82    39     35.90    34     31.62    38     39.71
4      24     27.62    26     28.56    21     23.44    24     27.99
5      16     17.77    22     20.88    23     16.92    15     18.23
6      11     10.42    19     14.43     6     12.02    15     11.06
7       5      5.69     5      9.59    11      8.44     8      6.36
8       4      2.93     4      6.19     6      5.88     2      3.54
9       1      1.44     6      3.91     2      4.08     0      1.90
10      1      0.68     1      2.42     3      2.81     1      1.00
11      1      0.31     0      1.48     2      1.93     0      0.51
12      –      –        0      0.89     –      –        1      0.26
13      –      –        0      0.53     –      –        0      0.13
14      –      –        1      0.31     –      –        0      0.06
15      –      –        2      0.18     –      –        0      0.03
16      –      –        1      0.11     –      –        0      0.02
17      –      –        1      0.06     –      –        0      0.01
18      –      –        –      –        –      –        1      0.00
K      5.9987          1.7383          0.3997          2.5904
R      0.8277          0.1414          0.0401          0.2360
q      0.3106          0.5275          0.6630          0.4185
X²     1.72            5.33            7.94            6.18
DF     6               6               8               7
P      0.94            0.50            0.44            0.52

Fig. 2.5. Fitting the Hyperpascal d. to sentence lengths in Herodot No. 1 (cf. Table 2.11)
Fig. 2.6. Fitting the Hyperpascal d. to sentence lengths in Herodot No. 2 (cf. Table 2.11)
Fig. 2.7. Fitting the Hyperpascal d. to sentence lengths in Herodot No. 3 (cf. Table 2.11)
Fig. 2.8. Fitting the Hyperpascal d. to sentence lengths in Herodot No. 4 (cf. Table 2.11)
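The NPx columns of Table 2.11 can be recomputed from the recurrence (2.4.11); the following sketch (ours) normalises numerically over a truncated support instead of evaluating the hypergeometric function of (2.4.13) in closed form. The truncation point is an arbitrary choice that is safe for q well below 1.

    def hyperpascal(K, R, q, x_max=100):
        """Hyperpascal probabilities P_0..P_x_max via the recurrence (2.4.11)."""
        terms = [1.0]                         # unnormalised P_0
        for x in range(1, x_max + 1):
            terms.append(terms[-1] * q * (K + x - 1) / (R + x - 1))
        s = sum(terms)                        # approximates 2F1(K, 1; R; q)
        return [t / s for t in terms]

    # Herodot No. 1 (Table 2.11): K = 5.9987, R = 0.8277, q = 0.3106, N = 200
    P = hyperpascal(5.9987, 0.8277, 0.3106)
    print([round(200 * p, 2) for p in P[:4]])  # ~ [16.08, 36.19, 43.05, 37.82]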
If these two distributions do not give a satisfying result, the negative binomial distribution should be used. This distribution can be generalised in various ways (cf. e.g. Patil, Joshi 1968); one of these options was shown above. Sometimes it may happen that not enough degrees of freedom are left if the applied distribution has too many parameters. One way to overcome such a situation is the amplification of the data set such that more classes are obtained. When this option is not available, an additional class with zero frequency may help. (b) When a model is applied to a big mixed sample such as a text collection or a corpus, an important fact has to be taken into account: the parameters of the basic distribution (here the negative binomial distribution or the Hyperpascal distribution) are not the same in every sample. In such cases, a mixture of identical distributions with different parameters can be pondered, i.e.
(2.4.14) $P_x = \sum_{i=1}^{k}\alpha_i f(x, \Theta_i)$,

where $f(x, \Theta_i)$ represents the probability function of the components, the αi are real numbers with $\sum_{i=1}^{k}\alpha_i = 1$, and Θi stands for all the parameters of the i-th component; another way is to consider the parameters of the basic distribution as random variables and to form a compound distribution of the type

$P_x = \int_{\Theta} f(x, \Theta)\,g(\Theta)\,d\Theta$

or

$P_x = \sum_{\Theta} f(x, \Theta)\,g(\Theta)$

or

(2.4.15) $P_x = \iint f(x, \Theta_1, \Theta_2)\,g(\Theta_1)\,h(\Theta_2)\,d\Theta_1\,d\Theta_2$
etc. In this way, it is always possible to achieve a numerically satisfying result, which is, however, often hard to interpret theoretically. We assume that in long texts the unconscious control of the word lengths (or other properties) interferes considerably. If N is large, deviations must be expected even if the model works well otherwise; these deviations may push the P-values considerably below the conventional limits (0.05, 0.01). It is, therefore, advisable to reconsider the critical values for linguistics and for individual text properties individually.
(c) Some distributions accept under certain conditions also negative parameters. Such values can be linguistically interpreted within our approach without any problems, but they should be avoided nevertheless. Usually, iterative methods are available which provide equivalent fitting options yielding positive parameters. (d) Classical estimation methods such as maximum likelihood, the moment method, the method of least squares etc. often fail to yield acceptable fitting results. An alternative is offered by iterative optimisation procedures. We recommend this method and use it throughout this volume. (e) We presented the derivation of the distributions in a way that P0 ≠ 0. Our word length distributions begin at x = 1, i.e. we do not assume the occurrence of zero-length words. Therefore, in each case, the probability distributions have to be formally displaced; thus, Px = f(x − 1, Θ) for x = 1, 2, ... In the case of the variable sentence length we transformed the intervals 1..5, 6..10, 11..15, ... onto the scale 0, 1, 2, ..., viz. x = (yu − 5)/5, where yu represents the upper limit of the interval. In this way, as far as we can see, every distribution of a property of a linguistic unit can be modelled. (f) Another trap that must be avoided is the tendency to implicitly identify the relation between sample and population with the relation between text and corpus, or even the relation between corpus and language. Statistical methods enable us to draw reliable conclusions in spite of fragmentary information. Samples, if randomly drawn, large enough, and meeting some other requirements, allow statistical induction, i.e. reliable statements about the “population”. An individual text, however, is a semiotic system, not a random sample. Moreover, the relation between texts or corpora and language is of an entirely different nature. Texts and corpora are instances of (observable) linguistic behaviour, whereas language is an (unobservable) abstract concept, whence a statistical induction from texts or corpora to language “as a whole” is impossible on methodological and epistemological grounds. This fact can simply be expressed by Orlov's (1982) dictum: there are no populations in language. Moreover, the notion of "language as a whole" as often used in corpus linguistics is clearly a confusion of a scientific model with reality. “Language” is of course a simplification and an abstraction of what linguists observe when studying the linguistic behaviour of people. Language is not observable as such, because it is a model of what is observed. Researchers can therefore observe units of linguistic behaviour either directly, i.e. (oral or written) texts, or indirectly by working with data which describe aspects of linguistic behaviour, such as dictionaries and grammars. Text collections and corpora doubtlessly qualify as manifestations of linguistic behaviour, and they are often believed to provide
particularly reliable information because of their size. On the other hand, their inhomogeneity, which increases with increasing size, causes methodological problems. Some of them can be overcome in the following way. According to Orlov, Boroda and Nadarejšvili (1982), a law which holds for an individual text need not necessarily be valid also for an imaginary population of all texts of the given kind (cf. also item (b)). We will illustrate the effect of collecting texts and forming a composed sample on an example. Table 2.10 gives the results of fitting the negative binomial distribution to the data from text No. 647 (k = 1.8819, p = 0.74224, X² = 3.91, df = 3, P = 0.27) and text No. 612 (k = 6.3120, p = 0.8983, X² = 3.47, df = 2, P = 0.18).
It would be a mistake to expect that combining the data could stabilise the result. Summing up the frequencies gives the data presented in Table 2.12 and visualised in Figure 2.9, yielding worse fitting results than for the two data sets separately. It is to be noted that word length research is one of the most popular domains of quantitative linguistics. There are monographs, omnibus volumes and individual articles in which about 3000 texts in 50 languages have been analysed. The number of models increases steadily (cf. Schmidt 1996; Best 1997, 2001; Grzybek 2006; Đuraš 2012; Köhler, Altmann 2013).

Table 2.12: Fitting the negative binomial distribution to the mixture of letters No. 612 and No. 647 by Goethe

x      fx     NPx
1      423    422.36
2      237    230.40
3       72     84.96
4       34     26.32
5        7      7.38
6        1      2.58

k = 2.842; p = 0.806; X² = 5.39; DF = 3; P = 0.15
Fig. 2.9. Fitting the negative binomial distribution to the mixture of letters No. 612 and No. 647 by Goethe (cf. Table 2.12)
Last but not least it is to be emphasised again that continuous models, and simple functions instead of distributions, may also be fitted to a set of discrete values. Every model is merely an approximation to the unknown truth, a means to make reality structured and explainable. What is more, discrete and continuous functions or distributions are mutually transformable (cf. Mačutek, Altmann 2007).
2.5 Some text laws

In the following, we shall present four well-known models, which display a formless type of repetition. Two of them are based on probability distributions; the other ones have been built on the basis of functions.
2.5.1 The Zipf-Mandelbrot Law

No other achievement in linguistics (including structuralism and generative linguistics) provoked such an enduring echo in almost every scientific discipline as the impetuously disputed Zipf-Mandelbrot law. It occurs in each of the humanities, in geography, biology, musicology, documentation sciences, mathematics, physics, economics etc., and finally in systems theory. A simpler variant was originally created by Estoup (1916); a deep analysis of the problem was conducted much later in numerous publications by George Kingsley Zipf
who motivated the observed regularity by means of the assumption that human beings tend to minimise their effort (cf. particularly Zipf 1935, 1949). Many other motivations, derivations and modifications of the model followed; hundreds of publications can be found in the literature (cf. Guiter, Arapov 1982 and especially the bibliography prepared by Wentian Li: http://www.nslij-genetics.org/wli/zipf). The most popular ones are Mandelbrot (1953, 1954a,b, 1957), Arapov, Efimova, Šrejder (1957a,b), Orlov (cf. Orlov, Boroda, Nadarejšvili 1982). A simple presentation can be found in Piotrowski (1984: 125f), Rapoport (1982), Mandelbrot (1966), Miller, Chomsky (1963) and Brookes (1982). The Zipf-Mandelbrot law states the following: if the frequencies of individual units in a text are determined and arranged according to the rank of their frequencies such that the most frequent element is assigned rank 1, the second most frequent one rank 2 and so on, then the relative frequencies (which are used as estimations of the probabilities pr) are distributed according to the formula

(2.5.1) $P_r = \frac{K}{(A + r)^\gamma}, \quad r = 1, 2, \ldots, n$.
Here n is the size of the inventory of the units, A and γ are constants, r is the rank, and K is the normalising constant

$K = \left(\sum_{r=1}^{n}(A + r)^{-\gamma}\right)^{-1}$,

which depends on A and γ. Some questions and problems are connected with this approach; it is not yet possible to clarify them satisfactorily in all cases. (1) A rank is actually not a probability variable but just an auxiliary variable. There are, however, several approaches to clarify this problem satisfactorily. We will show a solution with the help of the less well-known approach by Miller (1957). (2) For which kind of linguistic objects does this law hold? Orlov and Boroda (cf. Orlov, Boroda, Nadarejšvili 1982) showed that it is valid only for complete texts, irrespective of whether they are linguistic or musical ones. This fact can be interpreted in the following way: an author unconsciously plans the length of the text (Zipf-Orlov Length, as it may be called) in advance. In dependence on this length s/he controls the flow of information in the text, which results in a specific rank-frequency distribution. Taking a sample (e.g. randomly) or a fragment from a text therefore yields a heavily distorted picture of this distribution. From this consideration an answer to another question arises: a mixture of a number of texts gives a rank-frequency distribution which can be characterised only by means of as many parameters as texts were involved. This explains why samples, text fragments, text
collections and corpora do not make any sense as sources of data with respect to Zipf's Law or the Zipf-Mandelbrot Law. (3) Which are the units for which the Law holds? This is not yet fully known. The rank-frequency distribution of sounds, phonemes, and characters in texts was successfully modelled by means of the Zipf-Mandelbrot d. but also using various other probability distributions such as the geometric d., the Yule d., the negative hypergeometric d., the partial sums d. (cf. Zipf 1929, 1935, 1949; Yule 1924; Martindale, McKenzie, Gusein-Zade, Borodovsky 1996; Best 2005; Grzybek, Kelih, Stadlober 2009; Good 1969) and by means of functions (Tuldava 1988; Altmann 1993). Only a few of the publications provide a theoretical justification for their models, but a recent approach gives a general background for all of them (Wimmer, Altmann 1999). The distribution of morphs and morphemes has not yet been studied. Parts of speech were subject to rank-frequency studies on the basis of Altmann's ranking function (Altmann 1993; Best 1997) and the Popescu-Altmann model (Popescu, Altmann and Köhler 2009), which has the form of a function, too. Even syntactic constructions were scrutinised with respect to their frequency distribution. In Köhler & Altmann (2000), a corresponding investigation was performed, which yielded the Waring d., because instead of the rank-frequency variant of the Zipf-Mandelbrot Law its complement, the frequency spectrum, was chosen. Also the recently introduced unit 'motif' (we will discuss this unit in a later section) was shown to display a lawful behaviour and to abide by the Zipf-Mandelbrot Law (cf. Köhler & Naumann 2010). The investigation of the rank-frequency distributions of linguistic units and of the parameters of these distributions opens up enormous perspectives. (4) Some units are easy to segment and to classify, e.g. phonemes. Problems arise with words. In which way should they be counted? As lexemes or word-forms? Which variant is the one which is meant by the Zipf-Mandelbrot Law? Miller's interpretation (see below) suggests that it is word-forms, not lexemes (via lemmas). The need for an optimal flow of information, i.e. the need for a controlled increase of word-forms, however, seems to refer to lexemes rather than to word-forms. Empirical research will shed light on this problem. (5) The Zipf-Mandelbrot Law was very often applied, but its agreement with the given data was only rarely tested. We will present three examples to show how to test the conformity of the Law with data by fitting the parameters to the data and performing a goodness-of-fit test.
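Such a fit can be sketched as follows (our illustration, not the book's program; scipy is assumed, and the start values are arbitrary). The parameters A and γ of (2.5.1) are estimated by minimising the chi-square criterion with the Nelder-Mead simplex; K follows from normalisation.

    import numpy as np
    from scipy.optimize import minimize

    def zm_expected(A, g, n, N):
        """Expected frequencies N*P_r of the Zipf-Mandelbrot law (2.5.1)."""
        w = (A + np.arange(1, n + 1)) ** (-g)
        return N * w / w.sum()                # normalisation replaces K

    def fit_zipf_mandelbrot(freqs):
        f = np.asarray(freqs, dtype=float)
        n, N = len(f), f.sum()
        def chisq(params):
            A, g = params
            if A <= 0 or g <= 0:
                return np.inf
            NP = zm_expected(A, g, n, N)
            return ((f - NP) ** 2 / NP).sum()
        return minimize(chisq, x0=[1.0, 1.0], method="Nelder-Mead")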
Let us first consider the ordered frequencies of the phonemes in the text “Erlkönig” (cf. Table 2.13) as determined by Grotjahn (1979: 182-3). We use the optimisation method by Nelder and Mead (1965) to estimate the parameter values, and obtain the results presented in the third column of Table 2.13 and visualised in Figure 2.10. The chi-square test yields X² = 30.90 with 36 degrees of freedom, which corresponds to P = 0.71. We can therefore accept the adequacy of the Zipf-Mandelbrot Law as far as these data are concerned. Grotjahn differentiated short and long vowels and considered the diphthongs as phonemes according to German phonology. A different definition could certainly yield somewhat different parameter values. Now let us consider the rank-frequency distribution of the words in the text. We meet some problems here. Focussing on word-forms has the consequence that “Erlkönig” and “Erlkönigs” are two different words. One may be in doubt, however, in the case of the word variant “Erlenkönig” in the fourth stanza. Separable verb prefixes as in German and Hungarian are another source of problems, as many of them are ambiguous and could also be analysed as adverbs.

Table 2.13: Rank-frequency distribution of phonemes in „Erlkönig“ by Goethe (according to Grotjahn 1979)

Rank r   Frequency fr   NPr (2.5.1)      Rank r   Frequency fr   NPr (2.5.1)
1        111            101.85           21       14             10.92
2         97             86.16           22       14             10.13
3         66             73.59           23       13              9.41
4         66             63.40           24       13              8.77
5         51             55.04           25       12              8.18
6         41             48.12           26       10              7.64
7         35             42.33           27        9              7.15
8         34             37.45           28        7              6.70
9         33             33.31           29        4              6.29
10        27             29.77           30        4              5.92
11        27             26.72           31        4              5.57
12        25             24.09           32        3              5.25
13        23             21.80           33        3              4.96
14        21             19.79           34        2              4.68
15        17             18.03           35        2              4.43
16        17             16.47           36        2              4.20
17        17             15.10           37        1              3.98
18        16             13.87           38        1              3.78
19        15             12.78           39        1              3.59
20        15             11.80

N = 873; A = 14.8743; B = 2.7385; K = 226.4899
X² = 30.90; DF = 36; P = 0.71
Anyway, a mechanical procedure would count them as words in their own right, whereas they can be seen as parts of discontinuous verbal word-forms resp. lexemes. Suppletivism adds difficulties: “am”, “are”, “is”, “be”, “was”, “were”, “been” look absolutely different but should be counted as word-forms of a single lexeme. Ambiguities do not simplify the analysis: are “be” (infinitive) and “be” (conjunctive) just one word-form? Decisions must be made as to whether “I”, “my”, “mine” are word-forms of one lexeme or not, etc. Many of such questions will have to be answered on the background of the given linguistic approach, i.e. the applied grammatical model; others depend on the purpose of the study. The rank-frequency distribution of the word-forms in “Erlkönig” is presented in Table 2.14(a) and that of the lexemes in Table 2.14(b). The number at the bottom means that all the ranks 40 to 124 in Table 2.14(a) correspond to frequency 1; analogously the ranks 35 to 97 in Table 2.14(b). The ranks corresponding to frequency 1 were pooled for the chi-square test such that the theoretically expected frequency in each group was at least 1.0. In this way, the presented degrees of freedom were obtained. Better fitting results are practically impossible. As can be seen, the parameter values are rather different in the three studied cases (phonemes in Table 2.13, word-forms and lexemes in Table 2.14). It is still unknown how the parameters depend systematically on linguistic levels, text sorts, language types, text length etc. With respect to another law, the Menzerath Law, a study has revealed that the parameters of that law do not depend so much on language, author, or text type as had been expected, but rather on the level of linguistic analysis, i.e. on the units under investigation (Cramer 2005a,b). Thus, the more the (negative) value of the parameter b approaches 0, the higher the level of analysis (where phonemes and graphemes are considered as low and sentences as high). Mandelbrot modified Zipf's formula on the basis of an economic argumentation, resulting in the form (2.5.1). Most authors agree with his justification of the Law. Miller (1957), however, presented another interpretation, which is rarely mentioned in the literature. We will discuss it here in order to show another view of rank-frequency distributions.
Fig. 2.10. Graph of the empirical (grey bars) and theoretical (light grey bars) frequency distribution in Table 2.13
Imagine a monkey “types” on a typewriter (or, more up to date, a computer keyboard); we assume that the keys are pressed randomly, where the white space has probability p and all the other keys together probability 1 − p = q. We assume also that the monkey never types two spaces in sequence. The produced “text” would then consist of series of i letters (i = 1, 2, 3, …) separated by spaces. We can thus expect that we would obtain sequences of length i with probability Pi = pq^(i−1), i = 1, 2, …, i.e. a monotonic, geometric decrease of the frequencies with increasing lengths of the “words”. Let us further assume that the keyboard has n keys besides the space. If every combination of letters is admitted and all of them have the same probability, then exactly n^i words of length i can be formed.
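This mechanism is easy to simulate. The following sketch is ours; the inventory size, the space probability and the sample size are arbitrary choices. It generates such a random text and shows the geometric length distribution and the resulting rank order.

    import random
    from collections import Counter

    random.seed(1)
    letters, p, n_words = "abcde", 0.2, 100_000   # n = 5 keys, space probability p

    def random_word():
        w = random.choice(letters)                # at least one letter, so no two
        while random.random() > p:                # spaces occur in sequence
            w += random.choice(letters)
        return w

    words = [random_word() for _ in range(n_words)]
    print(Counter(len(w) for w in words))         # lengths ~ geometric p*q^(i-1)
    top = Counter(words).most_common(10)
    print([len(w) for w, _ in top])               # short words occupy the top ranks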
Table 2.14: Rank-frequency distribution (a) of word-forms, (b) of lexemes in Goethe's „Erlkönig“

           (a) Word-forms            (b) Lexemes
Rank       Frequency   NPx           Rank       Frequency   NPx
1          11          11.42         1          24          23.10
2           9           9.14         2          14          15.14
3           9           7.72         3          10          11.47
4           7           6.74         4          10           9.32
5           6           6.01         5           9           7.89
6           6           5.45         6           9           6.87
7           5           4.99         7           6           6.11
8           5           4.62         8           6           5.50
9           4           4.31         9           5           5.02
10          4           4.05         10          5           4.62
11          4           3.82         11          4           4.29
12          4           3.62         12          4           4.00
13          4           3.44         13          4           3.75
14          4           3.29         14          4           3.54
15          4           3.15         15          4           3.35
16          3           3.02         16          3           3.18
17          3           2.90         17          3           3.03
18          3           2.80         18          3           2.89
19          3           2.70         19          3           2.77
20          3           2.61         20          3           2.66
21          3           2.53         21          3           2.55
22          2           2.46         22          2           2.46
23          2           2.39         23          2           2.37
24          2           2.32         24          2           2.29
25          2           2.26         25          2           2.22
26          2           2.20         26          2           2.15
27          2           2.14         27          2           2.06
28          2           2.09         28          2           2.02
29          2           2.04         29          2           1.97
30          2           2.00         30          2           1.91
31          2           1.95         31          2           1.86
32          2           1.91         32          2           1.81
33          2           1.87         33          2           1.77
34          2           1.84         34          2           1.73
35          2           1.80         35-97      62          67.05
36          2           1.77
37          2           1.73
38          2           1.70
39          2           1.67
40-124     85          90.53

(a) N = 225; A = 1.7139; B = 0.7090; K = 0.1030; X² = 5.91; DF = 99; P = 1.00
(b) N = 225; A = 0.5033; B = 0.8276; K = 0.1439; X² = 5.69; DF = 78; P = 1.00

Fig. 2.11. Histogram of the empirical and theoretical distribution in Table 2.14(a)
Thus, the probability of a given word of length i is Pi divided by the number of all possible words of length i, and therefore

(2.5.2) $\frac{P_i}{n^i} = pq^{i-1}n^{-i}$.
We write the left-hand side of this equation as p(w, i) for the probability of words of length i. Using the relation $a^x = e^{x\ln a}$, this can be written as

(2.5.3) $p(w, i) = \frac{p}{q}\,e^{i\ln(q/n)}$.
Fig. 2.12. Histogram of the empirical and theoretical distribution in Table 2.14(b)
With n keys there are n words of length 1, n² words of length 2, n³ words of length 3 etc. Together this sums up to

(2.5.4) $\sum_{i=1}^{k} n^i = \frac{n(1 - n^k)}{1 - n}$
words of a length equal to or less than k. Now let us arrange the words according to their lengths and assign them ranks. Words of length 1 will be assigned the ranks 1 up to n; two-letter words the ranks n + 1 up to n(1 − n²)/(1 − n) in conformity with (2.5.4); three-letter words have the ranks n(1 − n²)/(1 − n) + 1 to n(1 − n³)/(1 − n), etc. An individual word w of length i will on average be assigned the rank r(w, i), which is calculated as the centre value of the rank interval assigned to the words of length i:
(2.5.5) $r(w, i) = \frac{1}{2}\left[\frac{n(1 - n^{i-1})}{1 - n} + 1 + \frac{n(1 - n^i)}{1 - n}\right] = \frac{(n + 1)n^i}{2(n - 1)} - \frac{n + 1}{2(n - 1)}$.

Rearranging this equation and applying the relation $a^x = e^{x\ln a}$ yields

(2.5.6) $\frac{2(n - 1)}{n + 1}\left(r(w, i) + \frac{n + 1}{2(n - 1)}\right) = n^i = e^{i\ln n}$.
We can insert this result into (2.5.3). As the exponent is negative in that formula, let us rewrite (2.5.3) and obtain

(2.5.7) $p(w, i) = \frac{p}{q}\,e^{-i\ln(n/q)}$.
This equation can be inserted into (2.5.6), yielding

(2.5.8) $p(w, i) = \frac{p}{q}\left[\frac{2(n - 1)}{n + 1}\left(r(w, i) + \frac{n + 1}{2(n - 1)}\right)\right]^{-\ln(n/q)/\ln n}$.

Substituting

$A = \frac{n + 1}{2(n - 1)}, \quad \gamma = \frac{\ln(n/q)}{\ln n}, \quad K = \frac{p}{q}\left\{\frac{n + 1}{2(n - 1)}\right\}^{\gamma}$

enables us to reformulate the above formula as

$p(w) = \frac{K}{[r(w) + A]^{\gamma}}$,
which is identical with (2.5.1). Miller's derivation has the advantage that the parameters of the resulting formula have an immediate interpretation and that economic reasoning is not needed. Its disadvantage is that the derivation is plausible only for word-forms, not for other linguistic units. Moreover, the model entails that every text in a given language displays the same parameter values, which does not conform with reality. And furthermore, one of the assumptions which Miller's model depends on, viz. that every combination of phonemes be admitted, is not met by natural languages, i.e. the assumption of uniformity is not fulfilled in any case. Finally, the probabilities of the individual phonemes or characters are not identical. On the contrary, we know that the corresponding inventories have specific probability distributions. For further reading in connection with rank-frequency distributions, the following publications can be recommended: Woronczak (1967), Kalinin (1956, 1964), Segal (1961), Mandelbrot (1961), Belonogov (1962), Simon (1955), Carroll (1968), and Li's bibliography online (http://www.nslij-genetics.org/wli/zipf).
2.5.2 The Simon-Herdan Model

The previous section was devoted to word frequencies analysed by means of a rank-frequency distribution. Another way to capture frequencies is based on frequency classes (also called the frequency spectrum). This is nothing but a reverse representation of the data and the theoretical probabilities. The classes are formed by words with equal frequency. A look at Table 2.14(a) shows that there are exactly 85 words which occur once in “Erlkönig” (the so-called hapax legomena); 18 words occur two times (dis legomena), etc. The first column in Table 2.15 contains the number of occurrences, which is our new random variable X; the second column shows the number fx of words with the given number of occurrences. The first to conduct substantial empirical investigations in this form was Yule (1944); he determined the spectra of word frequencies without any theoretical model yet. Simon (1955) derived a theoretical distribution on the basis of a stochastic process and called it the Yule distribution. This distribution, which can be, for our purposes, written in the form
px =
b x − b x − = x = b + x x b + x
with c x = c c − c − c − x +
resp. c x = c c + c + c + x −
can be fitted to many empirical distributions.
Table 2.15: Frequency distribution of word forms in Goethe's "Erlkönig"

x       fx     NPx (Yule)    NPx (Waring)
1       85     85.00         84.98
2       18     20.34         20.57
3        6      7.85          7.90
4        7      3.81          3.80
5        2      2.12          2.10
6        2      1.30          1.27
7        1      0.85          0.82
8        0      0.58          0.56
9        2      0.42          0.40
10       0      0.31          0.29
≥11      1      1.42          1.21
N = 124        b = 2.1795    b = 2.2881
               X² = 4.21     n = 1.0507
               DF = 6        X² = 4.39
               P = 0.65      DF = 5
                             P = 0.49

Fig. 2.13. Fitting the Yule (left) and the Waring (right) distributions to the word-form frequencies in Goethe's "Erlkönig" (cf. Table 2.15)
There are, however, also data which cannot be approximated using this formula. The parameter b can be estimated, e.g., on the basis of the empirical mean

(2.5.10)   \mu = \frac{b}{b-1}

such that

(2.5.11)   b = \frac{\bar{x}}{\bar{x}-1}

or using the frequency of the first class as

(2.5.12)   b = \frac{f_1}{N - f_1}

In our example, we obtain from (2.5.12)

b = \frac{85}{124 - 85} = 2.1795

According to (2.5.9) we have

P_1 = \frac{2.1795}{2.1795 + 1} = 0.6855

and

NP_1 = 124(0.6855) = 85.00

The rest of the values is calculated using the recurrence formula

(2.5.13)   NP_x = \frac{x-1}{b+x}\,NP_{x-1}

e.g.,

NP_2 = \frac{1}{2.1795 + 2}(85.00) = 20.34, \quad NP_3 = \frac{2}{2.1795 + 3}(20.34) = 7.85
etc. All the values can be found in the third column of Table 2.15. The goodness-of-fit is very good. Optimisation can improve it further, with the result b = 1.9807, X² = 3.72, DF = 6, P = 0.71. Here the model fits very well, but it can be shown that this is not always the case. To circumvent this disadvantage, several new models have been developed. Haight and Jones (1974) generalised Simon's stochastic process and obtained a new class of distributions; the Yule distribution is a special case of this class (cf. also Lánský, Radil-Weiss 1980). Sichel (1975) generalised the Poisson distribution and also obtained a new class of which the Yule distribution is a special case. Orlov (Orlov, Boroda, Nadarejšvili 1982) brought the frequency distribution (spectrum) into connection with the Zipf-Mandelbrot Law. Although this approach seems to be very promising, the model does not fit the data satisfactorily.
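The estimation and recurrence steps above are easy to mechanise. The following minimal Python sketch (our illustration; the function name is ours) estimates b from the first class according to (2.5.12) and computes the expected frequencies via (2.5.13):

def yule_expected(fx, n_classes=11):
    # Fit the 1-displaced Yule distribution to a frequency spectrum.
    # fx[x-1] = observed number of words occurring x times.
    # Returns (b, expected NPx values); the last class collects the rest.
    N = sum(fx)
    b = fx[0] / (N - fx[0])          # estimator (2.5.12)
    npx = [N * b / (b + 1)]          # NP_1 = N * P_1
    for x in range(2, n_classes):
        npx.append(npx[-1] * (x - 1) / (b + x))   # recurrence (2.5.13)
    npx.append(N - sum(npx))         # open class '>= n_classes'
    return b, npx

# word-form spectrum of "Erlkönig" (Table 2.15)
fx = [85, 18, 6, 7, 2, 2, 1, 0, 2, 0, 1]
b, npx = yule_expected(fx)
print(round(b, 4))                   # 2.1795
print([round(v, 2) for v in npx])    # 85.00, 20.34, 7.85, ...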
Gustav Herdan (1964) took another way: he applied the Waring distribution, which we can write in the form

(2.5.14)   P_x = \frac{b}{b+n} \cdot \frac{n_{(x-1)}}{(b+n+1)_{(x-1)}}, \quad x = 1, 2, \ldots

The Yule as well as the Waring distribution are given here in the so-called 1-displaced form. The Waring distribution has often been used to successfully fit linguistic data (cf. Muller 1965, 1968, 1977; Tešítelová 1967). As can easily be seen, the Yule distribution is a special case of the Waring distribution with n = 1, for in this case (2.5.14) becomes

P_x = \frac{b}{b+1} \cdot \frac{1_{(x-1)}}{(b+2)_{(x-1)}} = \frac{b}{b+1} \cdot \frac{(x-1)!}{(b+2)(b+3)\cdots(b+x)} = \frac{b\,(x-1)!}{(b+1)(b+2)\cdots(b+x)}
which is identical with (2.5.9). The estimation of the parameter values can be done, among other methods, in the following way:
(2.5.15)   b = \frac{(\bar{x}-1)f_1}{\bar{x}f_1 - N}, \quad n = \frac{(\bar{x}-1)(N - f_1)}{\bar{x}f_1 - N}

The recurrence formula is

(2.5.16)   P_x = \frac{n+x-2}{b+n+x-1}\,P_{x-1}

In our example, we have \bar{x} = 1.8145, so that b = 2.2881 and n = 1.0507.
The calculated NP_x are given in the fourth column of Table 2.15. The result of the estimation is X² = 4.39, DF = 5, P = 0.49, which is not as good as the result obtained with the Yule distribution. We obtain better values by optimising and appropriately pooling the classes, viz. n = 0.9536, b = 1.9179, X² = 3.46, DF = 5, P = 0.64. However, as one degree of freedom was lost in this way, the Yule distribution still fits better. On the basis of this approach to capturing the distribution of the class frequencies (the spectrum), the following problems can be tackled:
(1) What characterises the texts which can be described using the Yule distribution, and which kinds of texts require the Waring distribution?
(2) What is the relation between the parameters of the distribution and the texts' vocabularies?
(3) Are there intervals within which the parameter values vary in dependence on text sorts?
(4) If both distributions fail and the approach should be maintained, how can the Waring distribution be generalised?

According to our proposal in § 2.4, these distributions follow from the approach

(2.5.17)   D = 1 - \frac{a}{x+d}
where a = b + 1, d = b for the Yule distribution and a = b + 1, d = b + n - 1 for the Waring distribution, so that there are several ways to generalise them. The most common one is perhaps the Generalised Hypergeometric distribution Type IV by Kemp and Kemp (1956) (cf. also Johnson, Kotz 1969: 158-160), but the above approach already offers a sufficiently large number of linguistically well interpretable extensions. A part of them is shown in Zörnig, Altmann (1995), the complete family is presented in Wimmer, Altmann (1999), and especially the unified theory by Wimmer and Altmann (2005) shows the embedding of all these distributions in a linguistic framework.
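A minimal Python sketch of the Waring fit (our illustration; the function name is ours), using the estimators (2.5.15) and the recurrence (2.5.16) with the data of Table 2.15, where the open class is counted as x = 11 when computing the mean:

def waring_expected(fx, n_classes=11):
    # Fit the 1-displaced Waring distribution to a frequency spectrum
    # via the estimators (2.5.15) and the recurrence (2.5.16).
    N = sum(fx)
    mean = sum((x + 1) * f for x, f in enumerate(fx)) / N
    denom = mean * fx[0] - N
    b = (mean - 1) * fx[0] / denom        # (2.5.15)
    n = (mean - 1) * (N - fx[0]) / denom
    px = [b / (b + n)]                    # P_1
    for x in range(2, n_classes):
        px.append(px[-1] * (n + x - 2) / (b + n + x - 1))   # (2.5.16)
    npx = [N * p for p in px]
    npx.append(N - sum(npx))              # open last class
    return b, n, npx

fx = [85, 18, 6, 7, 2, 2, 1, 0, 2, 0, 1]
b, n, npx = waring_expected(fx)
print(round(b, 4), round(n, 4))           # approx. 2.29 and 1.05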
2.5.3 Hřebíček’s Reference Law One of the central semantic/pragmatic functions of textual elements is referring to extra-textual objects, properties and events (cf. e.g., Harweg 1974; Palek, Fischer 1977; Halliday, Hasan 1976). It is plausible to assume that also the organisation of reference and co-references in a text abides by laws. One of the reasons is text economy by efficient information structure. Hřebíček (1986) investigated one of the aspects of co-reference organisation. He based his work on two hypotheses:
78 | Shapeless Repetition
The richer the vocabulary of a text the less references will be found; The more sentences constitute a text the more references will be found.
Furthermore, he assumed that the number of references depends on these two factors alone. Denoting

r = number of references;
s = number of sentences;
w = vocabulary richness (types);
n = number of words in the text (tokens, text length),

he formulated a differential equation expressing the change of the number of references in relation to the change of the number of sentences as being proportional to vocabulary size, viz.

(2.5.18)   \frac{\partial r}{\partial s} = aw

and at the same time the change of the number of references in relation to the change of the vocabulary size as proportional to the number of sentences, viz.

(2.5.19)   \frac{\partial r}{\partial w} = bs

from which the function

(2.5.20)   r = csw, \quad c = ab
is obtained. Hřebíček discusses various interpretations of w, the vocabulary richness. For his studies, he chooses the following: let v be the number of word types in the text; then

w = \frac{v}{n}

from which

(2.5.21)   r = cs\frac{v}{n}

follows. Herdan showed (1966: 76) that

(2.5.22)   v = n^a

Inserting this into (2.5.21) yields

r = csn^{a-1}

or simply

(2.5.23)   r = csn^b
where b and c are empirical constants, s the number of sentences and n the number of word tokens in the text. Hřebíček tested his model on data from 10 Turkish texts, of which we show here one example for the sake of illustration (see Table 2.16). With the help of the method of least squares, which we apply to the logarithmic form of (2.5.23), i.e. by minimising

(2.5.24)   Q = \sum_{i=1}^{k} (\ln r_i - \ln c - \ln s_i - b \ln n_i)^2,

we obtain, with

D = k\sum_i (\ln n_i)^2 - \left(\sum_i \ln n_i\right)^2,
Table 2.16: Increase of the number of references in a Turkish text (Hřebíček 1986)

Number of       Number of    Number of        Computed number
sentences si    tokens ni    references ri    of references r̂i
10               58           24               23.82
20              100           46               43.87
30              149           62               61.96
40              246           81               76.59
50              299           82               92.95
60              403           96              106.63
70              491          117              120.74
80              601          141              133.84
90              676          156              147.92
100             786          168              160.66
(2.5.25)   \ln c = \frac{\sum (\ln n_i)^2 \sum (\ln r_i - \ln s_i) - \sum \ln n_i \sum \ln n_i(\ln r_i - \ln s_i)}{D}

b = \frac{k \sum \ln n_i(\ln r_i - \ln s_i) - \sum \ln n_i \sum (\ln r_i - \ln s_i)}{D}

This yields

\ln c = 1.481, \quad c = 4.397, \quad b = -0.151.
The fourth column of Table 2.16 gives the values which were calculated on the basis of these parameters. We determine the goodness-of-fit by means of the coefficient of determination

(2.5.26)   R^2 = 1 - \frac{\sum_i (r_i - \hat{r}_i)^2}{\sum_i (r_i - \bar{r})^2}

where r_i are the observed, r̂_i the calculated references, and r̄ the mean number of references (all of them logarithms). Thus we obtain

R^2 = 1 - \frac{0.0407}{3.2853} = 0.988
R² varies in the interval <0, 1>. The larger R², the more of the variability is explained by the independent variable, i.e. the better the goodness-of-fit. Such a large coefficient of determination as obtained here shows that the model is appropriate. The F-test yields F2,5 = 7.9, which with P = 0.013 is a significant result. By application of this model, the following problems can be investigated: (1) the statistical behaviour of specific reference types (anaphora, cataphora) and forms (pronouns, lexical identity, synonyms, hyper- and hyponyms etc.); (2) differences in the statistical behaviour of references with respect to text sorts; a text typology could be realised on this basis, with the differences appearing in the behaviour of the parameters; (3) differences in the sequence of references with respect to individual languages (keeping other criteria constant).
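A minimal Python sketch of this fit (our illustration; the data are those of Table 2.16) estimates ln c and b by ordinary least squares on the logarithmic form of (2.5.23) and reports R² as in (2.5.26):

import math

s = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]          # sentences
n = [58, 100, 149, 246, 299, 403, 491, 601, 676, 786]  # tokens
r = [24, 46, 62, 81, 82, 96, 117, 141, 156, 168]       # references

# regression of y = ln r - ln s on x = ln n  (model: r = c * s * n^b)
x = [math.log(v) for v in n]
y = [math.log(ri) - math.log(si) for ri, si in zip(r, s)]
k = len(x)
D = k * sum(xi * xi for xi in x) - sum(x) ** 2
b = (k * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)) / D
ln_c = (sum(y) - b * sum(x)) / k
c = math.exp(ln_c)

r_hat = [c * si * ni ** b for si, ni in zip(s, n)]
ybar = sum(math.log(v) for v in r) / k
sse = sum((math.log(o) - math.log(e)) ** 2 for o, e in zip(r, r_hat))
sst = sum((math.log(o) - ybar) ** 2 for o in r)
print(round(c, 3), round(b, 3), round(1 - sse / sst, 3))  # 4.397 -0.151 0.988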
2.5.4 Type-Token Models
One of the most popular text-analytical methods measures the relation between the number of types (i.e. the number of different words) in a text and the number of tokens (i.e. the number of all occurrences of words); in other words, the relation between vocabulary size and text length. Depending on the exact purpose of the investigation, word-forms or lexemes are counted as types. If a measure of vocabulary richness is the aim of a type-token study, word-forms would not be appropriate, of course; a text under study should therefore be lemmatised first. It goes without saying that there is always more than one way in which the corresponding definitions and decisions can be varied in a language, but this does not affect the method as such.

When automatic procedures are used to mechanically determine the tokens, the orthographic criterion for word segmentation is the easiest way, of course. A computer program can identify punctuation marks and white spaces without any problems, such that large quantities of textual material can be processed in a short time. However, this method produces only an approximation to word-form segmentation according to linguistic criteria. Advances in computational linguistics provide the researcher with more appropriate tools: tokenizers and lemmatizers are available for many languages. For our example, we lemmatize the "Erlkönig" manually according to the following rules. A German lexeme consists of: all forms of a noun; all forms of a verb including suppletive forms (such as "am", "are", "be", "is", "was", "were" etc. in English); all forms of an adjective including gradation ("Steigerung") and the adverbial/predicative form; all forms of pronouns and determiners; all forms of numerals including cardinal, ordinal, distributive etc. forms; verb prefixes are re-attached to the word-form where they occur as separate tokens.

Determining type and token numbers in "Erlkönig" following these criteria yields the data as shown in column "T" of Table 2.17. It can be seen that the increase of the number of types flattens already at L = 15 and that this trend becomes stronger and stronger. The central task is now to find a plausible and correct model of this trend. Many authors avoided counting the number of types at each text position and gave the counts only at every 100 or more tokens. Most authors counted word-forms although they intended an estimation of the vocabulary of the text's author, a question which we will re-visit below. The approaches to developing a mathematical model of the type-token behaviour (TTR) can roughly be divided into two groups:

(1) The view of text as a stochastic process. This approach may well have a big future, but until today it has been used only to forecast the text author's vocabulary. We think that this attempt is illusory and overestimates the expressive power of a text (cf. Brainerd 1972; McNeil 1973; Gani 1975). Other authors modelled, with better results, the distributions of word frequency classes using stochastic processes (cf. Simon 1955; Haight, Jones 1974; Lánský, Radil-Weiss 1980).
(2) Derivation of a function from considerations about the flow of information in texts, or finding a well-fitting function by means of trial and error (cf. Herdan 1966; Müller 1971; Maas 1972; Nešitoj 1975; Ratkowsky, Halstead, Hantrais 1980; Tuldava 1980; Orlov, Boroda, Nadarejšvili 1982). These approaches have proved to be more appropriate; we will therefore present this kind of method.

We consider a text during its creation as a system which evolves in a multi-dimensional space T = (P1t, P2t, ...). The dimensions Pit represent the specifications of the text properties i at time t; some of them represent the text lengths. Text length can be measured in terms of the number of chapters, paragraphs, sentences, clauses, words, scenes, entries, breaks etc.; therefore there are always several text lengths at the same time, and each of them is an order parameter which enslaves (cf. Haken 1978) its sub-systems. Thus, while a text is being generated, its current length in words, i.e. the (increasing) number of word tokens, enslaves the number of word types. In a stage play, the (increasing) number of entries may influence the dynamics of some properties of the scenes. One of the tasks of text analysis is exploring the system of interacting dynamics within a text.

We will employ an approach to derive a function as a model of TTR-like dependences which controls numerous linguistic interrelations (cf. Altmann, Schwibbe, Kaumanns, Köhler, Wilde 1988) and plays a basic role in Köhler's self-regulating control cycle (Köhler 1986, 2005), viz. (cf. § 2.4, where D represents the interaction of speaker and hearer)

(2.5.27)   D = \frac{a}{x}

which becomes here

(2.5.28)   \frac{dT}{T} = a\frac{dL}{L}

where T represents the number of types in the text and L the text length (number of tokens). The coefficient a determines the slope of the function which solves the differential equation (2.5.28), viz.

(2.5.29)   T = cL^a
and depends on the kind of the types. If words are under study, the value always varies in the interval 0 < a < 1, a being smaller for lemma types than for word-form types. The constant c depends on how the tokens are counted: if each word token is taken into account, then c = 1; if groups of 10, 100 etc. tokens are registered, then c becomes correspondingly larger. Figure 2.14 shows the graph of a typical word TTR function. Equation (2.5.28) expresses the fact that the relative increase rate of types is proportional to the relative increase rate of text length. This result was already found by Herdan (1966: 76). Analogous considerations are also known from other disciplines such as biology and ethology (cf. Fagen, Goldman 1977). Such inter-disciplinary analogies in the structuring of the relation system size / sub-system size can be taken as strong support for the model.
Figure 2.14. A typical TTR function.
In every relation between a system and its sub-systems, the system exerts dominance and integrative pressure, whereas sub-systems often display a tendency to increase their autonomy. As far as this tendency occurs in texts (where we can consider style, text sort, intended readership etc. as sub-systems), it can cause divergences in the statistical behaviour. Modifications of the basic formula (2.5.27) resp. (2.5.28) can capture such divergences (cf. Tuldava's 1980 formulae). Fitting (2.5.29) to the data from "Erlkönig" yields the result shown in Table 2.17. The function with specified parameter a is

T = L^{0.867}

and the value of the F-test with the data in logarithmic transformation is F1,222 = 6410.43, i.e. a perfect result.
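A minimal Python sketch of such a fit (our illustration; the function name is ours): it accumulates the running type counts of a token sequence and estimates a by least squares on ln T = a ln L, with c = 1 forced as in the text.

import math

def fit_ttr_exponent(tokens):
    # Running type-token data of a text; fit T = L^a (c = 1) by
    # least squares on ln T = a ln L.
    types_seen, pairs = set(), []
    for L, tok in enumerate(tokens, start=1):
        types_seen.add(tok)
        pairs.append((L, len(types_seen)))
    xs = [math.log(L) for L, T in pairs if L > 1]
    ys = [math.log(T) for L, T in pairs if L > 1]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# toy usage with any lemmatised token sequence:
text = "wer reitet so spät durch nacht und wind es ist der vater mit seinem kind".split()
print(round(fit_ttr_exponent(text), 3))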
Table 2.17: Fitting the power function to the data from "Erlkönig". The symbol L stands for token No., T for the number of types observed, and T* for the calculated number of types.

L   T   T*      L   T   T*      L    T   T*      L    T   T*      L    T    T*
1   1   1.00    46  35  27.66   91   56  49.97   136  74  70.79   181  89   90.71
2   2   1.82    47  35  28.18   92   56  50.44   137  74  71.24   182  90   91.14
3   3   2.59    48  35  28.69   93   56  50.92   138  74  71.70   183  90   91.57
4   4   3.33    49  35  29.21   94   56  51.39   139  74  72.15   184  90   92.01
5   5   4.04    50  36  29.73   95   56  51.87   140  74  72.60   185  90   92.44
6   6   4.73    51  36  30.24   96   56  52.34   141  74  73.04   186  90   92.87
7   7   5.40    52  37  30.76   97   57  52.81   142  74  73.49   187  91   93.31
8   8   6.07    53  37  31.27   98   58  53.28   143  74  73.94   188  92   93.74
9   9   6.72    54  37  31.78   99   58  53.75   144  74  74.39   189  92   94.17
10  10  7.36    55  37  32.29   100  59  54.23   145  75  74.84   190  92   94.60
11  11  8.00    56  37  32.80   101  60  54.69   146  75  75.29   191  92   95.04
12  12  8.62    57  38  33.31   102  60  55.16   147  75  75.73   192  92   95.47
13  13  9.24    58  39  33.81   103  60  55.63   148  76  76.18   193  92   95.90
14  14  9.86    59  39  34.32   104  60  56.10   149  77  76.62   194  93   96.33
15  14  10.47   60  40  34.82   105  61  56.57   150  77  77.07   195  94   96.76
16  15  11.07   61  40  35.32   106  62  57.04   151  78  77.52   196  94   97.19
17  15  11.67   62  41  35.82   107  63  57.50   152  78  77.96   197  94   97.62
18  16  12.26   63  42  36.33   108  63  57.97   153  78  78.41   198  95   98.05
19  17  12.85   64  42  36.82   109  63  58.43   154  78  78.85   199  95   98.48
20  18  13.43   65  43  37.32   110  63  58.90   155  78  79.29   200  95   98.81
21  18  14.01   66  44  37.82   111  64  59.36   156  78  79.74   201  95   99.33
22  19  14.59   67  45  38.32   112  65  59.82   157  78  80.18   202  96   99.76
23  19  15.16   68  46  38.81   113  65  60.29   158  79  80.62   203  96  100.19
24  20  15.73   69  47  39.31   114  65  60.75   159  79  81.06   204  96  100.62
25  20  16.30   70  47  39.80   115  65  61.21   160  80  81.51   205  96  101.05
26  21  16.86   71  47  40.29   116  65  61.67   161  80  81.95   206  96  101.47
27  21  17.42   72  47  40.78   117  65  62.13   162  81  82.39   207  96  101.90
28  22  17.98   73  48  41.28   118  65  62.59   163  82  82.83   208  97  102.33
29  22  18.54   74  49  41.76   119  66  63.05   164  82  83.27   209  97  102.75
30  23  19.09   75  50  42.25   120  67  63.51   165  83  83.71   210  98  103.18
31  24  19.66   76  50  42.74   121  67  63.97   166  84  84.15   211  98  103.61
32  25  20.19   77  51  43.23   122  68  64.43   167  85  84.59   212  99  104.03
33  26  20.73   78  51  43.72   123  68  64.89   168  85  85.03   213  99  104.46
34  27  21.28   79  52  44.20   124  68  65.34   169  85  85.47   214  100 104.88
35  28  21.82   80  52  44.69   125  68  65.80   170  86  85.91   215  100 105.31
36  29  22.36   81  53  45.17   126  69  66.26   171  86  86.34   216  101 105.73
37  29  22.90   82  53  45.65   127  69  66.71   172  86  86.78   217  101 106.16
38  30  23.43   83  53  46.14   128  70  67.17   173  87  87.22   218  101 106.58
39  31  23.97   84  54  46.62   129  71  67.62   174  87  87.66   219  101 107.00
40  32  24.50   85  55  47.10   130  71  68.08   175  87  88.09   220  101 107.43
41  33  25.03   86  55  47.58   131  72  68.53   176  87  88.53   221  101 107.85
42  33  25.56   87  55  48.06   132  72  68.98   177  87  88.96   222  101 108.27
43  33  26.08   88  55  48.54   133  73  69.44   178  88  89.40   223  102 108.70
44  33  26.61   89  55  49.01   134  73  69.89   179  88  89.84
45  34  27.13   90  55  49.49   135  74  70.34   180  89  90.27
The study of the type-token relation is linguistically promising because it directs research to a law of text production. It does not cause any problems in the process of systematisation, in particular in the form given above, and it displays a synergetic aspect of texts. It can also be used as a means to discriminate texts with respect to various criteria. In literary studies, TTR investigations are often conducted with the intention of estimating the vocabulary of an author on the basis of a text. We do not encourage such enterprises, because it can be shown that different texts written by one and the same author yield very different estimates, which turn out, furthermore, to be absolutely unrealistic. There is an ad-hoc hypothesis according to which an author makes only a part of his/her vocabulary available for a given text; this idea is also very problematic. We assume that every author of literary texts knows more or less the same number of words of his/her mother tongue, even if there may be differences in specialisation. Other research objects may be much more promising, e.g. estimating the development of children's vocabularies.

Nevertheless, the parameter a can be used to characterise individual texts. The value of this parameter is the larger, the fewer words are repeated, i.e. the larger the increase of the number of new words. The value varies within the open interval (0,1); the value 0 cannot be attained, as this would correspond to an uninterrupted repetition of a single word, with c = 1. And the value 1 cannot be exceeded, as the number of types cannot exceed the number of tokens. For systems-theoretical reasons, the parameter must be smaller than 1. Any text is meant as a vehicle which conveys information, which has to be decoded and extracted by the hearer/reader, whereby the contents should be stored in the hearer's memory at least until the text has been processed. Therefore, the text author is compelled to organise the flow of information in a way which does not overstrain the hearer's capacity to process and store information. The cognitive system is confronted with a similar situation when new facts or rules or a language have to be learnt. If the learner is overwhelmed with more and more new words, there will be little learning success. Repetition is needed so that associations between new information and already known elements can be formed and consolidated. Thus, a text consists of a balanced mixture of new and old items.

There is an obvious analogue in other kinds of living systems. Living systems need input for the preservation of their existence (maintenance input) and information input necessary for survival (danger, reproduction), representing signal input. The hearer needs maintenance input (= repetition) in order to conserve the contents of the text, and signal input for obtaining new information, hence keeping the communication in process. There must be, however, an invariant relationship between the two kinds of input because, according to Berrien (1968: 80), "…an optimum balance must be struck between maintenance and signal input, for without an adequate supply of the former, the latter cannot be processed." The above law expresses exactly this equilibrium.

Although word TTR studies belong to the kind of investigations which are very common and frequently performed, there is only little systematic knowledge of the dependence of the parameter a on textual properties and extra-textual factors. We would need empirical studies on data from different text sorts, authors (categorised according to gender, experience, age etc.), audiences, languages, text lengths, etc. Though research is currently being performed (cf. e.g. Wimmer, Altmann 1999; Köhler, Galle 1993; Tuldava 1998; Covington, McFall 2010; Kubát, Milička 2013 etc.), the number of boundary conditions is so large that no simple model will ever capture the regularity in all possible texts and languages.

Some progress has been made with respect to neutralising the dependence of TTR measures on text length. After a long history of unsuccessful attempts at balancing the influence of text length by theoretically or empirically obtained means by a large number of authors (cf. l.c.), Covington and McFall (2010) presented a method ("MATTR") which determines the type-token ratio of text passages called windows, such that the first window begins with the first token (w = 1) and ends with token w + d, where the window size d was arbitrarily set to 500. The next window is w + 1 to w + d + 1 etc., thus forming a series of moving windows until the end of the text is reached (a minimal sketch of this windowing scheme is given at the end of this subsection). In this way, TTR measures are obtained which are independent of text length and enable fair comparisons between texts of different sizes. But still, the theoretical distribution of TTR measures is unknown, such that the significance of differences cannot be determined in a satisfying way.

The relation between types and tokens was mainly investigated with respect to words. In principle, however, any linguistic unit can be studied from this
point of view, at least any unit which can be and is repeated within a text. Units on a lower level of linguistic analysis, such as sounds, phonemes, syllables and morph(eme)s, as well as units on a higher level than words, such as syntactic constructions, grammatical functions, discourse markers, reference types etc., occur more than once in a text and can therefore be investigated in a similar way as shown above for words. We will see that there are some differences between the individual kinds of units which have consequences for their statistical behaviour and their TTR dynamics.

The main difference between linguistic units seems to be caused by the size of their inventory. The lexical inventory of a language is extremely large and cannot be exhausted even in very long texts; already Orlov (1982) came to the conclusion that more and more hapax legomena are encountered when a corpus is steadily enlarged. As opposed to that, small-inventory units such as phonemes, graphemes and musical notes display a TTR course which rapidly increases and soon reaches its end-point. In these cases, the empirical values cannot be captured using the common word TTR function (2.5.29). Instead, function (2.5.30) comes into play, which was presented in Köhler (2003a,b) and is one of the rare cases where the parameter of the function is determined by the theoretical model instead of being estimated from data:

(2.5.30)   T = \frac{L}{a(L-1) + 1}
As an example, we show the TTR curve of the characters in the first few lines of "Erlkönig". When we do not distinguish upper- and lower-case characters, German has an inventory of 30 letters plus the space. In the 251 tokens of the string 'wer reitet so spät durch nacht und wind es ist der vater mit seinem kind er hat den Knaben wohl in dem arm er fasst ihn sicher er hält ihn warm mein sohn was birgst du so bang dein gesicht siehst vater du den erlkönig nicht den erlenkönig mit kron und' 25 character types occur. According to the theoretical model, the parameter a in (2.5.30) should have the value 1/25 = 0.04 when we start from the empirical inventory, whereas the inventory of the system is 31, which yields a = 1/31 = 0.0322. Estimating a from the data gives the value 0.0367. The graph of the function with this parameter value is shown in Figure 2.15.
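A minimal Python sketch of this character TTR and of the curve (2.5.30) (our illustration; the function names are ours):

def char_ttr(text):
    # Running character type counts T(L) for L = 1..len(text).
    seen, curve = set(), []
    for ch in text:
        seen.add(ch)
        curve.append(len(seen))
    return curve

def ttr_model(L, a):
    # Small-inventory TTR function (2.5.30); 1/a is the inventory size.
    return L / (a * (L - 1) + 1)

text = "wer reitet so spät durch nacht und wind es ist der vater".lower()
curve = char_ttr(text)
a = 1 / curve[-1]                 # a estimated from the empirical inventory
print(curve[-1], round(ttr_model(len(text), a), 2))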
Fig. 2.15. The TTR of the characters in Goethe's "Erlkönig".
Other linguistic units, with an intermediate inventory size, follow still another kind of TTR dynamics. We illustrate this with the type-token relation of syntactic construction types. It was shown that these units cannot be modelled by means of one of the previously presented functions. In Köhler (2003a,b), function (2.5.30) was successfully applied to the TTR of syntactic constructions in German and English texts, which were analysed by means of phrase structure grammars. Fig. 2.16 shows the result of one of the calculations.
Fig. 2.16. The TTR of syntactic construction types in an English text (N06 in the Susanne corpus; cf. Sampson 1995)
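The moving-window measure (MATTR) referred to above can be sketched in a few lines of Python (our illustration, following Covington and McFall's 2010 description; the function name is ours):

def mattr(tokens, window=500):
    # Moving-Average Type-Token Ratio: mean TTR over all windows
    # of fixed size, which removes the dependence on text length.
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)   # single short window
    ttrs = []
    for start in range(len(tokens) - window + 1):
        chunk = tokens[start:start + window]
        ttrs.append(len(set(chunk)) / window)
    return sum(ttrs) / len(ttrs)

The naive loop above recomputes each window's type set; for long texts, updating a dictionary of counts while the window slides makes the computation linear in text length.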
2.5.5 Perspectives
Apparently, and in agreement with Bunge's dictum "Everything abides by laws" (Bunge 1983), not a single phenomenon in language or text can be found that falsifies hypotheses about the lawful behaviour of linguistic units and their properties. Laws are not, of course, deterministic rules; the author of a text is not chained to a Procrustean bed where he/she would not have any freedom in forming the text. The system of laws which play a role in a system, together with the relevant boundary conditions, operates in the form of a plastic control: when an author makes increased use of a certain element, other elements will consequently occur less frequently; when some characteristic of a text becomes prominent, others will necessarily recede. This is exactly what can and must be expected from a systems-theoretical point of view: the properties of the collateral sub-systems of a system co-operate and compete among each other while the system as a whole more or less maintains its balance. The theoretical perspective of quantitative text science is to exactly determine the processes of text generation by investigating the functions and distributions of observable properties. On this basis, it will be possible to unveil the latent self-regulating and self-organising mechanisms. One part of this task is, of course, to study as many texts as possible within a broad spectrum of extra-linguistic factors. The results of these empirical investigations are indispensable for testing theoretical hypotheses and also play the role of heuristic tools for enlarging our factual knowledge about the multitude of textual phenomena and their interrelations.
3 Positional repetition
Any entities can prefer repetition at certain positions in a text, e.g. at the beginning, in the middle or at the end of a frame (sentence, verse etc.). Positions in front of or behind other text entities of the same class can be preferred as well. The best-known phenomena of the first class are the occurrence of a word at the end of the verse and the placing of parts-of-speech in preferred positions in a sentence (cf. Průcha 1967); in linguistics, this preference was called a functional relation. An example of the second kind is the placing of nouns in front of the verb in some languages; this is called a distributional relation. Here we shall not examine poetic figures, because poetics is interested in their uniqueness or their stereotypy. We shall rather scrutinize phenomena displaying a tendency which can be found by means of the methods used in this book.
3.1 The rhyme in "Erlkönig"
Let us consider the phonetic transcription of "Erlkönig" by Goethe as presented by Grotjahn (1979). Apparently, the sound [t] occurs frequently in verse-final position. We are interested in the question whether this is just a reflex of German word-formation and inflection or a special effect in the poem. We will therefore test the null hypothesis that there is no effect but just a regular behaviour of the sound against the alternative hypothesis that [t] occurs more frequently in final position than expected. Counting the frequencies of all sounds in final position in "Erlkönig" yields the following numbers:

Sound   [f]   [m]   [n]   [o]   [r]   [t]
fi       2     2     6     2     2    18
A specific effect in verse-final position can be excluded if we can show that [t] occurs as frequently at the end of words in other places as it does here. Consequently, we determine the frequency of [t] also in the rest of the poem. In "Erlkönig" there are 225 words, out of which 32 are rhyme words. In the rest, i.e. in 193 words, the following final sounds occur:

Sound   [i:]  [e]   [e:]  [ə]   [o:]  [u:]  [f]   [m]
fi        1    1     2    14     4     7     1     6

Sound   [s]   [r]   [l]   [n]   [t]   [ç]   [x]   [ŋ]
fi       12    29    2    50    43    19     1     1
The proportion of words ending with [t] is pt = 43/193 = 0.2228. Hence, in 32 verses, we expect Npt = 32(0.2228) = 7.1296 verses ending with a [t], while we observed 18. Is the difference between Npt and ft significant, i.e. can we conclude the existence of a tendency? The problem can be solved by means of a binomial test. We reformulate our hypothesis and ask what the probability is that out of 32 final sounds 18 or more are [t], if the occurrence probability of [t] is pt. The solution can be formulated as follows. The probability that out of N final sounds exactly x are [t] and the rest, N - x, are not [t], is

(3.1)   P_x = P(X = x) = \binom{N}{x} p^x q^{N-x}

where q = 1 - p and \binom{N}{x} is the binomial coefficient. The sought probability is then

(3.2)   P(X \ge x_c) = \sum_{x=x_c}^{N} \binom{N}{x} p^x q^{N-x} = 1 - \sum_{x=0}^{x_c-1} \binom{N}{x} p^x q^{N-x}.
Now, if P(X ≥ xc) ≤ 0.05, then we consider the tendency of ending the final words with [t] as real. In our case, we have to compute

P(X \ge 18) = \sum_{x=18}^{32} \binom{32}{x} 0.2228^x\, 0.7772^{32-x},

which is identical with

P(X \ge 18) = 1 - \sum_{x=0}^{17} \binom{32}{x} 0.2228^x\, 0.7772^{32-x}.

It is common to apply the second formula. First we compute the first term of the sum,

P_0 = P(X = 0) = 0.7772^{32} = 0.000314

The other terms of the sum are computed using the recurrence equation

(3.3)   P_x = \frac{N-x+1}{x} \cdot \frac{p}{q}\,P_{x-1},

hence in our case

P_1 = \frac{32}{1} \cdot \frac{0.2228}{0.7772}(0.000314) = 0.00288

P_2 = \frac{31}{2} \cdot \frac{0.2228}{0.7772}(0.00288) = 0.01280

etc. In this way we obtain

P(X \ge 18) = 1 - (P_0 + P_1 + \ldots + P_{17}) = 1 - 0.99997 = 0.00003

(The computations have been performed to ten decimal places and rounded.) The probability is much smaller than the critical boundary α = 0.05, hence the conclusion that there is a "t-tendency" in "Erlkönig" has a high probability. This way of computation is exact but, in the case of large N, often tedious without a computer program. For large N the following approximations can be applied:
(a) If p ≈ 0.5, the normal approximation is appropriate:

(3.4)   z = \frac{x - Np}{\sqrt{Npq}}
For instance, with N = 100, xc = 70 and p = 0.49 we obtain according to (3.2) P(X ≥ 70) = 0.00001679, while (3.4) yields

z = \frac{70 - 49}{\sqrt{100(0.49)(0.51)}} = 4.20.

The respective probability that can be found in tables is P = 0.000013307, an acceptable approximation.
(b) If p is very small, an approximation by means of the Poisson distribution is recommended. Instead of (3.2) one computes

(3.5)   P(X \ge x_c) = 1 - \sum_{x=0}^{x_c-1} \frac{e^{-Np}(Np)^x}{x!}

using the recurrence

(3.6)   P_0 = e^{-Np}, \quad P_x = \frac{Np}{x}\,P_{x-1}

The exact test is, of course, the best way when a computer is available.
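A minimal Python sketch of the exact test (our illustration; the function name is ours), using the recurrence (3.3) with the "Erlkönig" data:

def binomial_tail(N, p, xc):
    # Exact P(X >= xc) for X ~ Binomial(N, p), via recurrence (3.3).
    q = 1.0 - p
    px = q ** N                      # P(X = 0)
    cdf = px
    for x in range(1, xc):
        px *= (N - x + 1) / x * (p / q)
        cdf += px
    return 1.0 - cdf

# [t] in verse-final position: N = 32 rhymes, p = 43/193, observed 18
print(binomial_tail(32, 43 / 193, 18))   # approx. 3e-5, far below 0.05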
3.2 Open rhymes
Rhyme words have different functions, and their form is a promising research object. They have special phonic, metrical, grammatical and semantic properties whose appearance at the end of the verse gives the poetic text a special air. These properties, manifesting themselves in repetition tendencies, can be constant with a poet, a historical epoch, a "school", but in the course of time they change their initial prominence and may wander to another extreme. In that case they become irrelevant and discontinue their repetition tendency. In the process of self-organisation, which is present both in language and in its products, this is a quite usual wandering from one attractor to another, which in poetry is called "regularity". An example of such a "regularity" is the use of open rhyme words, i.e. rhyme words ending with a vowel (for Slovak poetry cf. Štukovský, Altmann
1964, 1965, 1966, from which we take the data). The analytic method is the same as in section 3.1. From the work of S. Chalupka (Spevy Sama Chalupku 1921), 206 rhymes have been randomly sampled, out of which 162 end with a vowel and 44 with a consonant. The predominance of open rhymes is evident, but the comparison must be made with a sample of "non-rhyme" words in order to state whether the Slovak language displays the same tendency in general (i.e. a great proportion of words ending with a vowel). An appropriate sample is the rest of the verses (rest words). In this way we obtain for Chalupka:

               Ending with a vowel   Ending with a consonant   Sum
Rhyme words          162                      44               206
Rest words           638                     410              1048
The proportion of rhyme words ending with a vowel is 0.7864, that of the rest words 0.6088. The significance of the difference must be stated by means of a statistical test. Let us symbolize

nvr = number of rhyme words ending with a vowel
nvg = number of rest words ending with a vowel
nr = number of rhyme words
ng = number of rest words
N = number of all words in the poem (N = nr + ng).

Further, let

p_v = \frac{n_{vr} + n_{vg}}{n_r + n_g}, \quad p_{vr} = \frac{n_{vr}}{n_r}, \quad p_{vg} = \frac{n_{vg}}{n_g}

We test the difference pvr - pvg by means of the criterion

(3.7)   t = \frac{p_{vr} - p_{vg}}{\sqrt{p_v(1 - p_v)\left(\dfrac{1}{n_r} + \dfrac{1}{n_g}\right)}}.
Due to the great number of degrees of freedom, the criterion t is normally distributed if the null hypothesis is true. We insert the numbers in (3.7) and obtain

p_v = \frac{162 + 638}{206 + 1048} = 0.6380

and

t = \frac{0.7864 - 0.6088}{\sqrt{0.6380(0.3620)\left(\dfrac{1}{206} + \dfrac{1}{1048}\right)}} = 4.85
Hypotheses of this kind are always considered as one-sided, because we test such a hypothesis only when we observe pvr > pvg. The computed or a still more extreme value of t or z can be obtained with probability P ≈ 6(10⁻⁷), signalizing a clear tendency towards open rhymes. Similar counts have been conducted in texts by 12 Slovak writers (cf. Štukovský, Altmann 1964), viz.

1. S. Chalupka, Spevy Sama Chalupku
2. J. Kráľ, Básne
3. A. Sládkovič, Spevy básnické II
4. J. Botto, Spevy Jána Bottu
5. P. O. Hviezdoslav, Krvavé sonety
6. Š. Krčméry, Keď sa sloboda rodila
7. I. Krasko, Dielo
8. K. Kostra, Ľúbostné verše
9. V. Turčány, Jarky v kraji
10. Š. Žáry, Aká to vôňa
11. A. Plávka, Sláva života
12. J. Stacho, Svadobná cesta

The data and the results are presented in Table 3.1. Except for Kostra, Plávka and Stacho, all authors display a tendency to end the rhyme word with a vowel. However, it has been shown that this tendency changed within the time period from 1840 to 1960: it decreased linearly, and after 1960, when the proportions were approximately equal, the trend to write rhymeless poetry developed (cf. Štukovský, Altmann 1965, 1966).
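A minimal Python sketch of the test (3.7) (our illustration; the function name is ours, the counts are Chalupka's):

import math

def open_rhyme_t(nvr, nr, nvg, ng):
    # Two-proportion criterion (3.7): rhyme words vs. rest words
    # ending with a vowel; approximately standard normal under H0.
    pv = (nvr + nvg) / (nr + ng)
    diff = nvr / nr - nvg / ng
    return diff / math.sqrt(pv * (1 - pv) * (1 / nr + 1 / ng))

print(round(open_rhyme_t(162, 206, 638, 1048), 2))   # 4.85 for Chalupka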
Table 3.1: Frequencies of open and closed rhymes and "rest words" with Slovak writers (according to Štukovský, Altmann 1964)

Author   Rhyme words       Rest words        t       P
         Vowel   Cons.     Vowel   Cons.
1        162     44        638     410       4.85    6(10⁻⁷)
2        128     28        570     312       4.27    9.8(10⁻⁶)
3        348     80        564     404       8.35    3.5(10⁻¹⁷)
4        175     32        603     379       6.35    10⁻¹⁰
5        287     159       1202    1078      4.51    3.2(10⁻⁶)
6        179     76        544     325       2.23    0.0129
7        187     32        680     469       7.38    8(10⁻¹⁴)
8        132     62        656     321       0.24    0.4052
9        152     38        515     292       4.27    9.8(10⁻⁶)
10       187     79        566     410       3.64    0.0001
11       135     78        566     409       1.43    0.0764
12       91      75        573     375      -1.36    (0.0869)
3.3 The gradual climax
Another kind of positional repetition is the so-called climax (cf. Groot 1946). This means a statistically demonstrable increase of a quantitatively expressed property of a unit within the framework of a greater unit. If the property is, e.g., word length and the higher unit is the verse, then a climax is given if word length increases from one position to the next. In the following subsections, we shall scrutinize the linear, the reduced and the exponential climax.

3.3.1 The linear climax
This kind of climax will be exemplified by pantuns, the Malay folk quatrains. An example of a pantun is as follows:

Anak beruk dikayu rěndang,
turun mandi didalam paya.
Hodoh buruk dimata orang.
Cantik manis dimata sahaya.
No tendency to form a gradual climax can be observed in these four verses, regardless of how length is determined, e.g. in terms of the number of syllables. However, in a random sample of 250 verses (cf. Štukovský, Altmann 1965) from a collection of pantuns (Wilkinson, Winstedt 1914), the results presented in Table 3.2 are obtained.

Table 3.2: Frequencies (nij) of word lengths in Malay pantuns

Word length            Position in verse xi
in syllables yj        1       2       3       4
1                      6       -       -       -
2                      181     163     148     131
3                      62      86      97      118
4                      1       1       5       1
Mean length ȳi         2.232   2.352   2.428   2.480
As can be seen, the mean length increases with position; the interesting question is whether this increase can be considered a random effect or not. A corresponding test can be conducted by computing the linear regression of word length as a function of position, i.e. we test whether the observed trend follows the linear function y = a + bx. The classical way to fit a function of this kind to data, estimating the parameters from the data, is the method of least squares. Linear regression, however, is available as a basic function in every statistics program; we will therefore refrain from demonstrating the individual steps here. The independent variable x corresponds to the position values (x = 1, 2, 3, 4), the dependent variable y stands for the mean length values as given in the last line of the table. Computation without weighting yields

y = 2.168 + 0.082x.

The result of fitting the function to the data is very good, viz. R² = 0.97. Since b is positive, we have an increasing trend of word length from the beginning to the end of the verse. The significance of parameter b itself, i.e. whether the increase is significantly larger than 0, can easily be tested; statistical software packages provide the relevant information automatically.
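A minimal Python sketch (our illustration; the function name is ours) reproducing this regression:

def linreg(xs, ys):
    # Ordinary least squares for y = a + b*x; returns (a, b).
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
        / sum((x - xbar) ** 2 for x in xs)
    return ybar - b * xbar, b

a, b = linreg([1, 2, 3, 4], [2.232, 2.352, 2.428, 2.480])
print(round(a, 3), round(b, 3))   # 2.168  0.082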
3.3.2 Reduced climax
Length and other properties manifest themselves, of course, not only within the frame of words but also within other units. When larger units, such as half-verses or clauses, are concerned, the number of these units may be too small to perform a regression analysis. Nevertheless, differences in these variables can also be evaluated. Furthermore, most relevant linguistic units do not display equal lengths; as opposed to poems, the frame units of prose texts vary considerably in length. We will show the procedure again on an example from Malay pantun verses, which do not always consist of four words. Our previously presented method would imply calculating a separate regression for each verse length. Let us consider again a random sample of 25 pantun verses as stated by Altmann and Štukovský (1965) (cf. Table 3.3, first and second column). When we apply the first procedure, the deviation of the mean difference from zero is compared pairwise in dependent samples (cf. Sachs 1972: 242). We compute

d_i = x_{i2} - x_{i1}

(3.15)   \bar{d} = \frac{1}{N}\sum_{i=1}^{N} d_i

(3.16)   s_d^2 = \frac{1}{N-1}\sum_i (d_i - \bar{d})^2 = \frac{1}{N-1}\left(\sum_i d_i^2 - \frac{1}{N}\left(\sum_i d_i\right)^2\right)

then we can perform the t-test according to the formula

(3.17)   t = \frac{\bar{d}}{s_d / \sqrt{N}}
Table 3.3: Length of half-verses in Malay pantuns

No. of syllables        Difference
First    Second         di    di²    Type
4        5              1     1      A
4        5              1     1      A
4        5              1     1      A
4        4              0     0      B
4        4              0     0      B
5        4             -1     1      D
4        5              1     1      A
4        5              1     1      A
5        5              0     0      B
4        5              1     1      A
4        5              1     1      A
4        5              1     1      A
4        5              1     1      A
4        5              1     1      A
4        4              0     0      B
4        5              1     1      A
6        5             -1     1      D
4        4              0     0      B
6        4             -2     4      D
3        5              2     4      A
4        5              1     1      A
5        5              0     0      B
4        5              1     1      A
4        5              1     1      A
4        5              1     1      A
Sum                    13    25
where t is the Student variable with N - 1 degrees of freedom. For the sake of simplicity we write d_i = x_{i2} - x_{i1} (or d_i = x_{i1} - x_{i2}). The values needed for the computation are in the third and the fourth column of Table 3.3. We obtain

\bar{d} = \frac{13}{25} = 0.52   according to (3.15),

s_d = \sqrt{\frac{25 - 13^2/25}{24}} = 0.8718   according to (3.16),

t = \frac{0.52}{0.8718/\sqrt{25}} = 2.98   according to (3.17).
In a two-sided test with 24 degrees of freedom, the result corresponds to P = 0.006, a value which can lead us to the decision that the second half-verse is in fact longer than the first one.

Another method is McNemar's test for the significance of changes (cf. Siegel 1956: 63-67). Here we consider only the direction of the difference, not its size, and choose the following symbols:

A if x_{i2} > x_{i1} (i.e. all positive numbers in the third column of Table 3.3)
D if x_{i2} < x_{i1} (i.e. all negative numbers in the third column of Table 3.3)
B if x_{i2} = x_{i1} (i.e. all zeroes in the third column of Table 3.3)

The symbols are shown in the fifth column of Table 3.3. The test for the significance of the change of length in the second half-verse as compared with the first one can be performed by means of

(3.18)   X^2 = \frac{(|A - D| - 1)^2}{A + D}.

X² is distributed like a chi-square with 1 degree of freedom. For our data we obtain A = 16, D = 3, B = 6, hence

X^2 = \frac{(|16 - 3| - 1)^2}{16 + 3} = 7.58

This result corresponds to P = 0.0059, which is almost identical with the result of the first test. In formula (3.18), the number -1 serves as a correction for continuity. If this correction is omitted, both tests yield identical results.
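Both procedures are easy to program; a minimal Python sketch (our illustration), using the 25 half-verse pairs of Table 3.3:

import math

first  = [4, 4, 4, 4, 4, 5, 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 6, 4, 6, 3, 4, 5, 4, 4, 4]
second = [5, 5, 5, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 4, 4, 5, 5, 5, 5, 5, 5]
d = [b - a for a, b in zip(first, second)]
N = len(d)

# paired t-test, formulas (3.15)-(3.17)
dbar = sum(d) / N
sd = math.sqrt((sum(v * v for v in d) - sum(d) ** 2 / N) / (N - 1))
t = dbar / (sd / math.sqrt(N))

# McNemar-type test for the direction of change, formula (3.18)
A = sum(1 for v in d if v > 0)
D = sum(1 for v in d if v < 0)
X2 = (abs(A - D) - 1) ** 2 / (A + D)
print(round(t, 2), round(X2, 2))   # 2.98  7.58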
3.3.3 The exponential climax
Word-length climax is observed not only in folk poetry; it can be found even in artistic poetry, sometimes in a stronger form. The analysis of some Slovak poems shows a non-linear trend, if a sufficient number of poems is analysed.

Let us consider the word lengths in the poem "Samota" by A. Sládkovič (1820-1872), in which each verse has maximally 8 words. We count in reverse order, i.e. in a verse such as "šľachetnosť pevne poobjíma dušu", the last word has position 8, the last but one position 7, etc. In this way, we obtain the mean positional lengths of words as shown in the fourth row of Table 3.4.
Table 3.4: Mean word length in Sládkovič's poem "Samota" (in syllables)

Position x                  1      2      3      4      5      6      7      8
No. of words in position x  4      25     59     89     94     94     94     94
Total length                5      35     88     141    167    155    195    244
Mean length of words y      1.25   1.40   1.49   1.58   1.78   1.65   2.08   2.60
Computed length ŷ           1.19   1.32   1.45   1.60   1.77   1.95   2.15   2.37
The best fit can be attained by the exponential function

y = ae^{bx} = 1.078\,e^{0.098x}

where x is the position and y the mean word length (cf. Figure 3.1). It is not necessary to take into account the individual word lengths; the means are sufficient. The coefficients a and b can be obtained by means of linear regression. Taking logarithms,

\ln y = \ln a + bx, \quad Y = A + Bx,

and computing

(3.19)   B = \frac{n\sum_i X_i Y_i - \sum_i X_i \sum_i Y_i}{n\sum_i X_i^2 - \left(\sum_i X_i\right)^2}

and

A = \bar{Y} - B\bar{X}
Fig. 3.1. Plot of the exponential model of the word lengths in Table 3.4
where Y is always ln y. The test, yielded automatically by the fitting software, can be performed easily. Let

SSR = \sum_i (\hat{Y}_i - \bar{Y})^2

be the sum of the squared deviations of the computed values from the mean, i.e. the "explained" variance, and let

(3.20)   SSE = \sum_i (Y_i - \hat{Y}_i)^2

be the sum of squared deviations of the observed values from the computed ones, i.e. the "unexplained" variance; then

(3.21)   F_{1,n-2} = \frac{SSR}{SSE/(n-2)}

is the F-variable with 1 and n - 2 degrees of freedom, where n is the number of observations. The computations are demonstrated using the above example and presented in Table 3.5. Formula (3.21) yields

F_{1,6} = \frac{0.405336}{0.042316/6} = 57.47.
Such a great F-value (P = 0.0003) signalizes a real exponential trend. It is to be remarked that the above coefficients a and b were not computed with the above method but iteratively, which yields an improvement of the fit.
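A minimal Python sketch (our illustration) of the log-linear estimation and the F-test (3.21) on the data of Table 3.4; note that these plain least-squares estimates differ slightly from the iteratively optimised values used in the text:

import math

x = list(range(1, 9))
y = [1.25, 1.40, 1.49, 1.58, 1.78, 1.65, 2.07, 2.60]   # mean word lengths

# linear regression on Y = ln y  (model: y = a * exp(b*x))
Y = [math.log(v) for v in y]
n = len(x)
B = (n * sum(xi * yi for xi, yi in zip(x, Y)) - sum(x) * sum(Y)) \
    / (n * sum(xi * xi for xi in x) - sum(x) ** 2)
A = (sum(Y) - B * sum(x)) / n
a, b = math.exp(A), B

Yhat = [A + B * xi for xi in x]
Ybar = sum(Y) / n
ssr = sum((yh - Ybar) ** 2 for yh in Yhat)
sse = sum((yo - yh) ** 2 for yo, yh in zip(Y, Yhat))
F = ssr / (sse / (n - 2))
print(round(a, 3), round(b, 3), round(F, 1))   # a, b and F of the log-linear fit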
Table 3.5: Test of the exponential regression

x    yi     ŷi     Ŷi = ln ŷi    Yi = ln yi    (Ŷi - Ȳ)²    (Yi - Ŷi)²
1    1.25   1.19   0.173953      0.223146      0.121160     0.002420
2    1.40   1.32   0.277632      0.336472      0.059732     0.003462
3    1.49   1.45   0.371564      0.398776      0.022641     0.000741
4    1.58   1.60   0.470004      0.457425      0.002707     0.000158
5    1.78   1.77   0.570950      0.576613      0.002396     0.000032
6    1.65   1.95   0.667829      0.500775      0.021256     0.027907
7    2.07   2.15   0.765468      0.727549      0.059261     0.001438
8    2.60   2.37   0.862890      0.955511      0.116183     0.008579
                   Ȳ = 0.522033                0.405336     0.042316
This trend (climax) is possibly a special kind of secondary rhythm whose carriers are whole words. The straight line or the exponential function are merely first approximations, because one may suppose that the secondary rhythm is very complex. Maybe the right-to-left oriented counting conceals it. Let us consider the analysis of the Slovak poem "Mor ho!" by S. Chalupka, where we separate verses with different numbers of words. For the verses of length 6, 7 and 8 (words) we obtain different mean word lengths, as presented in Table 3.6. Other verse lengths are not frequent enough to be representative.

Table 3.6: Mean word lengths in individual verse positions in the poem "Mor ho!" by S. Chalupka

Verse     Position                                                Number
length    1      2      3      4      5      6      7      8     of verses
6         1.75   1.97   2.18   1.83   2.59   2.66                29
7         1.58   1.64   1.77   1.84   1.81   1.92   2.42         64
8         1.29   1.37   1.37   1.92   1.61   1.63   1.13   2.35  51

The data can be captured by a linear or by an exponential function. Fig. 3.2 shows the results of fitting the function y = ax^b e^{-cx} to the data for verse lengths 6, 7, and 8. The parameter estimations yielded

Verse length    a        b        c        R²
6               1.601    0.1066   0.1166   0.7261
7               1.4462   0.1725   0.1127   0.8328
Fig. 3.2. Fitting a function to the data

The data for verse length 8 cannot be captured in a satisfying way by a function of the kind shown above (cf. Fig. 3.3). Instead, we present an alternative interpretation of the data: we assume that the underlying trend is an oscillation with growing amplitude, which we try to capture by the function

y = A\,e^{-\alpha x}\cos(\omega(x - \varphi)) + \beta.

Fig. 3.3. Data for verse length 8

The different scales of the x-axes make it difficult to compare the settings. The value for x = 2 was omitted (the exact repetition of the value caused a mathematical problem).
3.4 Other positional repetitions
The study of the positioning of parts-of-speech in the i-th position of the sentence was initiated by Průcha (1967). His data cannot be presented here because he published only proportions. Instead, we shall illustrate the problem on the placing of nouns in "Erlkönig". Some units or categories can be distributed more or less freely within a frame such as the sentence or the verse. In poetic texts, we expect that the placement of categories or units is less constrained by grammar than in other kinds of text (cf. e.g. word order), because in poetic texts other kinds of constraints dominate (rhythm, metrical foot, verse length). The verses of "Erlkönig" contain 5 to 9 words (= positions). The placing of nouns at the individual positions is presented in Table 3.7. The search for a trend meets two difficulties here: (a) the frequencies are too small to draw conclusions on their basis; (b) the verses have different lengths, hence position 5 of a verse with five words is not the same as position 5 in a verse with 9 words. In order to overcome these two difficulties, we proceed as follows: we form relative intervals separately for each verse length. The upper boundary is defined as

\frac{\text{position in verse}}{\text{number of positions in verse}}.
Table 3.7: Frequency of nouns in individual positions in "Erlkönig"

Verse     Position
length    1    2    3    4    5    6    7    8    9
5         1    2    -    -    1
6         1    7    1    1    2    6
7         -    2    6    2    3    -    2
8         -    3    -    3    -    1    -    3
9         -    -    -    2    -    -    -    -    2
For verses of length 5, we obtain the following upper bounds

1/5, 2/5, 3/5, 4/5, 5/5

or

0.2, 0.4, 0.6, 0.8, 1.0.

For a verse with six words, we obtain

1/6, 2/6, 3/6, 4/6, 5/6, 6/6

or

0.1667, 0.3333, 0.5000, 0.6667, 0.8333, 1.0000

etc. The intervals of the shortest verse are taken as the norm, and we obtain (0, 0.2>, (0.2, 0.4>, (0.4, 0.6>, (0.6, 0.8>, (0.8, 1.0>. The frequencies in all the verses, regardless of their lengths, are ascribed to these intervals. For example, the frequencies of nouns placed in the fifth or sixth position in a verse with six words fall in the interval (0.8, 1.0>. Thus, we obtain a new distribution of nouns in 5 positions:

Relative position    1    2    3    4    5    N
Frequency            2    14   14   5    16   51
As can be seen, the nouns are not uniformly distributed. We test the uniformity using the information statistic

(3.22)   2I = 2\sum_{i=1}^{n} n_i \ln\frac{n_i}{E_i}

where the n_i are the individual observed frequencies and E_i is the expected frequency. Under the hypothesis of uniformity, we expect for each position

E_i = \frac{N}{n} = \frac{51}{5} = 10.2

Formula (3.22) can be written explicitly as

2I = 2\sum_i n_i \ln n_i - 2\sum_i n_i \ln E_i = 2\sum_i n_i \ln n_i - 2N\ln N + 2N\ln n

Computation yields

2I = 2(2 \ln 2 + 14 \ln 14 + 14 \ln 14 + 5 \ln 5 + 16 \ln 16) - 2(51)\ln 51 + 2(51)\ln 5
   = 255.37702 - 401.04621 + 164.16267 = 18.49.
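A minimal Python sketch (our illustration; the function name is ours) of the statistic (3.22):

import math

def info_statistic(freqs):
    # 2I statistic (3.22) against a uniform expectation; asymptotically
    # chi-square distributed with len(freqs) - 1 degrees of freedom.
    N, n = sum(freqs), len(freqs)
    E = N / n
    return 2 * sum(f * math.log(f / E) for f in freqs if f > 0)

print(round(info_statistic([2, 14, 14, 5, 16]), 2))   # 18.49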
Here, 2I is distributed like a chi-square with n - 1 = 4 degrees of freedom, and the computed value 18.49 corresponds to the probability P < 0.001; hence we can confidently accept the hypothesis of non-homogeneity. That means there are possibly positions at which the writer avoids or prefers nouns. Though we know that the occurrence of a noun in position i is not independent of its occurrence in positions i-1, i-2, …, we cannot take the real dependencies into account, because our data (1) fall in relative intervals and (2) are the results of pooling several verses, hence no or only extremely weak dependencies exist. Therefore, we consider the positions mutually independent. The question whether a certain position displays a positive or a negative tendency can be tested using the known binomial test (see Chapter 2) as follows. The probability that a noun falls in one of the five intervals is p = 1/5 = 0.2. We set the critical boundary at α = 0.05 and consider all positions for which P(X ≥ x) ≤ 0.05 as bearers of a "nominal tendency" and, on the contrary, those where P(X ≤ x) ≤ 0.05 as bearers of an "anti-nominal tendency", i.e. we set up a 90% confidence interval. In practice, we seek a value x_o for which

\sum_{x=x_o}^{N} \binom{N}{x} p^x q^{N-x} \le 0.05

and a value x_u for which

\sum_{x=0}^{x_u} \binom{N}{x} p^x q^{N-x} \le 0.05.

Here we make a concession to convenience: we consider not only p but also N as equal for all positions, though this does not hold for the first position: here N cannot be 51 but only 32 (the number of verses), while it is irrelevant for the other positions, where several positions were pooled. For the first position, we need no interval, merely the lower bound.
Computation yields:

P_0 = q^{32} = 0.8^{32} = 0.00079
P_0 + P_1 = 0.00713
P_0 + P_1 + P_2 = 0.03169

Since \sum_{x=0}^{2} P_x \le 0.05, we consider the first position as "anti-nominal", which is in agreement with the structure of German, where there is mostly an article in front of the noun. Frequency 3 would not be significant anymore, because

\sum_{x=0}^{3} P_x = 0.0931.

For the other positions, we can set N = 51. For p = 0.2 the lower bound is x_u = 5 with

\sum_{x=0}^{5} P_x = 0.042,

and the upper bound x_o = 16 with

\sum_{x=16}^{51} P_x = 0.037,

i.e. positions at which there are 5 or fewer nouns are "anti-nominal", those with 16 or more nouns are "nominal". As can be seen, in "Erlkönig" there are three positions displaying such a part-of-speech tendency, namely the first, the fourth and the fifth.
4 Associative repetition
The concept of association we are interested in differs from that known from psychology (cf. Cramer 1968). While 'psychological' association is observed on elicited linguistic reactions, we will study a relation between two linguistic units or properties whose common occurrence in a text or part of a text lies above the corresponding average value. Common occurrence or coincidence means the appearance of two units (e.g. words) in the same textual frames, which can have different sizes: the smallest frame is the sentence (clause) or verse; larger ones are paragraph, chapter, strophe, text, text sort, etc. The average common frequency is the theoretical mean value of all common frequencies of all pairs of units of the same type as the ones under study in all texts. It goes without saying that such a value cannot be obtained by means of empirical studies; it is rather the result of a theoretical calculation. Such a value is called 'mathematical expectation'.

During the second half of the last century, associative repetitions were studied under various aspects and by means of various methods, of which we can mention here only a few. Osgood (1959) was the first to introduce and illustrate a number of research problems and techniques of analysis. His approach was refined by other researchers; the idea was extended from the study of associated pairs to the association structure of conceptual systems (in psychology) and semantic fields (in linguistics). Some authors analyse only individual texts (cf. Berry-Rogghe 1973; Geffroy, Lafon, Seidel, Tournier 1973), others study sets of texts (Rieger 1971, 1972, 1974; Dannhauer, Wickmann 1972; Tuzzi, Popescu, Altmann 2010). The frame of reference in which coincidence occurs varies from the minimal frame (= word pair) to complete texts; besides, left and right coincidence are also distinguished (cf. e.g. Dolphin 1977). To mention an example, Rieger (1971, 1974) scrutinised poems written in the years 1820-1840 and found a semantic environment of the word "Blüte" characterised by a specific distance measure, as shown in Table 4.1. Similar environments were found also by Geffroy, Lafon, Seidel and Tournier (1973) and visualised (cf. Figure 4.1). Dolphin (1977) developed an association indicator and set up a "lexicogram" for "yeux" in a French text (cf. Figure 4.2).

Table 4.1: Environment of "Blüte" according to Rieger (1971, 1974)

Semantic environment U(i;s) of i = "Blüte"
Duft         3.412    Frühling     2.768    Baum       3.339
Schön/heit   3.641    Rose         3.435    Lenz       3.598
Wiese/Aue    3.971    Garten       3.788    Vogel      3.859
Zärt/lich    3.995    Hold         3.983    Zweig/Ast  3.987
Gras/halm    4.030    Berg/Gebirg  4.006    Traum      4.028
Wunder       4.050    Nachtigall   4.031    Blume      4.042
Lust         4.119    Neu          4.084    Sonne      4.098
Frucht       4.148    Blatt        4.120    Winter     4.137
Wonne        4.306    Liedweise    4.164    Treu/e     4.290
Feld/Gefild  4.365    Hügel        4.341    Herz/en    4.362
Märchen      4.398    Anmut        4.370    Zeit       4.392
Laub         4.410    Quelle/n     4.398    Mai        4.399
Hoffnung     4.439    Eiche        4.428    Bach       4.432
Silber/n     4.444    Liebe/n      4.439    Leise      4.440
Grün/en      4.484    Land         4.457    Früh/e     4.482

Figure 4.1. A restricted environment graph (according to Geffroy, Lafon, Seidel, Tournier 1973)
Figure 4.2. Environment field of “yeux” (according to Dolphin 1977)
4.1 Associative repetition of two words
We will show here how a more thorough method can be developed on the basis of an elementary argument. Let us assume that we scrutinize the association of two nouns A and B in a text. Our example text is "Der Erlkönig" by Goethe. Let noun A be "Vater" and noun B "Erlkönig". If we ask whether "Vater" is associated with "Erlkönig", we must first determine the frame of reference. Let the frame be the verse.

4.1.1 Short texts
In Goethe's poem we may proceed as follows: we set up a table of occurrences of "Vater" and "Erlkönig" in the 32 verses, cf. Table 4.2, where we mark the occurrence of these two words with "+".

Table 4.2: Occurrences of "Vater" and "Erlkönig" in individual verses of Goethe's poem "Erlkönig" (verses without any occurrence omitted)

Verse       2    6    7    13   14   21   22   27   28   29
Vater       +    +         +         +         +         +
Erlkönig         +    +         +         +         +
The only coincidence can be found in verse no. 6. If there is only one coincidence, the presence of an association can be excluded. But intuitively we know that there must be a kind of association. Hence we increase the frame of reference to two verses (half-strophe), as can be seen in Table 4.3.

Table 4.3: Occurrences of "Vater" and "Erlkönig" in half-strophes of Goethe's poem "Erlkönig" (half-strophes without any occurrence omitted)

Half-strophe   1    3    4    7    11   14   15
Vater          +    +         +    +    +    +
Erlkönig            +    +    +    +    +
Now we seek the probability that under the given circumstances: N = 16 half strophes M = 6 occurrences of “Vater” n = 5 occurrence of “Erlkönig” there is a “half- strophe coincidence” of “Vater” and “Erlkönig”. The probability P(X = x) of x coincidences can be computed as follows. The number of all possibilities, to place M cases of “Vater” and n cases of “Erlkönig” in N half strophes is
N N n M a simple combinatorial result. The number of coincidences (= occurrences in a half strophe) can be computed as follows: The x coincidences can be placed in N
Associative repetition of two words | 115
N places in ways; the rest of n-x occurrences of “Erlkönig” can be placed in x
N − x N-x free places in ways, and the rest M-x occurrences of “Vater” can be n − x N −n placed in the remaining N-n places in ways. The number of “favourM − x
able” cases is then
N N − x N − n x n − x M − x from which we obtain the sought probability as
(4.1)
N N − x N − n x n − x M − x P X = x = . N N n M
Reordering the factorials yields
(4.2)
M N − M x n−x P X = x = N n
x = n M
in which we recognize the hypergeometric distribution. Since we want to know the probability of the given or a more extreme event, we obtain
(4.3)
P X ≥ xc =
n M
x = xc
M N − M x n − x N n
In our case we had N = 16, M = 6, n = 5, xc = 4.
.
116 | Associative repetition Hence we obtain
− − P X ≥ = +
− −
= + = + = This probability is smaller than 0.05, hence we can accept the existence of a tendency for association. But this association is not of “first order” because the coincidences do not occur in the minimal frame (verse), but rather of “second order” (half-strophe). The strength of association should not be confused with the size of the frame which represents a different dimension, though both can be combined in one indicator. But a two-dimensional presentation would be better. The computation of (4.3) can sometimes be simplified, considering that
(4.4)
M N − M x n−x . P X ≥ xc = − P X < xc = − N x = n xc −
The first value of the sum yields
(4.5)
M N − M n − N − m N − M − N − M − n + P X = = = , N N − N − n + N
the other ones can be computed by means of the recurrence formula (4.6)
P X = x =
M − x + n − x + P X = x − . x N − M − n + x
Associative repetition of two words | 117
4.1.2 Long texts If the text is long (i.e. N is large), the computation of the hypergeometric distribution is tedious, even with the recurrence formula. In such cases, the fact that under special conditions the hypergeometric distribution converges to the Poisson distribution can be utilised. The computation procedure is as follows: Let pA be the probability of the occurrence of word A in a population; it can be estimated by its relative frequency p A = nA/N. Let pB and p B be the probability and the estimation of B respectively. The relative frequencies can be estimated from the given (long) text; we assume that these values are small. Further, let N be the number of frames in which the coincidence of A and B are scrutinized. Then under the hypothesis of the independence of A and B, their probability of coincidence is pApB and the expected number of frames in which A and B co-occur is (4.7)
NpApB = a.
We consider the coincidences (xc) as Poisson-distributed and draw the following conclusions: (1) If xc > a and (4.8)
P X ≥ xc =
e− a a x ≤ , x x = xc ∞
then we consider the association as significant, i.e. the coincidence is associative. (2) If xc < a and (4.9)
xc
e− a a x ≤ , x x =
P X ≤ xc =
then we consider the coincidence as dissociative. (3)In all other cases we consider the coincidence as neutral. Example. Let us consider five poems concerning Laura by Schiller (Phantasie an Laura; Laura am Klavier; Entzückung an Laura; Das Geheimnis der Reminiszenz; Mel-
118 | Associative repetition ancholie). In N = 117 sentences, “Tod” (death) occurs in nTod = 7 sentences, “Leben” (life) in nLeben = 9 sentences, and they co-occur in xc = 2 sentences. We estimate
= = =
p Tod = p Leben
and the expected number of coincidences is a = NpTodpLeben = 117(0.0598)(0.0769) = 0.5385. Sinc xc > a, we use the formula (4.8) and compute
e− a a x =− x x= ∞
P X ≥ =
e− a a x x x =
a a = − ea + = 1 - e-0.5385(1 + 0.5385) = 0.1021. Since this value is greater than 0.05, we have to do here with a neutral coincidence. For the words “Wange”(cheek) and “Blut” (blood) we have nWange = 6 nBlut = 3 xc = 2. Hence a = 6(3)/117 = 0.1538 and
P(X ≥ 2) = 1 - P(X ≤ 1) = 1 - P0 - P1 = 1 - e.- 0.1538(1 + 0.1538) = 0.0107.
Associative repetition of two words | 119
Since P(X ≥ 2) < 0.05, we conclude that “Wange” and “Blut” are positively associated (the exact probability in this case is 0.0065). Associative analysis calls forth some problems which will be here at least mentioned. (1) Lemmatization. For mechanical processing of texts by means of a computer, a lemmatising programme should be applied first if the language under study possesses inflectional morphology, otherwise the word-forms of each lexeme would be counted as a separate unit. (2)Compounding The constituents of compounds may be counted separately, depending on the language and the exact research question. In the short story “Die ruhelose Kugel” by K. Kusenberg, which will be analyzed below, the following words occur: “Kugel”, “Kugelschütze”, “Kugelhascher”, “Höllenkugel”, “Schütze”, “Schützenverein”. Should “Kugelschütze” be considered a distinct word, although here “Kugel” and “Schütze” have the strongest association? (3)Synonyms When associations between concepts or meanings are analysed synonyms and related meanings may be considered as instants of the same unit. In the mentioned text, e.g. “feuern” and “schießen” (to fire and to shoot) as well as “Kugel” and “Geschoß” (bullet and missile) may be counted as one and the same unit. (4)Hidden concepts Some lexemes may be omitted because they occur only in compounds. For example, in the given text there are words like “Hexenblut” (witch blood) and “Hexenkugel” (witch bullet) but no “Hexe” (witch). One must decide whether they are to be taken into account or not. (5)Key words One can omit all parts-of-speech except for nouns, verbs and adjectives, even modal verbs or verbs which in some phrases do not have a special meaning, e.g. “zu Fall bringen” (to let fall). (6)Homonymy and polysemy How should one analyse polysemantic words such as “Lauf der Pistole” and “Lauf der Kugel”?
120 | Associative repetition
4.2 Presentation After determining the individual associations, there are several ways to visualise the association structure of the text. Here we show only the minimal graph. Since the association strengths are given by means of probabilities, i.e. values in interval , for , it is better to transform them in such a way that 0 is the minimal and 1 is the maximal association. This can be done simply by (4.10)
As W W = −
Pcomputed
α
Thus if we obtain the “empirical” probability P = 0.04 and choose α = 0.05, we obtain As = −
=
from P = 0.004 we obtain
As = −
=
and from P = 0.0004 we obtain
As = −
=
For other different significance levels, the corresponding value forms the denominator in (4.10). The graphic presentation will be appropriately changed as can be seen in Figures 4.3 and 4.4. There are, of course, other methods of normalisation. After calculating the association values by means of this or another method, it is recommendable to set up a symmetric matrix which can be processed also manually.
The minimal (acyclic) graph | 121
4.3 The minimal (acyclic) graph When presenting the association network in form of a minimal graph, merely the strongest associations of words are shown as edges of the graph. The procedure begins with any word A and seeks the word B with the strongest association with A(there may be even several ones). An edge from A to B represents this association. This is the beginning of the tree. Then the next strongest word association is joined. This continues until all words are presented in the graph. In the analysis, which serves only as an illustration, we shall proceed as follows: 1. The text will be lemmatised. 2. Keywords like “Kugel”, “Schütze”, “Hexe” are recognized also in compounds. 3. A conceptual identification of synonyms will not be performed. 4. Compounds will be partitioned only with keywords. 5. Only nouns, verbs and adjectives will be taken into account, modal verbs are excluded. 6. Homonyms will be considered as one lemma. We assume that this method does not lead to distortions. In the text “Die ruhelose Kugel” by K. Kusenberg containing N = 38 sentences, the following autosemantic words occurring at least twice will be scrutinized for associations: Kugel Geschoß Mann Schütze Schuß Bahn Garten Pistole Mensch Ziel Zeit
geraten bringen wissen
18 11 5 5 5 4 3 3 3 2 2
6 6 3
Lauf Welt Kraft Stadt Spiel Zufall Ehepaar Bildnis Hindernis Postkarte Leuchtturmwärter
handeln abfeuern befinden
2 2 2
2 2 2 2 2 2 2 2 2 2 2
122 | Associative repetition
sitzen stehen halten fliegen
2 2 2 2
groß seltsam hoch vereinzelt schwer
schicken anrichten geschehen ausbleiben
2 2 2 2
6 2 2 2 2
The association probabilities have been computed by means of the hypergeometric distribution (Formulas 4.4 to 4.6). Words occurring together with maximally P = 0.09 (i.e. α = 0.09) were considered as associated. The associations computed according to (4.10) are presented in Table 4.4. The text was too short, hence several associations were omitted. In the table, the words are presented as follows
1. vereinzelt 2. halten 3. hoch 4. Spiel 5. Postkarte 6. Bildnis 7. Welt 8. Hindernis 9. wissen 10. schwer 11. Zufall 12. Schuß 13. Leuchtturmwärter
14. handeln 15. ausbleiben 16. befinden 17. groß 18. sitzen 19. Lauf 20. seltsam 21. schießen 22. Ziel 23. Zeit 24 fliegen 25. Schütze 26. Pistole
27. Stadt 28. bringen 29. Kraft 30. Mensch 31. Mann 32. geraten 33. Hexe 34. stehen 35. Bahn 36. Geschoß 37. abfeuern
The minimal graph (acyclic graph) has the special property that from any vertex (word) to any other vertex there is exactly one way. In case that there are several equally strong associations, the graph can take different forms. In such cases it is, perhaps, better, not to use the minimal graph but to show all edges or to
Vistas | 123
apply a more strict association criterion. In Figure 4.3 one can see the minimal graph of the text with α = 0.09, in Figure 4.4 with α = 0.05.
4.4 Vistas Two kinds of associations can be found in a text: General, as can be found also in daily talk. They became the association repertoire of the language community, they form more or less fixed neighbourhoods in the sense of Rieger, and a part of them can be found also in the association books written by psychologists (cf. e.g. Palermo, Jenkins 1964). Special, which are characteristic of the given text. The difference is not categorical but gradual, and it would be difficult to examine how an association observed in a text becomes general. The general associations are not ad hoc creations because they must display a significant strength. They arise by processes whose study could lead to the discovery of laws because not everything can be associated with everything in language. We suppose that the historical transition of a special association to a general one abides by the Piotrowski law (cf. Best, Kohlhase 1983) but such an examination would be extremely difficult. In texts, we find a part of the general associations, the rest of the significant ones are conditioned by the text (theme, text sort), have a certain strength, appear in a given sequence, they form special nets which can be regularly extended, etc. Up to now there are not even hypotheses about these phenomena. A great number of texts has to be studied in order to attain at least the threshold of this phenomenon (cf. Tuzzi, Popescu, Altmann 2010). Very likely, cooperation of linguists and psychologists will be necessary in order to get insight under its surface and describe quantitatively the internal word of a writer or a patient. Nowadays, texts are available in machine-readable form, hence the domain of this research could be strongly extended.
124 | Associative repetition
Figure 4.3.Minimal graph with α = 0.09
Figure 4.4. Minimal graph with α = 0.05.
Vistas | 125 Table 4.4. Association values of word in the text by Kusenberg
2 3 4 5 6 7 8 9 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08
0.99
0.08
0.08
10
0.08 0.08
0.08
0.08
11
13
0.08
16
0.08
0.08
0.9 0.08
0.08 0.08 0.08 0.08
19
20
0.08
0.9 0.08 0.08 0.08 0.08
126 | Associative repetition Table 4.4. (continued) Association values of word in the text by Kusenberg
21
23
24
0.08 0.08 0.08
25
26
28
29
30
31
32
33
34
0.14 0.72 0.55
0.08 0.08
0.08 0.08
0.14
0.08
0.08 0.08
0.58 0.21
0.08 0.46
0.9
0.79
1 2 3 4 5 9 11 12 14 15 17 18 20 21 22 23 27 30 31 32 34 35 36 37
5 Iterative repetition Iterative repetition can be represented by mathematical runs. A run is an uninterrupted sequence of identical elements in a sequence consisting of different elements. The following sequence of letters AA BBB A BB contains 4 runs. In texts, runs occur especially on the form side of the sign. There are e.g. sequences of words which have, say, the same length, or sequences of sentences with the same structure. But sequences of elements with the same meaning are rare. Runs can occur in sequences containing elements of at least two kinds. Sequences with more than two different elements require complex formulas and working with them is intricate. In poetics, the theory of runs has been applied for the first time by Woronczak (1961), later on Fucks (1970) worked with runs. Their use in text analysis has been treated in detail by Grotjahn (1979, 1980). Nowadays, the application of runs is rather common (cf. Wimmer et al. 2003; Altmann V., Altmann G. 2008; Popescu et al. 2009; Popescu, Čech, Altmann 2011). When we analyze a text whose elements are dichotomised, i.e. partitioned in two classes, e.g. “hypotactic sentences” and “all other sentences”, besides many other properties such as their frequency also their succession can be studied. We are, in particular, interested to determine whether their sequential arrangement is random or not. The sequence AAABBB differs from the sequence ABABAB or BAABBA, though all of them contain three A's and three B's. The third sequence seems to contain more randomness than the first two.
128 | Iterative repetition It is, however, plausible to ask which of these sequences is a random succession of symbols, or in which one a tendency is hidden. From this point of view, Fucks (1968, 1970, 1971) examined sequences of long and short sentences, Grotjahn (1980) sequences of stressed and unstressed sequences of syllables in poems, sequences of verse lengths in number of syllables (cf. also Woronczak 1961), sequences of long and short syllables in prose and poetry, sequences of vowels and consonants, as well as sequences of rhythmic patterns in hexameter. This is a very extensive research domain but the theory of runs gives answers only to a part of the questions concerning repetitions.
5.1 Binary sequences Every sequence of linguistic units can be considered as dichotomic by underlining the contrast between a focused unit (A) and all the others (B). If the variables are qualitative (nominal), it is simple, even if there are boundary cases, e.g. the dichotomy vowel vs. consonant, where the semivowels and glides can arbitrarily be ascribed one of these classes. If the variable is quantitative, the mean or the median can serve as a threshold. Let us consider the number of syllables in individual verses of Goethe’s “Erlkönig” as counted by Grotjahn (1979: 144) (see Table 5.1). Table 5.1: Syllabic length of verses in “Erlkönig” by Goethe
Number of syllables x Frequency fx
8 2
9 16
10 7
11 6
12 1
The mean verse length is x̄= 9.625. Let us consider each verse with 9 or fewer syllables as short (S) and those with 10 or more syllables as long (L). Then this special structure of “Erlkönig” can be represented by the sequence SSSS L SSSSSS LLL SSS LLLL SS LLLL S LL SS (5.1) We shall use the following symbols n1 = number of elements of the first kind n2 = number of elements of the second kind r1 = number of runs of S r2 = number of runs of L
Binary sequences | 129
n = n1 + n 2 r = r1 + r2. For the sequence (5.1), we obtain the following numbers: n1 = 18 n2 = 14 n = 32 r1 = 6 r2 = 5 r = 11. Our question is: Is there in the poem a tendency to structure the sequences of short and long verses, or is the observed sequence to be considered random? The question can be specified by asking whether there is a tendency to form many or few runs. If we specify the question in one of these ways we obtain a one-sided hypothesis. Thus, the a priori hypothesis that the author lets a short verse follow rather a short one and a long verse a long one means to expect few runs, because the short verses lump together and also the long ones do it. In this case the probability that one finds the given or still smaller number of runs is of interest, i.e. the value (5.2)
P(R ≤ r)
where R is the variable “number of runs”. The assumption that there is a rather regular alternation of short and long verses corresponds to the question for the probability that one finds the given or a still greater number of runs, i.e. (5.3)
P(R ≥ r) .
Computational effort can considerably be reduced when the hypotheses are formulated not too much a priori. When the expected number of runs is smaller than its mathematical expectation, (5.3) yields a great probability value whereas a great probability is obtained for (5.2) when the expected number of runs is larger than the mathematical expectation. It is, therefore, useful to calculate the mathematical expectation by means of the formula (5.4)
E R = +
nn n n + n = . n + n n
In our example we obtain E R =
+ =
130 | Iterative repetition Since the observed r = 11 < E(R), it is reasonable to ask whether the sequence is random at all, or whether P(R ≤ r) is smaller than the fixed number, e.g. 0.05 which is considered significance level. Let us first consider the second case and compute (5.2). The probability function of R is given as (cf. Mood 1940; Gibbons 1971; Grotjahn 1980)
n − n − r − r − if r is even n n P R = r = n − n − + n − n − r − r − r − r − if r is odd n n for r = 2,3,…,n such that we have to evaluate P(R ≤ 11). According to (5.5),we obtain for r = 2
P R = = = = We compute the next values using recurrence formulas, which make calculations somewhat easier (cf. Grotjahn 1979: 148). We designate the number of iterations r = 2k and calculate (5.7)
P R = k + =
n − k P R = k k
P R = k + =
n − k + n − k + P R = k − k − n − k +
for odd r
and (5.8)
Equation (5.7) yields for r = 3, with k = 1, P R = =
− =
for even r
Binary sequences | 131
For r = 4, we obtain according to (5.8) with k = 2
P R = =
− + − + = − +
For r = 5 using (5.7) with k = 2 P R = =
− =
The values of the next probabilities are P(R = 6) = 0.00004177 P(R = 7) = 0.000181 P(R = 8) = 0.000765 P(R = 9) = 0.002297 P(R = 10) = 0.006700 P(R = 11) = 0.014741. The sum of these probabilities is
P R ≤ = P R = R = r =
As P R ≤ < , we conclude that there is a tendency in the "Erlkönig" for verses to be followed by verses of the same lengths. If r > E(R) we could test the hypothesis of too many iterations and calculate (5.9)
P R ≥ r =
n +
P R = x x=r
n < n
using the recurrence formulas (5.7) and (5.8) in a modified form, viz. (5.10)
P R = k + =
k P R = k + n − k
(5.11)
P R = k − =
k − n − k + P R = k n − k + n − k +
for even r
for odd r .
For the sake of illustration, we calculate a simple case with n1 = 5, n2 = 7, n = 12 for several values of r. In this case, not more than 2n1 + 1 runs are possible; we calculate therefore P(R = 11). As r is odd, we calculate using (5.6), in which the fist part is omitted because of
132 | Iterative repetition
n − = = r − n since binomial coefficients obtain the value 0 if m > n. m
Therefore, we have
P R = =
=
=
If there were, under the above conditions, r = 11 runs, we could conclude that this number is significant high because P(R=11) < 0.05. For P(R = 10), we obtain according to (5.10) with k = 5 P R = =
= −
The sum P R ≥ = + =
is still smaller than 0.05 and hence still significant. For r = 9 with k = 5, we obtain from (5.11)
P R = =
=
and P R ≥ = + =
which is larger than 0.05 and indicates that this number of runs has to be considered as random. It is rather common to test a two-sided hypothesis (“is there a trend?”) without any direction. In this case, and for n1 ≤ n2 ≤ 20, ready-made tables are available, in which the critical values are given (cf. Swed, Eisenhart 1943; Siegel 1956; Bradley 1968; Bortz, Lienert, Boehnke 1990).
Large samples | 133
5.2 Large samples Linguistic samples are usually very large, hence in most cases the approximation to the normal distribution can be applied. In order to test whether the observed number of runs is random, r is transformed to the normal variable as (5.12)
z=
n r − − nn nn nn − n n −
or on the basis of absolute values for a two-sided hypothesis. If z is greater than the critical value z1-α found in the table of the normal distribution, one can consider the number of runs as significantly large. If z < zα, then the number of runs is significantly small; if |z| > z1-α/2, then the number of runs (with two-sided hypothesis) is not random. Let us illustrate the computation with the above numbers. For n1 = 18, n2 = 14, n = 32, r = 11 we obtain z=
− − = . − −
Since this value is greater than 1.96 (= z0.975), we consider the number of runs as not random, a result which is in agreement with the above one. This test serves well even if n1 and n2 are somewhat smaller. In our second example, where n1 = 5, n2 = 7, n = 2, we obtain for r = 10 the value z = 2.85, which is still significant as above. For r = 9, however, z = 1.95, which is smaller than 1.96 and therefore (twosided test) not significant. This test is applicable with good approximation for n1, n2 > 10.
5.3 Comparison of runs in two texts If the expectation and the variance of a random variable are known and the samples are large they can be transformed to a normal variable. This fact can help us to compare two texts for equality. Let us consider two texts, A and B, in which we found the following numbers
134 | Iterative repetition r n n n r n n n
The expectations and the variances following from the distribution of runs are given as (5.4)
E R =
nn + n n
(5.13)
V R =
nn nn − n n n −
hence the quantity (5.14)
z=
rA − rB − E RA − E RB V RA + V RB
is asymptotically normally distributed with N(0,1). Sometimes, a correction for continuity is useful which yields (for rA > rB) (5.15)
z=
rA − rB − − E RA − E RB . V RA + V RB
This criterion can be employed to test whether the number of runs in text A is significantly greater than in text B. For an illustration, let us consider the runs of stressed and unstressed syllables in Goethe’s ballads “Erlkönig” and “Totentanz”. Let A = Erlkönig, B = Totentanz and n1A = 128 (number of stressed syllables in “Erlkönig”) n2A = 180 (number of unstressed syllables in “Erlkönig”) nA = 308 rA = 252 (number of runs in “Erlkönig”)
n = n = n = r =
the corresponding data from “Totentanz”
Runs of more than two kinds of elements | 135
In order to be able to insert in (5.14) and (5.15) we need
+ = − = V RA = + = E RB = − = V RB = E RA =
Inserting these numbers in (5.14), we obtain
z=
− − − = . +
With the correction for continuity, we obtain z = 1.61. With a two-sided test, this value is not significant at α = 0.05 (one cannot test here the one-sided hypothesis) because its critical value is 1.96. Other examples can be found in the references given above.
5.4 Runs of more than two kinds of elements The probability distribution of the number of runs becomes the more complex, the more kinds of entities there are. Also the computations are more difficult. Fortunately, it is always possible to transform the number of runs to the normal variable by means of the usual transformation. If we consider (cf. Bortz, Lienert, Boehnke 1990) (5.16)
m=n–r
and the following definitions k
(5.17)
F = n j n j −
(5.18)
F = n j n j − n j −
j = k
j =
136 | Iterative repetition then (5.19)
E m =
(5.20)
σm =
F n
n − F F F + − nn − n n − n n −
and (5.21)
z=
m − E m
σm
.
Let us illustrate the computation on data from Grotjahn’s (1980) observation of numbers and runs of dactyls, spondees and trochees in 30 verses of “Aeneis”:
nGDFW\OV = nVSRQGHHV = nWURFKHHV = n = r = m = 180 - 127 = 53 F2 = 89(88) + 81(80) + 10(9) = 14402 F3 = 89(88)87 + 81(80)79 + 10(9)8 = 1194024 E(m) = 14402/180 = 80.0111
σm =
− + − = .
Inserting these values in (5.21) we obtain
z=
− = − .
This result is highly significant and shows that the feet are not placed randomly but obey some regularity. Evidently, the author had a rhythmic ideal. Since we considered m = n - r, the result shows a negatively significant tendency: there are too many runs.
6 Aggregative repetition Uninterrupted sequences containing iterations are special cases of “clustering” of identical units. Their extreme frequencies may be signs of strong tendencies. In many cases, a unit, e.g. a certain word, a certain syllable, cannot stay in uninterrupted sequence, but one can observe its frequent occurrence in some places of the text, namely in smaller distances than one would expect on the basis of its frequency. One says that these are “clusterings” of “aggregations“ which manifest themselves in many small and few great distances between the occurrences of the respective unit. The theory of runs is not adequate to state tendencies of this kind, one must use other methods. The study of distances between identical units has been initiated by G.K. Zipf (1949). Today, it is a well developed linguistic domain whose results are satisfactory. The majority of authors used binary units and obtained the geometric distribution of distances (Spang-Hanssen 1956; Yngve 1956; Epstein 1953; Uhlířová 1967) which was derived by Brainerd (1976) from a Markov chain. Herdan (1966) and Králík (1977) starting from other assumptions obtained the exponential distribution, Strauss, Sappok, Diller and Altmann (1984) showed the model of a tendency-free distribution, and supposing a clustering tendency they obtained the negative binomial distribution. A generalization of the random distribution of distances of more units has been developed by Zörnig (1984a,b; 1986).
6.1 Random distances: binary data There are two problems concerning the distribution of distances: First, one can ask whether identical units are placed in random distances. But this is at variance with the hypothesis of Skinner according to whom the uttering of an entity increases the probability of its repetition in short distance (Skinner 1939, 1941). Hence it is necessary to derive the model of this tendency-free, purely random distribution. Second, we have the following problem: if the distances are not quite random but abide by a stochastic law, what is the form of this law? There are surely several answers to this question because the form of the distribution may depend both on some psycholinguistic, communication theoretical or subjective factors and on the kind of the repeated entity, its level in the hierarchy of lin-
138 | Aggregative repetition guistic units, on the text-sort, etc. Hence it is to be expected that here a wide research domain will be developed. Let us consider a unit A in a complete text. The distances between the occurrences of two A will be measured by the number of all different units of the same type (Ā), i.e. if A is a word, then Ā is also a word, if A is a letter, then Ā is also a letter. If there is no Ā between two A-s, then the distance is 0, if there is one Ā, then the distance is 1, etc. One can imagine two neighbouring A like an urn in which one throws Ā balls. However, the distance can be measured also by the number of steps necessary to come from a preceding A to the following A. In such a case, one can simply shift the resulting distribution one step to the right. If in a text there are k occurrences of A, then there are k - 1 gaps between them, i.e. k - 1 urns. For the sake of simplicity we write k - 1 = n. We place in these urns randomly r balls representing the r occurrences of Ā (the text before the first and behind the last occurrence of A will be omitted). Our question is: what is the probability that placing randomly r balls in n urns there will be exactly n0 empty urns, n1 urns each containing exactly one ball, n2 urns each containing exactly 2 balls, etc. The sum of all urns must be n, i.e. n 0 + n1 + … + nr = n and the number of balls is r, i.e. n1 + 2n2 + 3n3 + … + rnr = r. The number of possibilities to place r balls in n urns is nr; the number of possibilities to partition n urns in groups of n0, n1,…,nr is (6.1)
n n n nr
and the number of possibilities to partition r balls in such a way that in each ni urn there will be exactly i balls is (6.2)
r . i ni r nr n
n
If we multiply the “positive” results (6.1) and (6.2) and divide by the number of all possibilities, nr, then we obtain the sought probability as
Random distances: binary data | 139
(6.3)
P n n nr =
nr r
r
i =
i=
n r ∏ ni ∏ i ni
.
Interpreted linguistically, (6.3) yields the probability that between n0 units A there is the distance 0, between n1 units A distance 1, etc. The expected number of urns ni can be computed as (cf. David 1950; Strauss, Sappok, Diller, Altmann 1984) (6.4)
i
r E ni = n − r − i n i n
from which we can compute the individual frequencies stepwise as
E n = n− r n (6.5)
E n = r − r − n E n =
r r − − r − n n
etc., or simply by using the recurrence formula (6.6)
E ni + =
r −i E ni . i + n −
Let us illustrate the procedure using an example from Strauss, Sappok, Diller, Altmann (1984). In the poem written in hexameters, “Poems in Classical prosody, Epistle II: To a Socialist in London” by Bridges, the rhythmic patterns in 300 verses were scrutinized in such a way that a D meant a dactyl, an S a spondee, and the last two feet were omitted because they are always identical. The first verses yielded 1. DSSS 2. SDSS 3. SDSS 4. DDSS 5. SDSS 6. DDSS 7. DSDD 8. SDSS 9. DSSS 10. SSSS
11. SDSS 12. DSDS 13. SDSS 14. SSSS 15. DSSD 16. SDDS 17. SDSS 18. DSSS 19. DSSD 20. DDDS
21. DDDS 22. DSDD 23. DDSS 24. SDSS 25. DSDS 26. DSDS 27. SSDS 28. DDSS 29. DSSS 30. DSSS
140 | Aggregative repetition The distances of the pattern DSSS are 7, 8, 10, 0. The count of all distances between DSSS is presented in Table 6.1. We found k = 65 DSSS pattern, hence the number of urns is n = k - 1 = 64 and the number of “balls” is = 300 - 65 = 235. The expected sizes of individual distances are according to (6.5) and (6.6): = − = E n = + − E n = = + E n = −
etc. (cf. Table 6.1). Table 6.1: Distances between the repetitions of the pattern DSSS in the poem by Bridges Distance i 0 1 2 3 4 5 6 7 8 9 10 11 13 33 ∑
Observed ni 17 13 4 4 6 3 6 2 2 1 2 2 1 1 64
Expected E(ni)
1.58 5.90 10.95 13.50 12.43 9.12 5.55 2.88 1.30 0.52 0.18 0.06 0.02
The difference between the observation and the model is even optically that great that one can reject the hypothesis of randomness without performing a test. The chi-square for which the values in half-brackets are pooled yields an extremely high value X2 = 109.48 with 5 degrees of freedom telling us that there is some clustering trend in the data. The strongest aggregation must, of course,
Models of aggregation trends | 141
be with n0, hence it is sufficient to compare n0 with E(n0) in order to make a decision about aggregation or randomness. To this end we can transform n0 into a normal variable using the well known formula (6.7)
n − E n =z V n
where (6.8)
E n = n − r n
V n = nn − − r + n − r − n − r n n n
(cf. David 1950). For our data we obtain
− = − + − − − This z-value is that great that it excludes randomness and supports Skinner’s hypothesis.
6.2 Models of aggregation trends Now, if the distribution of distances in not quite random, there must be some kind of control in the background. There are several answers to this question and all are quite plausible. We shall mention briefly three solutions, the Brainerd model will be treated in more detail. (a) Herdan (1966: 127-130) and Králík (1977) suppose that the occurrence of a unit A in texts is controlled by a Poisson process leading to the formulas (6.10)
P'x(t) = aPx-1(t) - aPx(t).
The periods between two Poisson events are distributed exponentially according to the probability function f(x) = ae-ax
142 | Aggregative repetition yielding an appropriate model. The parameter a, which can be estimated from the data but can also be freely chosen can be interpreted as “aggregation parameter”. The greater is a, the stronger is the aggregation. The fact that the periods between Poisson events are continuous is not relevant, because it is usual to approximate discrete data by means of continuous models. (b) Epstein (1953), Spang-Hanssen (1956) and Yngve (1956) start from the assumption that unit A occurs in text with probability p; the first occurrence where the computation begins has, of course, probability 1; the following x A units occur each with probability q = 1- p, then again a unit A appears. On the basis of independence we obtain (6.11)
Px = 1qxp = pqx,
x = 0,1,…
i.e. the geometric distribution. One can see that the results (6.10) and (6.11) except for normalizing constant and continuity are identical, since a is the normalizing constant with the integration of the continuous function e-ax from 0 to infinity, while p the normalization constant of the summing of the discrete probability mass function qx from 0 to infinity is, and e-a just as q lay always in the interval . Here one can consider p as aggregation parameter: the greater is p the steeper is the curve. However, one sees that both functions are monotonous decreasing, hence they can be adequate only in places where there is true aggregation, but not for all possible distributions of distances. It is with some parts of speech, e.g. prepositions, directly against the rules of grammar which forbid the occurrence of the same preposition in immediate neighbourhood. A good fitting can be achieved only if one pools some variable classes in intervals in such a way that the first interval has the greatest frequency. Herdan (1966: 127) pooled the distances between the occurrences of the preposition “k” (to) in Pushkin’s Captain’s daughter in intervals 1-20, 21-40,… . This yields, of course, only a feigned impression of aggregation which cannot exist with “k” at all. (c) Strauss, Sappok, Diller and Altmann (1984) did not consider the appearance of the unit A but the placement of units Ā in the space between the occurrence of two A's as a Poisson process. The interspace can again be considered as an urn in which balls are thrown. The urn can react neutrally to this event, i.e. the number of A̅ increases in the urn constantly and regularly. In that case one obtains formula (6.10) and the distribution of distances is Poisson. The interspace can, however, influence the process:
Models of aggregation trends | 143
(i) If an interspace (an urn) the strongr reject new balls that more there are already in it, then we replace a in (6.9) by a function fx(t) = c - bx, i.e. P´x(t) = fx(t)Px(t) + fx-1(t)Px-1(t) and the solution yields the binomial distribution. (ii) If the interspace attracts a new ball the more, the more there are already in it, then we replace a by fx(t) = c + bx which yields the negative binomial distribution. Just this is our case because this behaviour yields few interspaces with many balls, i.e. a small number of great distances, and many interspaces with few balls, i.e. many small distances which represent aggregation. The resulting formula is (6.12)
k + x − k x Px = p q x
x =
where p and k are the aggregation parameters. One can see that (6.11) is a special case of (6.12), when k = 1. While the approaches (a) and (b) are possible only with class pooling or for special units, formula (12) is generally applicable for testing aggregation trends. It is not necessarily monotonically decreasing, it can also be concave with a maximum ≠ 0. Different kinds of linguistic units will have different pairs of k and p and this can help to characterize the “distance behaviour” of linguistic units. The fitting of (6.11) and (6.12) can be performed as follows. Since we analyze aggregations, we can use the first frequencies for estimating the parameters. For the geometric distribution one can take (6.13)
p = f N ; q = − p
where f0 is the frequency of the zero-th class and N is the number of distances, N = ∑fx. For the negative binomial distribution the simplest way is
f f − f f p = − q
q = (6.14)
k=
f qf
or
(6.15)
x s −x x p = s
k =
144 | Aggregative repetition One can, naturally obtain better estimators (cf. Chapter 2) but today it is usual to perform the fitting iteratively using optimization methods (cf. Altmann-Fitter 1999). In that case (6.13) to (6.15) can be used as starting values. Let us illustrate the fitting using the data in Table 6.1 (see Table 6.2). The estimation of p for (6.11) yields
p = 17/64 = 0.2656 q = 1 - p = 0.7344. The fitting using these parameters is presented in the third column of Table 6.2, the optimized fitting is in the fourth column. For the negative binomial distribution one cannot use the estimation in (6.14) because of the “pathological” frequency f2 = 4, hence we use the estimation (6.15). We obtain x = 3.3281 s2 = 12.1892,
from which k =
= −
p =
=
follows. The computed values are presented in the fifth column of Table 6.2 (cf. also Figure 6.1 and 6.2). As can be seen, the geometric distribution is quite sufficient. The optimized negative binomial yields the best chi-square but the number of degrees of freedom is smaller than with the optimized geometric distribution, consequently one obtains a smaller P. Of course, this need not be always the case. Adequate pooling of classes can slightly improve the results but it is not relevant here.
Models of aggregation trends | 145 Table 6.2: Fitting the geometric and the negative binomial distributions to data in Table 6.1
Distance
Geometric
Observed
x
NPx
fx
17 13 4 4 6 3 6 2 2 1 2 2 2 64
Negative binomial
NPx
17.00 12.48 9.17 6.73 4.94 3.63 2.67 1.96 1.44 1.06 0.78 0.57 1.57 p = 0.2656 X2 = 11.04 DF = 8 P = 0.20
Optimised negative binomial
NPx
14.61 11.27 8.70 6.71 5.18 4.00 3.09 2.38 1.84 1.42 1.09 0.84 2.86 p = 0.2283 X2 = 9.78 DF = 10 P = 0.46
NPx
12.63 11.48 9.39 7.39 5.71 4.36 3.30 2.49 1.86 1.39 1.04 0.77 2.19 k = 1.2500 p = 0.2730 X2 = 11.89 DF = 9 P = 0.22
15.24 10.75 8.12 6.27 4.90 3.85 3.04 2.40 1.91 1.52 1.21 0.96 3.82 k = 0.8751 p = 0.1941 X2 = 9.58 DF = 9 P = 0.39
0
0
5
5
10
10
15
15
0 1 2 3 4 5 6 7 8 9 10 11 ≥12
Optimised geometric
0
1
2
3
4
5
6
7
8
9
10
11
12
0
1
2
3
4
5
6
7
8
9
10
11
12
Fig. 6.1. Plots of the fittings in Table 6.2: Geometric distribution (left) and optimised geometric distribution (right)
15 10 5 0
0
5
10
15
146 | Aggregative repetition
0
1
2
3
4
5
6
7
8
9
10
11
12
0
1
2
3
4
5
6
7
8
9
10
11
12
Fig. 6.2. Plots of the fittings in Table 6.2: Negative binomial distribution (left) and optimised negative binomial distribution (right)
6.3 Brainerd’s Markov-chain model The theory of Markov chains is a powerful instrument in language and text research. It allows us to find dependencies in linearly ordered data. This circumstance is for text analysis especially important because a text can always be conceived as a chain of elements; if Skinner’s hypothesis holds true, then there must always be some kind of dependence between the occurrences of the same entity whose character could be captured by this method. Even if the procedure does not give a direct answer to the problem of aggregation, it opens a wide research domain. The theory of Markov chains can be applied also to sequences of non binary data. Here we restrict ourselves to binary ones because we are interested in distances. Let us consider the text as a sequence of A and Ā elements which are called the states of the sequence. One usually calls the state A as 1, the state Ā as 0. In the first 30 verses of the above mentioned poem by Bridges where DSSS is symbolizes as 1 and D̅S̅S̅S̅ as 0 we obtain the sequence 100000001000000001000000000011. Every text can be coded in this way, thus one can scrutinize the sequential properties of any kinds of units. If there are some dependencies, then it must be possible to compute the probability of the occurrence of a unit in the given position on the basis of the knowledge of its predecessors. The conditional probabil-
Brainerd’s Markov-chain model | 147
ity that in position n the unit E appears (= event E occurs) if in the first n - 1 positions known units (E1 to En-1) occur, is (6.3.1)
P(En|E1E2…En-1).
Here we have to do only with two units (events), namely 1 and 0, hence the fact that e.g. En = 1 means that in the n-th position the text has the state 1. The states can generally be symbolized as x (x = 0,1), hence (6.3.1) can be explicitly written as (6.3.2)
P(En = xn |E1= x1, E2 = x2, …En-1 = xn-1)
If the n-th unit is quite independent of the other ones, (6.3.2) reduces to (6.3.3)
P(En = xn),
and the chain of this kind is called Markov chain of zero-th order. If the occurrence of the n-th unit depends only on the immediately preceding unit, then (6.3.2) has the form (6.3.4)
P(En = xn | En-1 = xn-1),
and this is the very Markov chain or the Markov chain of first order. Higher orders can be obtained by increasing the number of relevant predecessors, e.g. the chain of second order (6.3.5)
P(En = xn| En-2 = xn-2, En-1 = xn-1) ,
of third order (6.3.6)
P(En = xn| En-3 = xn-3, En-2 = xn-2, En-1 = xn-1)
etc. If the units are binary, i.e. if there are only two states, 0 and 1, then the sequence of zeroes represents the distance between the ones. The size of this distance is a variable, marked as X. Its probability distribution can be reconstructed form the Markov chain.
148 | Aggregative repetition (a) Markov chain of zero-th order In a chain of zero-th order the units are mutually independent, and the probability of the sequence of units is according to (6.3.1) P(E1) P(E2)…P(En). If we have binary data with two states (0,1) and the first occurs with probability 1, we obtain 1•P(E1 = 0)P(E2 = 0)…P(Ex = 0)P(Ex+1 = 1) or simpler 1•P(0)P(0)…P(0)P(1) = P(0)xP(1)
(6.3.8)
i.e. the probability that the variable X (distance) attains the value x, is P(X = x) = P(1)P(0)x,
x = 0,1,2,…
(6.3.9)
Here P(1) is simply the probability of the occurrence of 1 which was symbolized above as p. P(0) is the probability of 0 which can be symbolized as P(0) = 1 - P(1) = 1 - p = q, hence we obtain the geometric distribution as above: Px = pqx,
x = 0,1,2,…
(b) Markov chain of the first order Here we have the distance 0 if state 1 is followed by state 1, i.e. P(X = 0) = P(Xn = 1|Xn-1 = 1) = P(1|1); the rest consists of one transition from 1 to 0, of x-1 transitions from 0 to 0, and of one transition from 0 to 1, i.e. P(X = x) = P(0|1)P(0|0)x-1P(1|0), i.e. we obtain together
Brainerd’s Markov-chain model | 149
for x = P P X = x = Px = x − P P P for x =
(6.3.10)
Here we have only two parameters, since P ( ) = − P ( )
(6.3.11)
and P ( ) = − P ()
If we write
and
P(1|1) = α P(0|0) = q,
P(1|0) = 1 - q = p,
then (6.3.10) can be written as
(6.3.12)
x= α Px = x − − α pq x =
In this form one can recognize the so called extended displaced geometric distribution. As estimators, the pair
(6.3.13)
P = α = f N − f N P = p = x
or the pair
(6.3.14)
P = α = f N P = p =
f N − f N
can be used. For illustration we fit this distribution to the repetitions of the DSSS pattern in Bridges’ poem, cf. Table 6.3. According to (6.3.13) we obtain
150 | Aggregative repetition
P = f N = = P = − = − P = = P = − = According to (6.3.14) we have
P = = − P = − = The values computed according to (6.3.13) are presented in the third column of Table 6.3. The fitting with the estimators (6.3.14) is worse and will not be presented here. Optimizing the results one obtains the values in the fourth column of Table 6.3. Both variants are graphically shown in Figure (6.3). Table 6.3: Fitting the Markov chain of first order to the data of Bridges x
0 1 2 3 4 5 6 7 8 9 10 11 ≥12
fx
Markov chain of the first order
17 17.00 13 9.40 4 7.52 4 6.02 4.81 6 3 3.85 6 3.08 2.46 2 1.97 2 1.58 1 2 1.26 2 1.01 2 4.01 P(1|1) = 0.2656 P(1|0) = 0.2000 = 7.91 X2 DF =9 P = 0.54
0.2481 0.2081 7.71 9 0.56
Optimised fitting 15.88 10.01 7.93 6.28 4.97 3.94 3.12 2.47 1.96 1.55 1.23 0.97 3.70
As can be seen, the two fittings in Table 6.3 are better than all preceding ones. One sees especially the improvement against the Markov chain of zero-th order. Would we use a chain of second, third etc. order, we would obtain each times a better fit. The decision to prefer the chain of first order must be made by
Brainerd’s Markov-chain model | 151
0
0
5
5
10
10
15
15
means of a test. To this end Brainerd brought the likelihood-ratio criterion, consisting of the ratio of the two chains. For the Markov chain of the zero-th order the likelihood function is
0
1
2
3
4
5
6
7
8
9
10
11
12
0
1
2
3
4
5
6
7
8
9
10
11
12
Fig. 6.3. Fitting the Markov chain of first order to the data of Bridges (cf. Table 6.3): Markov chain of the first order (left) and optimised fitting (right)
(6.3.15)
n
L = ∏ P P x f x x =
and for the first order it is (6.3.16)
n
L = P f ∏ P P x − f x x =
hence the likelihood ratio is f
(6.3.17)
L P P P λ = = L P P P
N − f
P P
Nx
We insert in this formula the maximum likelihood estimators of the individual parameters - which can be replaced by optimized values - and obtain
λ =
L = L
−
= .
152 | Aggregative repetition The quantity 2 ln λ is distributed approximately as a chi-square with 1 DF, and since 2 ln(2.10) = 1.40 we can decide that the chain of first order does not bring any significant improvement. The distances are, of course, not always distributed so as in the above example, because aggregation does not presuppose 0 distances. If one scrutinizes e.g. the distances between the occurrences of “Kind” (with all its references (Sohn, Knabe, du, dir, dich, dein, mein, mir, ihn) in Goethe’s “Erlkönig”, one obtains a distribution presented in Table 6.4. Here one can see that the number of 1-distances is greater than that of 0-distances. The fitting of the geometric distribution (chain of zero-th order) yield a P = 0.27 while a chain of first order yields P = 0.49. The negative binomial yields P = 0.40. Table 6.4: Distribution of distances between “Kind” and its references in Goethe’s “Erlkönig” x
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
fx
0. order
5 9 3 4 2 0 4 2 1 0 0 2 0 2 2 1 37
1. order
7.92 4.39 6.22 5.83 4.89 4.79 3.84 3.93 3.02 3.23 2.38 2.65 1.87 2.18 1.47 1.79 1.15 1.47 0.91 1.21 0.71 0.99 0.56 0.81 0.44 0.67 0.35 0.55 0.27 0.45 1.00 2.07 P̂ = P(1) = 0.2139 P(1|1) = 0.1186 X2 = 11.13 P(1|0) = 0.1787 DF = 9 X2 = 7.40 P = 0.27 DF = 6 P = 0.49
The likelihood ratio yields
λ =
=
Brainerd’s Markov-chain model | 153
from which 2 ln λ = 2.18 follows. Even if the second fitting yields a monotonically decreasing distribution and seems to be more adequate, it does not furnish us with a significant improvement. (c) Markov chains of higher orders The distribution of distances on the basis of a Markov chain of second order yields (cf. Brainerd 1976)
(6.3.18)
P Px = P P P P P x − P
for x = for x = for x =
and on the basis of a Markov chain of third order
(6.3.19)
P P Px = P P P P P x − P
for for for for
The maximum-likelihood estimators for the second order are
P = f N P = − P (6.3.20)
P = f N − f P = − P P = N − f − f Nx − N + f P = − P
and for the third order
x= x = x= x =
154 | Aggregative repetition
P = f N P = N − f − f N P = f N
(6.3.21)
P = f N − f − f P = − P P = N − f − f − f Nx − N + f + f P = − P
The likelihood ratios for a test of individual orders are as follows:
(6.3.22)
(6.3.23)
f
P P P P P P
λ =
f
N − f − f
P P
f
P P P P P P P P
λ =
P • P
Nx − N + f + f
P P
N − f − f − f
Nx − N − f
N − f − f
•
P N − f
Brainerd used this method for analyzing various data and showed in which order some units (articles, pronouns, long words) form their distances. Surely, behind these phenomena some language laws are hidden but their examination is a future task. The theory of Markov chains furnishes us an important instrument. As to aggregations, there is a disadvantageous circumstance for their interpretation. The chain of zero-th order yields a model in which aggregations are peculiarly present according to Skinner’s hypothesis. In places where the aggregations are weaker, one must use chains of higher order which presuppose dependencies. This circumstance can be interpreted as follows. Aggregations do not imply a greater number of zero distances - which are not even admitted with many units - but rather small non-zero distances. Thereby the underlying geometric distribution must be ever more modified. This yields a better fit but at the same time an increase of the number of parameters which cannot be all considered as “aggregation parameters”. In any case, one can compute the underlying dependence degree of the given distance distribution, and this would be very
Non-binary data | 155
important both for text theory and grammar theory. A different use of Markov chains can be found in Grotjahn (1979: 212-219). The choice of a model for a given concrete case presented in chapters 6.1 to 6.3 should not be performed on the basis of better fit but on the best interpretation possible.
6.4 Non-binary data The above mentioned distance models concerned always distances between the occurrences of the same unit, all the other ones have been considered as complementary. But we may also ask how the distances are distributed if there are m different units and we compute the distances between all equal units. The assumption that the distances are distributed randomly can be tested using Zörnig’s (1984a,b) model. But this is quite seldom the case because the rules of grammar, the customs of style and Skinner’s principle of reinforcement “force” the writer to apply some non-random patterns. Though the distances are considered discrete, one can approximate the regularity by means of a continuous model. As well known, reality is neither discrete nor continuous, these are merely our concepts helping us to find orientation in the reality and express the patterns we find. In modelling this phenomenon, we simply assume that the relative rate of change of frequencies - or relative frequencies - dy/y is proportional to the relative rate of change of the distance, i.e. dx/x. We can write (6.4.1)
dy dx . ≈ y x
The proportionality is not constant but a function consisting of the factor of language/unit and a factor expressing the influence of the writer on the distances. The distances cannot be estimated directly by the writer, hence we assume merely a proportional influence on the logarithm of the distance. These two forces can be symbolized as a for language/unit and k log x as the influence of the writer on the distance. Inserting these assumptions in formula (6.4.1) we obtain (6.4.2)
dy a + k x = dx . y x
Solving this differential equation by integration we obtain
156 | Aggregative repetition (6.4.3)
y = cxa + b ln x ,
where b = k/2 and c is the integration constant (cf. Tuzzi et al. 2012). The computations can proceed in different ways: (i) We can consider distance as the number of different units occurring between the repetition of the next identical unit. (ii) We can consider distance as the number of steps necessary to find the next occurrence of the given unit; in practice this simply means the displacing the distances in method (i) one step to the right; in that case parameter c can be estimated directly as f1. (iii) For fitting, one can consider only, say, the first x = 1,2,…,20-30 distances because the rest of the distribution contains mostly zero's and a small number of 1's in irregular sequence. (iv) One considers the complete distribution but omits all those distances x which have frequency zero. Since (6.4.3) yields 0 for x = 0, it is simpler to use method (ii). Further, since we know that both grammatical restrictions and “Skinnerian forces” reduce the distance of repetitions, it is sufficient to consider only a long first part of the distribution, i.e. to use method (iii). In practice, to combine (ii) and (iii). In order to illustrate the procedure, we state the vector of parts-of-speech in the End-of-year speech of the Italian president Einaudi in the year 1949: ].
Here the distance between the first and the next PREP is 2 because we need two steps to achieve the second. Evidently, those units which do not occur any more have an “infinite” distance which is simply omitted. Computing all distances we obtain the vector: [2, 3, 4, 4, 4, 5, 7, 8, 3, 12, 2, 5, 22, 6, 3, 11, 2, 19, 2, 4, 4, 11, 19, 2, 3, 4, 4, 4, 18, 6, 9, 6, 1, 18, 14, 3, 16, 3, 4, 4, 4, 4, 8, 4, 9, 12, 19, 15, 1, 17, 5, 18, 6, 1, 8, 13, 3, 53, 2, 8, 1, 8, 1, 9, 12, 8, 8, 3, 1, 9, 2, 6, 7, 23, 2, 6, 6, 16, 2, 10, 6, 2, 2, 17, 13, 2, 5, 1, 2, 19, 8, 1, 2, 23, 1, 4, 5, 5, 9, 6, 3, 12, 2, 3, 10, 12, 4, 4, 7, 2, 17, 10, 5, 10, 4, 33, 3, 5, 16, 1, 8, 3, 23, 2, 2, 10, 3, 3, 3, 3, 25, 2, 4, 9, 5, 2, 2, 3, 23, 2, 7, 2, 2, 3, 6, 6, 3, 7, 11, 3, 7, 1, 5, 9, 1, 2, 9, 2, 3, 3, 3, 3, 5, 2, 1, 2, 2].
Non-binary data | 157
If we order these distances according to their frequency, we obtain (up to x = 26 because there are only three greater distances, viz. 33, 35, 53, and the rest are zeros) the result presented in Table 6.5. Table 6.5: Fitting the Zipf-Alekseev formula to the distribution of distances between subsequent identical parts-of speech in an Italian text fd
(6.4.3)
1
15
17.52
2
31
28.12
3
29
26.03
d
4
18
21.03
5
12
16.28
6
12
12.46
7
7
9.53
8
9
7.33
9
8
5.68
10
6
4.44
11
4
3.50
12
5
2.78
13
2
2.22
14
2
1.79
15
1
1.46
16
3
1.19 0.98
17
3
18
3
0.81
19
5
0.67
20
0
0.56
21
0
0.47
22
1
0.40
23
4
0.34
24
0
0.29
25
1
0.25
26
1
0.21
158 | Aggregative repetition
Fig. 6.4. Fitting the Zipf-Alekseev formula to the distribution of distances (cf. Table 6.5)
Applying formula (6.4.3) by means of the software NLREG we obtain the expected values in the third column of the table. The coefficient of determination is R2 = 0.93, the parameters are a = 1.2332, b = -0.7946, and c = 17.5240. It is not prolific to test the fitting using the chi-square test because many theoretical frequencies are smaller than 5 (and 1). Since we used a continuous model, the determination coefficient is sufficient. For other distributions in Italian cf. Tuzzi et al. (2012).
6.5 Similarity aggregation If we assume that in text there is a self-stimulation causing the appearance of identical units in sense of Skinner, then we must admit that the unit stimulates also the appearance of a similar unit whereby the similarity may be formal or semantic. The corroboration of this hypothesis would be a strong support of the Skinner hypothesis. However, the testing of a similarity aggregation is associated with some problems, two of which are especially important. (i) Similarity is a very delicate concept that can be measured in different ways (cf. e.g. Bock 1974). One must take into account the kind of data, the features, the problem itself, etc.
Similarity aggregation | 159
(ii) The spontaneity of the text generation is a factor favouring similarity aggregation. If the author writes spontaneously, then the units used have the ability to stimulate something. But if (s)he makes long or many pauses, the stimulus expires and does not evoke similar units. On the other hand, even a text generated spontaneously with great similarity aggregation can be subsequently corrected in such a way that even the traces of this tendency disappear. Hence, one can suppose that one will find similarity aggregation only in few texts, especially folklore texts. However, the consequence cannot be turned around: if one discovers no phonetic similarity aggregation in a text, one cannot conclude that the text has not been generated spontaneously; there can be other, quite different complex similarities that have not been discovered. But one can conclude that in the given respect there was no spontaneity. Here we shall examine only the phonic structure of the Malay epos “Shair Cinta Berahi” (Djadjuli 1961) according to Altmann (1968). Eposes of this kind were generated spontaneously by a story teller who improvised about a given theme. If the epos has been written down in the improvised state, we can assume that verses in near distance to one another are phonetically more similar than distant verses. In other words, the phonetic similarity of verses is a function of their distance. In order to find this function we take into account the fact that the sound inventory of a language is restricted, hence even very distant verses will display some degree of similarity which cannot be smaller than 0, hence this function cannot be a decreasing straight line. That means that the relative change of similarity is not constant with increasing distance but inversely proportional to it, hence it decreases. Expressed formally, this yields (S = similarity, D = distance)
(6.5.1)    $\frac{dS}{S} = -b\,\frac{dD}{D}$

with the solution

(6.5.2)    $S = aD^{-b}$.
In order to test the adequacy of this function for phonetic similarity, we must first define similarity and measure it on data. The phonetic similarity of two verses is a very complex problem that must be simplified. We consider the first two verses of the above-mentioned poem in phonetic/phonemic transcription:
1. děŋarkan tuan suatu cěrita
2. dikaraŋ oleh dagaŋ yaŋ lata

and set up the sets A1 and A2:

A1 = {a1, a2, a3, a4, a5, c, d, ě1, ě2, i, k, n1, n2, ŋ, r1, r2, s, t1, t2, t3, u1, u2, u3}
A2 = {a1, a2, a3, a4, a5, a6, a7, d1, d2, e, g, h, i, k, l1, l2, ŋ1, ŋ2, ŋ3, o, r, t, y}

Here identical sounds are distinguished by subscript numbers. Further, we set up the sets B1 and B2 of subsequent sound pairs:

B1 = {an1, an2, ar, at, cě, dě, ěŋ, ěr, it, ka, ns, nt, ri, rk, su, ta, tu1, tu2, ua1, ua2, uc}
B2 = {ag, aŋ1, aŋ2, aŋ3, ar, at, da, eh, ga, hd, ik, ka, la, le, ŋl, ŋo, ŋy, ol, ra, ya}

which are sufficient as an approximation. Now we form the intersection of A1 and A2, matching equal sounds:

A1 ∩ A2 = {a1, a2, a3, a4, a5, d, i, k, ŋ, r, t}

and from B1 and B2 the intersection of equal pairs

B1 ∩ B2 = {ar, at, ka, ta}.

We define the similarity measure between the two verses i and j as
(6.5.3)    $S_{ij} = \frac{|A_i \cap A_j|}{\sqrt{|A_i| \cdot |A_j|}} + \frac{|B_i \cap B_j|}{\sqrt{|B_i| \cdot |B_j|}}$,

where |x| is the cardinal number of the set x. We obtain from our data

|A1| = 23, |A2| = 23, |A1 ∩ A2| = 11,
|B1| = 22, |B2| = 22, |B1 ∩ B2| = 4.

If we insert these numbers in (6.5.3) we obtain S = 11/23 + 4/22 = 0.4783 + 0.1818 = 0.6601.
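The computation of (6.5.3) is easy to mechanise. The sketch below treats every transcription character as one sound, includes sound pairs across word boundaries (as the sets B1 and B2 above suggest), and uses multiset intersection to reproduce the subscript device for repeated sounds; it returns 11/23 + 4/22 ≈ 0.66 for the two verses above. The helper names are ours.

```python
from collections import Counter
from math import sqrt

def multiset(items):
    # Repeated sounds are kept apart (a1, a2, ...) by counting multiplicities
    return Counter(items)

def inter_size(c1, c2):
    # Size of the multiset intersection
    return sum((c1 & c2).values())

def similarity(v1, v2):
    """Phonic similarity (6.5.3) of two verses given as lists of sounds."""
    A1, A2 = multiset(v1), multiset(v2)
    B1 = multiset(zip(v1, v1[1:]))   # subsequent sound pairs
    B2 = multiset(zip(v2, v2[1:]))
    term_a = inter_size(A1, A2) / sqrt(sum(A1.values()) * sum(A2.values()))
    term_b = inter_size(B1, B2) / sqrt(sum(B1.values()) * sum(B2.values()))
    return term_a + term_b

v1 = list("děŋarkan tuan suatu cěrita".replace(" ", ""))
v2 = list("dikaraŋ oleh dagaŋ yaŋ lata".replace(" ", ""))
print(round(similarity(v1, v2), 3))   # 0.66
```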
In this way we compute S_ij for all n pairs of verses at a given distance d = j − i and obtain the average similarity at distance d as

(6.5.4)    $\bar{S}_d = \frac{1}{n}\sum_{d=j-i} S_{ij}$.
For our case, we randomly selected 150 pairs of verses for each distance from the given epic and computed the mean phonic similarity for distances 1 to 100. Since the rhyme of neighbouring verses represents intentional similarity (in the Malay shair it has the form aaaa/bbbb/…), which would have strongly influenced the similarity indicator, the last four sounds of each verse were omitted. Their inclusion would have resulted in an excessive similarity in the immediate neighbourhood and would have disturbed the results. The results up to d = 10 are presented in Table 6.6.

Table 6.6: Average phonic similarity of verses up to distance d = 10
d     S̄_d (observed)    S̄_d (theoretical)
1     30.06             35.87
2     34.60             35.06
3     34.44             34.59
4     34.77             34.27
5     33.78             34.02
6     33.84             33.81
7     34.01             33.64
8     33.76             33.50
9     33.21             33.37
10    32.91             33.25
The fitted function has the form S̄_d = 35.87·d^(−0.033) (its values are shown in the third column of Table 6.6), and the determination coefficient is R² = 0.87. Both the t-tests for the parameters and the F-test for the regression are highly significant.
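Since (6.5.2) is linear in the logarithms, the parameters can be estimated by ordinary least squares on log-transformed values. The following sketch uses only the ten means of Table 6.6, whereas the fit reported above rests on the means for d = 1,…,100; the resulting parameters will therefore deviate somewhat from those given above.

```python
import numpy as np

d = np.arange(1, 11)
S = np.array([30.06, 34.60, 34.44, 34.77, 33.78,
              33.84, 34.01, 33.76, 33.21, 32.91])   # observed means, Table 6.6

# ln S = ln a - b ln d, so a straight line is fitted to the log-log values
slope, intercept = np.polyfit(np.log(d), np.log(S), 1)
a, b = np.exp(intercept), -slope
print("S_d = %.2f * d^(-%.4f)" % (a, b))
```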
6.6 Vistas

Aggregative repetition is associated with many problems which can be set up and solved only after analyzing many texts. We mention only some of them:
1. Which units display aggregative repetitions? Are they merely phonic, or also metric, formal, grammatical, semantic, metaphoric, etc. units?
2. Is there a development of aggregative repetitions in the history of texts?
3. Do some texts display more aggregative similarities than other ones?
4. Can we infer spontaneity from aggregative similarity?
5. What kind of similarity indicators should be used for individual units?
7 Repetition in blocks

An experiment conducted by Brainerd (1972) can serve as an introductory example: He partitioned a text from Cheever's Wapshot Chronicle into blocks (passages) consisting of 50 words each and examined the distribution of the number of English articles over these blocks (cf. Table 7.1). The numbers in the table should be read as follows: there are 8 passages (of 50 words each) which do not contain any article; there are 14 passages containing 4 articles, etc.

Table 7.1: Distribution of the article in a sample from Cheever's Wapshot Chronicle (Brainerd 1972)

Number of articles in the passage (x):    0   1   2   3   4   5   6   7   8   9   10
Number of such passages (f_x):            8   8   12  11  14  4   8   8   6   2   2
In general, the appropriate block size, which is here 50, depends on the probability of the word under study; the number of blocks obtained depends on the block size and on text length, of course. For a long time, researchers have supposed that there is a law behind these distributions, controlling in some way the occurrence of words in passages on grammatical, communicative, semantic, etc. grounds. The first to examine the repetitions of some Russian words in blocks was Frumkina (1962). She supposed that some words have small frequencies and hence abide by the Poisson distribution (the "law of small numbers"), which was considered an adequate model. However, it has been shown (Altmann, Burdinski 1982) that in 5 of the 12 cases scrutinized by Frumkina the Poisson distribution is not adequate. Brainerd (1972) nevertheless succeeded in showing many good fits of the Poisson distribution to repetitions of articles in passages. The only data set which resulted in an inadequate fit could be modelled by means of a mixed Poisson distribution. Piotrowski (1984: 111-119) provided a survey of results attained by Russian researchers (Maškina 1968; Bektaev, Lukjanenkov 1971; Paškovskij, Srebrjanskaja 1971) who applied the Poisson, the normal and the lognormal distributions and drew conclusions according to the success of fitting the respective distributions to a word. Mosteller and Wallace (1964) used the Poisson and the negative binomial distributions with good results (cf. also Francis 1966). Similar block distributions are obtained not only for words but also for other linguistic units. Köhler (2001), Vulanović and Köhler (2005) and Köhler (2012) have shown that for some syntactic phenomena the negative binomial distribution is adequate. A new model for distributions in blocks was derived by Altmann and Burdinski (1982). This model, which was given the name "Frumkina law", will be presented here.
7.1 The Frumkina law

Let us consider a linguistic unit A, say a morpheme, word, phrase, etc., occurring with probability p. The question whether this p holds for the language as a whole or only for the given text may be postponed. If we do not consider the context, then p is constant in the whole passage. But if we take into account the conditioning of units by the context, we can easily see that p must be zero in some places because the given unit cannot occur there; an uninterrupted sequence such as "finds finds" seems to be very unlikely. On the other hand, there are positions in which A can occur with different probabilities, i.e. p is not constant in a text. But let us start from the assumption that p is constant. Further, let us suppose that in the given passage A can occur maximally n times and that n is not known a priori. Then the probability that in a passage containing maximally n units A, this A will be found exactly x times, can be computed by means of the binomial distribution

(7.1)    $P(X = x \mid p) = f(x \mid p) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n.$
Of course, the probability depends on p, which is now considered a variable with its own probability density f(p), 0 < p < 1. The joint distribution of x and p is then given as
(7.2)    $f(x, p) = f(x \mid p)\, f(p)$,

from which the distribution of x can be derived as the marginal distribution of (7.2), viz.

(7.3)    $f(x) = \int_0^1 f(x \mid p)\, f(p)\, dp$.
However, the distribution of p is associated with a problem. As has been shown by Orlov (in Orlov, Boroda, Nadarejšvili 1982), there is no population consisting of all texts of a language with fixed laws and constant parameters. The chasing of such an entity by corpus linguists resembles Don Quijote's battles. There are only individual texts, which necessarily abide by laws, but their parameters and boundary conditions differ from text to text. Hence the distribution of p cannot be derived in general; for the time being we must content ourselves with a special approach. We assume that p is beta-distributed in the form
(7.4)    $f(p) = \frac{p^{M-1}(1-p)^{K-M-1}}{B(M,\ K-M)}, \quad 0 < p < 1,\ 0 < M < K,$

where $B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$ is the beta function. Inserting (7.1) and (7.4) into (7.3) and integrating, one obtains

(7.5)    $P(X = x) = \binom{n}{x}\,\frac{B(M+x,\ K-M+n-x)}{B(M,\ K-M)}, \quad x = 0, 1, \ldots, n.$
This distribution is called negative hypergeometric (NH) or beta-binomial. It is a special case of the generalised hypergeometric distribution (Type IIA of Kemp, Kemp 1956a; cf. also Johnson, Kotz 1969; Ord 1972; Wimmer, Altmann 1999). It is very elastic and adequate for our purposes. It has a number of special cases; three of its limiting cases are frequently applied in linguistics, namely
i. the binomial distribution, attained with K → ∞, M → ∞ and M/K → p;
ii. the Poisson distribution, attained with K → ∞, M → ∞, n → ∞ and Mn/K → a;
iii. the negative binomial distribution, attained with K → ∞, n → ∞ and K/(K + n) → p.
We are sure that the limiting behaviour and the special cases of probability distributions play an important role in linguistics. Since there are no populations in textology or language, it is reasonable to use the NH distribution only when its limiting cases with fewer parameters are not adequate. Sometimes it happens that a unit is distributed according to the basic model in one text, behaves according to one limiting case in another text, and according to a different limiting case in a third one. This may depend on differences of style and need not be considered a falsification of the model. In all cases one should begin by fitting the simplest distribution, i.e. the one with the smallest number of parameters. The mathematical convergence of some parameters to infinity should be interpreted linguistically as "towards a very great number". Though n cannot be greater than the length of the passage, the passage may be so large that it is practically infinite.
7.2 Testing the Frumkina law

When fitting probability distributions to empirical data we proceed as follows: First the basic model (NH) is tested. To this end we compute initial values of the parameters by means of simple methods; then we improve the fit iteratively using an optimisation method until we obtain the smallest chi-square or the minimal sum of squared deviations. Here we employ the algorithms by Nelder and Mead (1965) and by Hooke and Jeeves (1961). Tedious calculations are not necessary anymore because special software is available which can do this work mechanically (cf. Altmann-Fitter); corresponding algorithms are also implemented in R packages. The initial values of the parameters, which are usually sufficient for fitting, are as follows:
For the Poisson distribution a = x̄, where

(7.7)    $\bar{x} = \frac{1}{N}\sum_x x f_x$  (the mean of the sample);

for the binomial distribution

(7.8)    $n = x_{\max}, \qquad p = 1 - (f_0/N)^{1/n};$

for the negative binomial distribution $M = \frac{\bar{x}^2}{s^2 - \bar{x}}$ and $p = \frac{\bar{x}}{s^2}$, where

(7.9)    $s^2 = \frac{1}{N}\sum_x (x - \bar{x})^2 f_x$  (the sample variance);

and for the negative hypergeometric distribution (cf. Kemp, Kemp 1956b)

(7.10)    $n = x_{\max}, \qquad K = \frac{n(n\bar{x} - \bar{x}^2 - s^2)}{ns^2 - n\bar{x} + \bar{x}^2}, \qquad M = \frac{K\bar{x}}{n}.$
The recurrence formulas for computing the individual probabilities are as follows. For the Poisson distribution with the probability mass function

(7.11)    $P_x = \frac{e^{-a} a^x}{x!}, \quad x = 0, 1, 2, \ldots$

it is

(7.12)    $P_0 = e^{-a}, \qquad P_x = \frac{a}{x}\, P_{x-1}.$

For the binomial distribution it is (q = 1 − p)

(7.13)    $P_0 = q^n, \qquad P_x = \frac{n-x+1}{x} \cdot \frac{p}{q}\, P_{x-1}.$

For the negative binomial distribution with the probability mass function

(7.14)    $P_x = \binom{M+x-1}{x} p^M q^x, \quad x = 0, 1, 2, \ldots$

it is

(7.15)    $P_0 = p^M, \qquad P_x = \frac{M+x-1}{x}\, q\, P_{x-1}.$

For the negative hypergeometric distribution it is

(7.16)    $P_0 = \frac{(K-M)(K-M+1)\cdots(K-M+n-1)}{K(K+1)\cdots(K+n-1)}, \qquad P_x = \frac{(M+x-1)(n-x+1)}{x\,(K-M+n-x)}\, P_{x-1}.$
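The recurrence formulas (7.12), (7.13), (7.15) and (7.16) translate directly into code. The following sketch (with hypothetical helper names) generates the probabilities P_0, P_1, … of the four distributions; with K = 96.1572, M = 11.5778 and n = 9 the last function reproduces the negative hypergeometric column of Table 7.2 below.

```python
import numpy as np

def poisson_probs(a, xmax):
    # (7.12): P_0 = e^(-a), P_x = (a/x) P_{x-1}
    P = [np.exp(-a)]
    for x in range(1, xmax + 1):
        P.append(P[-1] * a / x)
    return np.array(P)

def binomial_probs(n, p, xmax):
    # (7.13): P_0 = q^n, P_x = ((n-x+1)/x)(p/q) P_{x-1}
    q = 1.0 - p
    P = [q ** n]
    for x in range(1, xmax + 1):
        P.append(P[-1] * (n - x + 1) / x * p / q)
    return np.array(P)

def negbin_probs(M, p, xmax):
    # (7.15): P_0 = p^M, P_x = ((M+x-1)/x) q P_{x-1}
    q = 1.0 - p
    P = [p ** M]
    for x in range(1, xmax + 1):
        P.append(P[-1] * (M + x - 1) / x * q)
    return np.array(P)

def neghyper_probs(K, M, n):
    # (7.16): P_0 = (K-M)(K-M+1)...(K-M+n-1) / (K(K+1)...(K+n-1))
    P = [np.prod([(K - M + j) / (K + j) for j in range(n)])]
    for x in range(1, n + 1):
        P.append(P[-1] * (M + x - 1) * (n - x + 1) / (x * (K - M + n - x)))
    return np.array(P)

# NH column of Table 7.2: 110 passages, parameters as fitted there
print(110 * neghyper_probs(96.1572, 11.5778, 9))   # 36.38, 40.95, 22.50, 7.87, ...
```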
The choice of one of the limiting cases can be decided by the following criterion:

(7.17)    if x̄ > s², the binomial distribution should be appropriate;
          if x̄ = s², the Poisson distribution should be appropriate;
          if x̄ < s², the negative binomial distribution should be appropriate.

When optimizing (or stepwise improving) the fitting results, one can observe the change of the parameters and apply the following rules: if K and M increase, the binomial distribution is indicated; if K and n increase, the negative binomial distribution; if K, M and n all increase, the Poisson distribution. These criteria can be combined, and we obtain

(7.18)    K → ∞ ∩ M → ∞ ∩ x̄ > s²            →  binomial d.
          K → ∞ ∩ n → ∞ ∩ x̄ < s²            →  negative binomial d.
          K → ∞ ∩ M → ∞ ∩ n → ∞ ∩ x̄ = s²    →  Poisson d.
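Criterion (7.17) can be wrapped into a small helper; the tolerance used to decide when x̄ and s² count as equal is our illustrative choice and not part of the criterion itself.

```python
def suggest_distribution(xbar, s2, tol=0.1):
    # Criterion (7.17): compare sample mean and variance
    if abs(xbar - s2) <= tol * xbar:
        return "Poisson"
    return "binomial" if xbar > s2 else "negative binomial"

# Frumkina's "bez" data (Table 7.2): x̄ = 1.0727, s² = 0.9947
print(suggest_distribution(1.0727, 0.9947))   # -> Poisson
```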
Let us now consider some examples from various languages. Frumkina (1962) examined the Russian preposition "bez" (without) in 110 passages of 1000 words each from Puškin's works. The results are presented in Table 7.2 and Figure 7.1. Since here x̄ ≈ s², the Poisson distribution can be tried. The computed values are shown in the third column of Table 7.2. As can be seen, the fitting result is very good (P = 0.88). The negative hypergeometric distribution yields better results in all x-points except for x = 4, but it is exactly this class which furnishes the greatest contribution to the chi-square value; small numbers are one of the weak points of the chi-square test. It can be shown that when the parameter n is increased (beginning with n = 4), the fitting result of the negative hypergeometric distribution first improves, then becomes worse and worse.

Table 7.2: Distribution of "bez" (data from Frumkina 1962)

x     f_x    Poisson d. (NP_x)    Negative hypergeometric d. (NP_x)
0     36     37.20                36.38
1     42     40.33                40.95
2     23     21.86                22.50
3     6      7.90                 7.87
4     3      2.70                 2.31
Σ     110

x̄ = 1.0727, s² = 0.9947
Poisson: a = 1.0841, X² = 0.6565, DF = 3, P = 0.8834
Negative hypergeometric: K = 96.1572, M = 11.5778, n = 9, X² = 0.6955, DF = 1, P = 0.4043
Fig. 7.1. Diagrams corresponding to the presentation in Table 7.2: Poisson distribution (left) and negative hypergeometric distribution (right)
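As a check of the testing procedure, the Poisson column of Table 7.2 can be recomputed in a few lines; the parameter a is taken from the table, the tail is folded into the last class (as the expected value 2.70 there indicates), and the degrees of freedom are those given in the table.

```python
import numpy as np
from scipy.stats import poisson, chi2

x = np.arange(5)
fx = np.array([36, 42, 23, 6, 3])        # observed frequencies, Table 7.2
N = fx.sum()                             # 110 passages

a = 1.0841                               # fitted Poisson parameter from Table 7.2
NPx = N * poisson.pmf(x, a)
NPx[-1] = N - NPx[:-1].sum()             # fold the tail (x >= 4) into the last class

X2 = np.sum((fx - NPx) ** 2 / NPx)
print("X2 = %.4f, P = %.4f" % (X2, chi2.sf(X2, 3)))   # X2 = 0.66, P = 0.88
```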
In other cases we may find that the negative hypergeometric distribution yields improving fitting results with increasing parameters, but it is almost impossible to find the end of this progress. Table 7.3 shows the stepwise fitting of the negative hypergeometric distribution to the distribution of the German article "das" in the nominative case in passages from S. Lenz, "Deutschstunde".

Table 7.3: Distribution of "das" in the nominative case in passages from S. Lenz (x̄ = 0.95, s² = 1.68)

              Negative hypergeometric distribution (NP_x)              Neg. binomial (NP_x)
x     f_x    n=8      n=10     n=20     n=50     n=100
0     95     96.22    96.14    95.93    95.96    95.87    95.81
1     58     51.04    52.23    53.88    54.48    54.62    54.86
2     28     27.33    27.40    27.15    26.96    26.93    26.91
3     11     14.19    13.64    12.91    12.60    12.55    12.46
4     3      6.77     6.35     5.86     5.68     5.66     5.60
5     2      2.88     2.72     2.55     2.49     2.50     2.47
6     1      1.04     1.05     1.06     1.07     1.08     1.08
7     1      0.29     0.35     0.42     0.45     0.46     0.46
≥8    1      0.05     0.12     0.24     0.31     0.33     0.35
K            7.02     9.91     24.36    67.63    139.2    -
M            0.87     0.97     1.18     1.21     1.35     M = 1.40, p = 0.59
X²           5.99     4.59     2.09     2.59     2.47     2.35
DF           3        3        3        3        3        4
P            0.11     0.20     0.38     0.46     0.48     0.67
Fig. 7.2. Fitting the negative binomial distribution to the data from Lenz (cf. Table 7.3)
We let the parameter n increase; this also leads to an increase of K but not of M. This is a sign that the negative hypergeometric distribution converges to the negative binomial d., as already indicated by criterion (7.17). The last column of Table 7.3 shows the best fit of the negative binomial distribution. There are cases where criterion (7.17) prefers the negative binomial d. but fitting with increasing n and K does not bring an improvement. In such cases one should apply the negative hypergeometric d. (cf. Table 7.4 and Figure 7.3). With n = 10 we obtain X² = 6.04 (DF = 7); with n = 11, X² = 6.10, and with further increasing n and K the fitting result becomes worse. The negative binomial yields a still worse result, though the deviation is not significant here either. This is caused by the fact that the empirical distribution has three maxima.

Table 7.4: Fitting two distributions to the frequencies of the article in Cheever's Wapshot Chronicle (data from Brainerd 1972)

x     f_x    Negative hypergeometric    Negative binomial
0     8      7.26                       5.38
1     8      9.67                       10.21
2     12     10.63                      12.46
3     11     10.76                      12.39
4     14     10.30                      10.92
5     4      9.42                       8.89
6     8      8.20                       6.83
7     8      6.74                       5.03
8     6      5.09                       3.58
9     2      4.34                       2.48
10    2      1.59                       4.83

NH: n = 10, K = 3.7090, M = 1.4950, X² = 6.04, DF = 7, P = 0.54
NB: M = 3.4998, p = 0.4575, X² = 10.82, DF = 8, P = 0.21
A further example will illustrate that the fitting procedure can be rather complex. Piotrowski, Bektaev and Piotrovskaja (1985: 217) counted the nouns in 400 passages of 25 words each in the Kazakh novel "Puť Abaja" by M. Auezov and obtained the results presented in Table 7.5 and Figures 7.4-7.5. The negative hypergeometric distribution in the third column yields a very good result (P = 0.35) with very large parameter values. But if one increases n to 16, one obtains X² = 10.63, i.e. a slightly worse result than with n = 15; a further increase yields even worse results.
Fig. 7.3. Fitting the negative hypergeometric distribution (left) and the negative binomial distribution (right) to the data from Brainerd 1972 (cf. Table 7.4)
Criterion (7.17) indicates that the binomial distribution could be fitted. The authors (P/B/P) obtained the result in the fourth column, which is slightly worse than that of the negative hypergeometric. If one increases the parameter n of the binomial d., the result slightly improves. According to criterion (7.18), the Poisson distribution should be the best; the best result obtained with it, however, is a = 6.6757, X² = 17.28 (DF = 12), P = 0.14. By appropriately pooling some frequency classes (an acceptable technique), e.g. the first two, one can improve the results considerably, whereas pooling is not necessary with the negative hypergeometric d. Several other tests of the Frumkina law for various languages can be found in Altmann, Burdinski (1982) and Piotrowski, Bektaev, Piotrovskaja (1985). Data concerning the distribution of "kai" in sentences of the works of Isocrates can be found in Morton, Levinson (1966). There, the text block (passage) size is not a constant quantity, since sentence length varies within Isocrates' work. In the same way, Altmann and Burdinski (1982) partitioned a (printed) text page by page; this is a practical approximation if one works without a computer. All the results presented in the publication by Morton and Levinson are very good and show that in those cases the easiest way to fit a distribution to the data is the application of the negative binomial d.
Table 7.5: Distribution of nouns in passages of the novel "Puť Abaja" by M. Auezov (according to Piotrowski, Bektaev, Piotrovskaja 1985)

x     f_x    Negative hypergeometric    Binomial
1     3      0.56                       0.63
2     5      3.37                       2.96
3     7      10.58                      9.71
4     24     23.15                      22.88
5     40     39.20                      41.21
6     52     54.28                      58.87
7     64     63.32                      68.48
8     66     63.24                      66.03
9     47     54.44                      53.44
10    48     40.37                      36.64
11    24     25.57                      21.44
12    14     13.59                      10.72
13    3      5.87                       4.60
14    2      1.95                       1.68
15    1      0.50                       0.69

NH: n = 15, K = 25.2950, M = 11.18, X² = 9.80, DF = 9, P = 0.35
Binomial: n = 25, p = 0.3, X² = 13.69, DF = 10, P = 0.19
Fig. 7.4. Fitting the negative hypergeometric distribution to the data from the novel “Puť Abaja” by M. Auezov (cf. Table 7.5)
Fig. 7.5. Fitting the binomial distribution to the data from the novel “Puť Abaja” by M. Auezov (cf. Table 7.5)
7.3 Vistas

What has been said in this chapter allows for the following conclusions. The model presented here shows that text laws can currently be set up only on the basis of assumptions which may later prove to be lucky or unlucky shots. The choice of the beta distribution for the probability density of p is merely a first approach. In the same way one could choose a discrete probability mass function for n, but so far there has been no need for it. Moreover, there is obviously more than one way to model a phenomenon; the alternatively applicable models are not necessarily contradictory but can often be considered special or limiting cases of a background model. This fact can be very useful in developing a text theory, because an entity can behave very differently from text to text, which can be interpreted as an indicator that the given model is characteristic not of the given entity but of the given individual text or text sort. Probability distributions with several parameters have both advantages and disadvantages. The disadvantage of complex parameter estimation and tedious computing is eliminated by the use of optimization techniques and computers, but the disadvantage of a difficult interpretation of the parameters remains. The parameters are surely linked with the properties of textual units, and it will take long and tedious work with many texts in many languages until we are able to interpret them. The advantage is
that probability distributions with more parameters are more elastic and can more easily be fitted to "pathological" data. At the same time they force us to search for interrelations, which are the sources of a future text theory.

Piotrowski (1984) mentions the following possible applications of a theory of repetitions in blocks:
i. mechanical ascription of a word to a word class;
ii. identification of terminologically and semantically dominant units;
iii. measurement and assessment of the stylistic individuality of the given text;
iv. diagnostics of centres of psychic diseases (cf. Paškovskij, Srebrjanskaja 1971);
v. construction of learning automata.

Solutions to the following problems can be sought by means of the above complex model:
i. How do the parameters change with increasing block length?
ii. Are there words or parts of speech which prefer one of the four distributions in all texts?
iii. Which properties of texts are the parameters of the distributions connected with? Such properties are e.g. word frequency, polysemy, text sort, etc.
iv. Can the negative hypergeometric distribution be derived from the approach in Chapter 2.4, or how is it related to the unified theory (cf. Wimmer, Altmann 2005)?
v. Can the probability function of p in (7.3) and (7.4) be chosen differently? If so, how?

Every new law sheds new light on the phenomena in texts and provokes new research questions. It is, therefore, also an important task to collect as many data as possible which describe text phenomena of all kinds and can be used to test the models and, at the same time, to determine the parameters of these models.
8 Parallel repetitions

The best-known kind of parallel repetition is the end rhyme. It can also be subsumed under positional repetition, but that type is rather the repetition of identical units in given positions. The rhyme is not very interesting from the point of view of a statistical analysis because it is consciously and deterministically formed. Parallelism is rather a similar structuring of two hyper-units, not necessarily requiring the identity of constituents. As opposed to positional repetition, parallel repetition describes two or more structures which display identity with respect to a property, while the constituting units concerned need not be identical. What can be observed is rather the mirror image of an idea, an image of a grammatical structure or of a phonetic structure, frequently a combination of several of these variants placed in succession. Such parallel structures can serve to enhance an expression, to introduce a metaphor, or to transfer an entity into a new domain. Parallelism can probably be found everywhere in folk poetry (for a survey see e.g. Newman, Popper 1918), in magic formulas, in litanies, in proverbs, and also as a stylistic means in modern texts. Individual parallelisms can easily be identified, whereas a tendency to parallelism is not as easily detected, i.e. it is not necessarily found in every individual case. A tendency to parallelism can be found only statistically. We present here three simple methods for this purpose. They will be exemplified on Malay pantuns, folk poems consisting of four verses, which, according to Wilkinson (1907), are constructed in such a way that the second half-strophe represents a parallel assonance structure of the first half-strophe. Assonance is defined here as the identity of the two vowels of Malay basic words, of which 84% are disyllabic. The pantun discussed in Chapter 3.3 can be coded in the following way:

 1 Anak      2 béruk     3 (di)kayu    4 rĕndang
 5 Turun     6 mandi     7 (di)dalam   8 paya
 9 Hodoh    10 buruk    11 (di)mata   12 orang
13 cantik   14 manis    15 (di)mata   16 sahaya
The numbers give the individual positions in which assonance will be sought; the parentheses contain prefixes. The word in the 16th position has three syllables; nevertheless it displays an assonance with the word in the 8th position. As can be seen, the positional pair (i, i+8) does not always display an assonance, e.g. (1,9): anak/hodoh, or (2,10): beruk/buruk; on the other hand, (1,7,8,11) or (13,14) or (5,10) are assonant, a fact that contradicts Wilkinson's hypothesis. His hypothesis is supported by the pairs (6,14), (7,15), (8,16). In order to be able to decide whether there are any significant assonances, or only parallel assonances, or also non-parallel assonances, statistical methods are indispensable.
8.1 Cochran's Q-test

It is often useful in empirical examinations to first apply a simple, quick test before tedious and time-consuming computations are performed. In this way, it is possible to determine whether the effort of a full statistical analysis would be worthwhile. Non-parametric tests require little effort and are therefore frequently applied for this purpose. Here we shall present Cochran's Q-test (1950; cf. also Siegel 1966: 161-166). We take a random sample of 20 pantuns from a collection of Malay pantuns (Pantoen Melajoe. Weltevreden 1929) and examine the assonance in the positions in front of the caesura, i.e. the words in positions 2, 6, 10 and 14. If there is an assonance (= identity of stem vowels) in a pair, we ascribe it the value 1, otherwise 0. The results of the investigation are presented in Table 8.1. The sums of the columns are symbolized as G_j (j = 1,2,…,k), the sums of the rows as L_i (i = 1,2,…,n), and we compute the criterion

(8.1)    $Q = \frac{(k-1)\left[k\sum_{j=1}^{k} G_j^2 - \left(\sum_{j=1}^{k} G_j\right)^2\right]}{k\sum_{i=1}^{n} L_i - \sum_{i=1}^{n} L_i^2}$,

where k is the number of columns (here k = 6) and n is the number of rows (here n = 20). Inserting the numbers from Table 8.1 into formula (8.1) we obtain

Q = 5(6·473 − 43²)/(6·43 − 111) = 4945/147 = 33.64.
Table 8.1: Occurrence of assonance in pre-caesural positions

Pantun    2-6   2-10  2-14  6-10  6-14  10-14    L_i   L_i²
1         0     1     0     0     1     0        2     4
2         0     1     0     0     1     0        2     4
3         0     0     0     0     1     0        1     1
4         0     0     0     0     1     0        1     1
5         0     1     0     0     1     0        2     4
6         0     1     0     1     1     1        4     16
7         1     1     0     1     0     0        3     9
8         0     1     0     0     0     0        1     1
9         0     1     0     0     1     0        2     4
10        0     1     0     1     0     0        2     4
11        0     0     0     0     0     1        1     1
12        0     1     1     0     1     1        4     16
13        0     1     1     0     0     1        3     9
14        0     1     0     0     1     0        2     4
15        0     0     1     0     0     0        1     1
16        0     1     0     0     1     0        2     4
17        0     0     0     1     1     0        2     4
18        0     1     0     0     1     0        2     4
19        1     1     1     0     1     0        4     16
20        0     1     0     0     1     0        2     4
Sums      2     15    4     4     14    4        43    111
          G1    G2    G3    G4    G5    G6
Q is distributed like a chi-square variable with k − 1 degrees of freedom. The critical value of the chi-square statistic with 6 − 1 = 5 degrees of freedom at the level α = 0.05 is 11.1. Since our computed value is greater than the theoretical one, we conclude that assonance is not uniformly distributed over the examined positions; there is a significant tendency. The column sums (G_j) indicate where the tendency may lie, but this method gives merely an impetus for further examination. A generalization should be considered merely as a hypothesis.
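Formula (8.1) and the data of Table 8.1 can be combined into a short program; the function name is ours, the 0/1 entries are those of the table.

```python
import numpy as np
from scipy.stats import chi2

def cochran_q(X):
    """Cochran's Q, formula (8.1); X is a 0/1 matrix (rows: pantuns, columns: position pairs)."""
    n, k = X.shape
    G = X.sum(axis=0)                    # column sums G_j
    L = X.sum(axis=1)                    # row sums L_i
    Q = (k - 1) * (k * np.sum(G**2) - G.sum()**2) / (k * L.sum() - np.sum(L**2))
    return Q, chi2.sf(Q, k - 1)

# 0/1 entries of Table 8.1, one string of six digits per pantun
rows = ("010010 010010 000010 000010 010010 010111 110100 010000 010010 010100 "
        "000001 011011 011001 010010 001000 010010 000110 010010 111010 010010").split()
X = np.array([[int(c) for c in r] for r in rows])
Q, p = cochran_q(X)
print("Q = %.2f, P = %.2g" % (Q, p))     # Q = 33.64, far beyond the 5% critical value 11.1
```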
8.2 Analysis of variance

The following approach yields a global answer to the problem of the placement of assonances and can be computed relatively easily. Let us write the 32 vowels of the above pantun (omitting the prefixes), which occur in 16 stems (in the stem sahaya we take only two vowels), on two strips of paper in the given order.
If we put the strips above one another, both vowel sequences are identical. If we shift the lower strip one place to the right, we obtain

a a é u a u ĕ a u u a i a a a a o o u u a a o a a i a i a a a a
  a a é u a u ĕ a u u a i a a a a o o u u a a o a a i a i a a a a

Here we have 12 concordances (vertically adjacent identical vowels), and the measure of correspondence is 12/31 = 0.38709, where 31 is the number of comparisons. If we shift the lower strip one step further, we obtain 10/30 = 0.33333. In this way we continue shifting the lower strip for the first 17 steps (shifts) in 50 pantuns and obtain N = 17(50) = 850 correspondence values, for which we perform the analysis of variance.
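The strip procedure is easy to mechanise: the vowel sequence is compared with a copy of itself shifted by d places and the matching positions are counted. A sketch (hypothetical function name) that reproduces the two values just computed:

```python
def concordance(seq, d):
    """Number and proportion of matching symbols when the sequence is
    compared with itself shifted by d places (the paper-strip method)."""
    pairs = list(zip(seq, seq[d:]))
    matches = sum(a == b for a, b in pairs)
    return matches, len(pairs), matches / len(pairs)

vowels = "aaéuauĕauuaiaaaaoouuaaoaaiaiaaaa"   # the 32 stem vowels of the pantun above
for d in (1, 2):
    m, n, r = concordance(vowels, d)
    print("shift %d: %d/%d = %.5f" % (d, m, n, r))   # 12/31 = 0.38710, 10/30 = 0.33333
```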
Let

x_ij = measure of correspondence in step i (i = 1,2,…,17) in pantun j (j = 1,2,…,50);

x̄_i = mean correspondence in step i, i.e. $\bar{x}_i = \frac{1}{n}\sum_{j=1}^{n} x_{ij}$ (here n = 50);

x̄ = mean correspondence in the complete sample;

SSB = sum of squared deviations between the steps = $n\sum_{i=1}^{k} (\bar{x}_i - \bar{x})^2$, with k − 1 = 17 − 1 = 16 degrees of freedom;

SSW = sum of squared deviations within the steps = $\sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ij} - \bar{x}_i)^2$, with N − k = 850 − 17 = 833 degrees of freedom;

SST = total sum of squared deviations = SSB + SSW = $\sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ij} - \bar{x})^2$, with N − 1 = 850 − 1 = 849 degrees of freedom.

The F-test shows that the variability between the 17 steps must be considered significant. Now we can test in which step i the given correspondence (x̄_i) significantly differs from the mean (x̄). As can be seen below, the greatest correspondence is attained in step 16 (= 0.5298). We test the difference using the t-test
$t = \frac{\bar{x}_i - \bar{x}}{s/\sqrt{n}}$,

where s² = SSW/(N − k) = 0.013377. For step 16 the resulting t-value yields the probability P = 2.97E-28 with 833 degrees of freedom. The test indicates that there is a strong association exactly in step 16, corresponding to the parallelism between the half-strophes. In order to state the strength of assonance in individual positions, one must perform a test for each position separately. The means of correspondence in the individual shifting steps in 50 pantuns are as follows:

Step i    x̄_i
1         0.3542
2         0.3620
3         0.3546
4         0.3464
5         0.3430
6         0.3661
7         0.3408
8         0.3760
9         0.3382
10        0.3690
11        0.3237
12        0.3740
13        0.3632
14        0.3689
15        0.3447
16        0.5298
17        0.3547
The results of the computation are presented in Table 8.2.

Table 8.2: Analysis of variance of assonances in 50 pantuns (according to Altmann 1963)

Variability              SS        DF    Variance    F-test    P
Between steps (SSB)      1.6008    16    0.1000      7.479     1.35·10⁻¹⁶
Within steps (SSW)       11.1434   833   0.0134
Total                    12.7442   849
8.3 The chi-square test

Examinations applying this method were performed by Sebeok and Zeps (1969) on Cheremis folk poetry and by Altmann (1963) on Malay poetry. In order to analyse the assonance structure of pantuns in detail we proceed as follows. We select 100 pantuns which are adequate for the analysis (e.g. do not contain monosyllables). In Malay there are six vowels (a, ĕ, i, o, u, e), i.e. 36 vowel patterns (aa, aĕ, ai, …). We use the following symbols:

N = number of different pantuns (here 100);
M = set of vowel patterns, M = {aa, aĕ, ai, …}; v = any vowel pattern, v ∈ M, |M| = 36;
i, j = two positions in the pantun, i ≠ j; i, j = 1,2,…,16;
f_i(v), f_j(v) = observed number of pantuns with the pattern v in positions i and j;
f_ij(v) = expected number of pantuns with pattern v simultaneously in positions i and j.

The last number will be computed assuming independence, just as in preceding chapters, as

(8.3)    $f_{ij}(v) = \frac{f_i(v)\, f_j(v)}{N}$.
The summation of (8.3) over all patterns v in positions i and j yields

(8.4)    $E_{ij}(A) = \sum_{v \in M} f_{ij}(v)$,
i.e. the expected number of pantuns with assonance in positions i and j. The quantity (8.3) must be computed for each pattern separately. Simple counting in the pantuns yields O_ij(A) = the observed number of pantuns with an assonance in positions i and j. We compute the chi-square criterion on the basis of these numbers as follows:
(8.5)    $X^2 = \frac{\left(O_{ij}(A) - E_{ij}(A)\right)^2}{E_{ij}(A)} + \frac{\left((N - O_{ij}(A)) - (N - E_{ij}(A))\right)^2}{N - E_{ij}(A)} = \frac{N\left(O_{ij}(A) - E_{ij}(A)\right)^2}{E_{ij}(A)\,\left(N - E_{ij}(A)\right)}$

or simply

(8.6)    $X^2 = \frac{N(O - E)^2}{E(N - E)}$.
X² is distributed as a chi-square variable with 1 DF. The critical value at α = 0.05 is 3.84. If the X² value computed according to (8.6) is greater than 3.84, and O > E, one can speak of a significant assonance. The results of the computation are presented in Table 8.3. Even though one sporadically finds significant values in different columns, a systematic assonance structure can be found only for the pairs (i, i+8), i.e. just for the phonic parallelisms.
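Formulas (8.3)-(8.6) for a single position pair can be sketched as follows; the function name and the small demonstration counts at the end are invented for illustration only.

```python
def assonance_chisq(fi, fj, O, N):
    """Expected assonance count (8.4) and X^2 criterion (8.6) for positions i, j.
    fi, fj: dicts mapping vowel pattern -> frequency in the respective position;
    O: observed number of pantuns assonant in both positions; N: number of pantuns."""
    E = sum(fi[v] * fj[v] / N for v in fi.keys() & fj.keys())   # (8.3)-(8.4)
    X2 = N * (O - E)**2 / (E * (N - E))                         # (8.6)
    return E, X2

# Invented counts for two positions over N = 100 pantuns:
fi = {"aa": 40, "au": 25, "ia": 20, "ua": 15}
fj = {"aa": 35, "au": 30, "ia": 25, "ua": 10}
E, X2 = assonance_chisq(fi, fj, O=45, N=100)
print("E = %.2f, X2 = %.2f" % (E, X2))   # significant at 0.05 if X2 > 3.84 and O > E
```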
Table 8.3: Assonances in Malay pantun; first member i (rows), second member of the positional pair j (columns). Values marked with * are significant at α = 0.0005.

i    i+1    i+2    i+3    i+4    i+5    i+6    i+7    i+8       i+9    i+10   i+11   i+12   i+13   i+14   i+15
1    2.55   4.15   0.02   0.68   0.04   0.13   0.59   12.48*    1.77   0.01   0.66   2.62   0.92   4.11   0.57
2    6.14   0.97   0.66   1.15   0.00   0.12   8.46   53.75*    2.20   1.60   0.12   0.12   4.15   0.25
3    0.43   5.98   0.02   2.61   0.09   2.21   0.00   2.36      0.57   0.01   0.04   0.73   7.09
4    3.31   0.01   0.43   0.20   1.50   0.05   1.99   119.75*   0.12   1.79   0.17   0.70
5    1.59   0.08   0.01   1.03   1.20   1.53   2.17   1.06      0.58   2.11   4.96
6    1.06   4.61   0.01   2.38   0.73   0.26   2.39   34.45*    0.01   4.56
7    0.61   1.38   0.38   0.02   0.00   0.28   0.04   1.87      0.26
8    0.89   0.11   0.63   0.50   0.24   0.47   0.47   70.86*
9    0.64   0.01   3.07   1.68   0.36   0.06   1.39
10   0.24   0.04   0.99   0.49   1.33   2.94
11   6.11   0.70   0.11   0.18   2.48
12   0.48   0.84   0.18   0.81
13   0.01   0.01   0.05
14   1.40   0.00
15   1.13
Since the vocalic pattern "a - a" is so frequent in Malay that it scarcely evokes the feeling of a phonic association, we repeat the computations without the pattern "a - a". The results are presented in Table 8.4. Here the column i+8 stands out even more clearly than in Table 8.3.
Table 8.4: Assonance without the pattern "a - a"; first member i (rows), second member of the positional pair j (columns). Values marked with * are significant at α = 0.0005.

i    i+1    i+2    i+3    i+4    i+5    i+6    i+7    i+8      i+9    i+10   i+11   i+12   i+13   i+14   i+15
1    4.19   1.72   0.02   0.49   0.27   0.00   0.07   24.88*   1.80   0.48   0.82   4.12   1.07   3.54   2.61
2    0.43   1.05   1.52   0.00   0.54   1.30   4.83   71.76*   0.46   0.58   0.07   0.71   5.00   0.89
3    3.78   6.23   0.01   2.13   3.11   1.99   0.76   10.41    1.10   0.04   1.59   1.41   4.45
4    2.52   0.88   0.02   2.00   1.59   0.09   1.19   75.92*   0.61   0.74   0.03   0.31
5    0.01   0.10   0.09   0.02   1.82   1.58   2.02   4.68     3.52   1.46   6.80
6    1.01   3.44   0.15   2.47   3.54   0.05   0.22   37.36*   0.35   5.12
7    0.61   2.47   2.04   0.00   0.22   2.26   0.00   2.57     0.86
8    0.41   0.37   0.30   0.55   0.16   0.07   0.12   79.38*
9    0.76   1.64   6.53   1.32   0.01   0.37   3.56
10   1.65   0.67   2.42   0.06   0.65   3.58
11   3.53   0.13   0.08   0.21   2.68
12   0.00   0.16   0.59   0.13
13   1.75   0.09   0.01
14   0.62   0.25
15   0.57
Perhaps even stronger than assonance is the tendency towards inner rhyme in parallel positions of the Malay pantun. Considering the identity of the last two sounds of a word as a rhyme, the results presented in Table 8.5 are obtained. This tendency is strongest in the pairs (4,12) and (8,16), where it is to be expected (deterministically), but it is obvious in all the other position pairs (i, i+8) as well that the inner rhyme tendency is stronger than the assonance. Such a simple computation can unexpectedly reveal quite new aspects of a text sort. The methods used here can, of course, also be applied to the study of other "parallel" phenomena.
Table 8.5: Rhyme in Malay pantun; first member i (rows), second member of the positional pair j (columns). Values marked with * are significant at α = 0.0005.

i    i+1    i+2    i+3    i+4    i+5    i+6    i+7    i+8        i+9    i+10   i+11   i+12   i+13   i+14   i+15
1    7.44   3.76   1.31   1.75   1.87   3.30   0.83   59.03*     2.75   1.95   0.35   0.67   0.01   0.13   0.83
2    0.37   4.57   0.08   0.48   3.19   0.69   4.74   415.24*    5.52   0.05   0.91   2.42   1.83   0.63
3    0.93   1.15   2.51   0.00   0.85   1.06   0.03   27.18*     0.86   0.76   0.17   1.06   2.50
4    1.06   0.52   0.50   3.54   0.49   0.18   0.83   1462.40*   0.00   0.67   1.83   3.46
5    1.37   1.59   0.23   0.11   0.12   0.13   4.57   17.16*     0.97   0.94   0.05
6    1.23   0.03   0.85   0.29   0.49   0.04   2.22   182.35*    2.10   0.01
7    1.34   0.34   7.55   2.15   0.09   0.77   0.46   17.19*     1.39
8    1.13   0.18   0.28   2.68   0.04   0.00   7.02   1713.13*
9    0.87   0.54   0.28   0.02   0.01   2.02   1.19
10   0.34   1.50   0.01   0.52   0.16   0.11
11   0.65   0.26   1.72   0.38   0.26
12   0.23   0.96   0.62   3.65
13   0.07   0.31   0.04
14   2.68   0.00
15   6.61
9 Cyclic repetition

Cyclic repetitions are suggestive of a wave-like motion in text. The properties of cyclically repeated elements can often be represented as quantities, which opens up a wide research domain concerning not only methods but also new problems. Analyses of cyclic repetitions are best known from studies on poetic texts. Metrical elements such as stressed and unstressed syllables were frequently represented in the form of up-and-down oscillating curves. However, this kind of study never yielded any deeper insight, as these curves remained unsubstantiated and cannot be interpreted as theoretically justified models of a metrical mechanism. In this chapter, only a restricted number of the many principally applicable methods, e.g. Fourier analysis, the theory of time series, Markov chains, wavelets, etc., will be discussed.

First, let us present some examples of quantitative representations of corresponding phenomena. (1) Let x be the number of dactyls in the first four positions of the hexameters (x = 0,1,2,3,4) in the poem by Bridges (cf. Chapter 6.1); then the first 30 verses can be written as

111212311012102211233321221211

and presented graphically as in Fig. 9.1. The question is whether this sequence displays a periodic oscillation, and if so, what kind of oscillation.
Fig. 9.1. Number of dactyls in the first 30 verses in the poem by Bridges
(2) In the Slovak poem "Smrť Jánošíkova" by J. Botto, the following numbers of stresses in the individual verse positions can be found (cf. Kochol 1968):

82, 27, 31, 52, 43, 5, 90, 16, 37, 41, 39, 5

The sequence is presented graphically in Figure 9.2.
Figure 9.2. The sequence of stresses in the poem by Botto (Kochol 1968)
(3) In the first chapter of Heisenberg's book "Der Teil und das Ganze", the following sequence represents the sentence lengths measured in terms of clause numbers:

1 1 6 7 2 9 2 3 1 3 5 2 5 1 1 3 2 1 5 1 1 4 1 3 3

The sequence is presented graphically in Figure 9.3.
Figure 9.3. Sequence of sentence lengths in Heisenberg's text
Such observations of apparent patterns suggest the questions whether there are really any regularities, whether there is a simple or a superposed oscillation, whether other units and properties display similar patterns, whether there are differences between text sorts and between texts of the same sort in different languages, what the oscillations depend on, etc. The data in example (2) differ from those in (1) and (3) because here the course is given by the position in the verse and all verses are taken into account. The positions can be "numbered" and are fixed. In examples (1) and (3) it is different: here one can begin to measure from any position and end at any position; if the sequence is long enough, the results must be very similar. In example (2) the course is fixed by the language, because in Slovak the main accent is on the first syllable of the word and the secondary accents are positioned on the odd syllables; in examples (1) and (3) the authors possibly create a rhythm stepwise, which may be very complex and can attain new forms through corrections of the text; here one can assume that the given value of the variable depends on the preceding values. All these sequences of values can be considered as time series. In texts we shall probably meet only those that remain in equilibrium around an average and can be considered stationary; trends belong to a different phenomenon. Although numerous methods are available for the study of time series (cf. Box, Jenkins 1970; Pandit, Wu 1983; Schlittgen, Streitberg 1984; Grotjahn 1981), we have to refrain from voluminous analyses since we lack a sufficient amount of data and, even more importantly, insight into the linguistic nature of text-generating processes.
9.1 Fourier analysis

The simplest way to describe a cyclic regularity is Fourier analysis, which allows us to compute the amplitudes of frequencies concealed by noise. As a result, we obtain sine and cosine oscillations by means of which we approximate the given sequence, i.e. we can separate the deterministic regularity from the stochastic one. Hence we assume that the observed values y_x (x = 1,2,…,N) can be captured by the expression

$y_x = A_0 + \sum_{i=1}^{q} \left( A_i \cos(2\pi x f_i) + B_i \sin(2\pi x f_i) \right) + e_x.$

Here A_0, A_i, B_i (i = 1,2,…,q) are coefficients and f_i is the i-th harmonic of the basic frequency 1/N, i.e.

$f_i = \frac{i}{N}$,
and q is computed as follows: if N is even, then q = N/2; if N is odd, then q = (N − 1)/2. The individual coefficients can be estimated by means of the method of least squares as

(9.1.2)    $A_0' = \bar{y}$

(9.1.3)    $A_i' = \frac{2}{N}\sum_{x=1}^{N} y_x \cos(2\pi x f_i)$

(9.1.4)    $B_i' = \frac{2}{N}\sum_{x=1}^{N} y_x \sin(2\pi x f_i)$

where f_i = i/N and i = 1,2,…,q. The intensity I(f_i) of the frequency f_i is given as

(9.1.5)    $I(f_i) = \frac{N}{2}\left(A_i^2 + B_i^2\right).$

If N is even, the qth coefficients are given as

(9.1.6)    $A_q' = \frac{1}{N}\sum_{x=1}^{N} (-1)^x y_x, \qquad B_q' = 0,$

and the intensity is

(9.1.7)    $I(f_q) = N A_q^2.$

The intensities yield the periodogram of the sequence, and their sum is equal to the sum of the quadratic deviations of the observed values from their mean, i.e.

(9.1.8)    $\sum_{i=1}^{q} I(f_i) = \sum_{x=1}^{N} (y_x - \bar{y})^2,$
which can serve as a check of the computations. At the same time, I(f_i) represents the share of the coefficients A_i and B_i in the variance. We illustrate the computation with the stress pattern of syllables in Botto's "Smrť Jánošíkova" according to Kochol (1968), cf. Table 9.1. The variable x denotes the positions in the verse. The raw data, representing proportions, are given in the second row (y_x). We compute A_i and B_i according to formulas (9.1.3) and (9.1.4). To this end, we must first compute cos(2πx(1/N)) and sin(2πx(1/N)), cf. Table 9.1.
$A_1 = \frac{2}{12}\left[82(0.87) + 27(0.5) + 31(0) - 52(0.5) - 43(0.87) - 5(1) - 90(0.87) - 16(0.5) + 37(0) + 41(0.5) + 39(0.87) + 5(1)\right] = -1.74$

$B_1 = \frac{2}{12}\left[82(0.5) + 27(0.87) + 31(1) + 52(0.87) + 43(0.5) + 5(0) - 90(0.5) - 16(0.87) - 37(1) - 41(0.87) - 39(0.5) + 5(0)\right] = 1.85$
Table 9.1: Computing the coefficients of the Fourier series for the data of Kochol

x             1      2      3      4      5      6      7      8      9      10     11     12
y_x           82     27     31     52     43     5      90     16     37     41     39     5
cos(2πx/N)    0.87   0.5    0.0    -0.5   -0.87  -1.0   -0.87  -0.5   0.0    0.5    0.87   1.0
sin(2πx/N)    0.5    0.87   1.0    0.87   0.5    0.0    -0.5   -0.87  -1.0   -0.87  -0.5   0.0
All coefficients are presented in Table 9.2. The check yields s² = (1/N)Σ(y_x − ȳ)² = 7692/12 = 641, in agreement with (9.1.8): the intensities in Table 9.2 sum to 7692.

Table 9.2: Fourier analysis of the data in Table 9.1

i    f_i     Period    A_i       B_i      I(f_i)     % s²
1    0.08    12        -1.74     1.85     38.36      0.50
2    0.17    6         0.17      5.77     200.17     2.60
3    0.25    4         0.00      0.33     0.67       0.01
4    0.33    3         -19.50    20.21    4731.50    61.51
5    0.42    2.4       1.74      -4.51    139.97     1.82
6    0.50    2         -14.67    0.00     2581.33    33.56
Σ                                         7692
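The periodogram of Table 9.2 can be recomputed directly from formulas (9.1.2)-(9.1.8). The sketch below uses exact cosine and sine values, so the results differ from the two-decimal computations above only by rounding.

```python
import numpy as np

y = np.array([82, 27, 31, 52, 43, 5, 90, 16, 37, 41, 39, 5], dtype=float)
N = len(y)
x = np.arange(1, N + 1)
q = N // 2

for i in range(1, q + 1):
    if i < q or N % 2:                       # (9.1.3) and (9.1.4)
        A = 2 / N * np.sum(y * np.cos(2 * np.pi * x * i / N))
        B = 2 / N * np.sum(y * np.sin(2 * np.pi * x * i / N))
        I_fi = N / 2 * (A ** 2 + B ** 2)     # (9.1.5)
    else:                                    # N even, i = q: (9.1.6) and (9.1.7)
        A = np.mean(y * (-1.0) ** x)
        B = 0.0
        I_fi = N * A ** 2
    print("i=%d  A=%7.2f  B=%7.2f  I(f_i)=%8.2f" % (i, A, B, I_fi))

# Check (9.1.8): the intensities sum to the total sum of squares (7692)
print(np.sum((y - y.mean()) ** 2))
```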
In the next step one chooses some coefficients in such a way that the total variance is maximally reduced. The significance of the coefficients or of I(f) can also be tested (cf. Anderson 1971: 102 ff.), but it has been shown that this may lead to controversial results (cf. Tintner 1965: 223 ff.). Hence, one can choose the coefficients by trial and error and combine them, or one chooses them on the basis of linguistic assumptions. As can be seen in Figure 9.2, the verse is divided into two parts consisting of 6 syllables each. The half-verses can then be divided uniformly in three different ways, namely into periods of 2, 2.4 or 3 positions. The periods 4 and 12 can be omitted, as can be seen from the small values of I(f_i) in Table 9.2. As can easily be verified, one obtains a relatively good fitting result with the coefficients A_i and B_i, i = 2,4,5,6. Table 9.3 shows the fit of the function

ŷ_x = 39 + 0.17cos(2πx(2/12)) + 5.77sin(2πx(2/12)) − 19.50cos(2πx(4/12)) + 20.21sin(2πx(4/12)) + 1.74cos(2πx(5/12)) − 4.51sin(2πx(5/12)) − 14.67cos(2πx(6/12))

to the data in Table 9.1.
Table 9.3: Fitting the Fourier series to the data in Table 9.1

x      1       2       3       4       5       6
y_x    82      27      31      52      43      5
ŷ_x    82.25   26.27   29.49   49.54   40.25   3.27

x      7       8       9       10      11      12
y_x    90      16      37      41      39      5
ŷ_x    89.75   16.73   38.51   43.46   41.45   6.73
In Fig. 9.4 the function is presented graphically. This fit yields a residual sum of squared deviations of 39.03, which looks good "optically" but costs many coefficients; a polynomial of seventh order would have served equally well. Here we dispense with further analyses of cyclic repetitions. Many time series seem to contain only the stochastic element; models have not been developed, and linguistic assumptions are not known. One must wait for individual analyses to provide new impulses.
Figure 9.4. The course of stressing in Botto’s “Smrť Jánošíkova”
10 Summary

"A start in mathematization or mathematical modelling, however unrealistic, is better than either a prolix but unenlightening description or a grandiose verbal sketch." (Mario Bunge 1967: 469)
We hope that the present volume shows how much deeper the analysis of repetitive structures in texts can go when quantitative methods and models are applied. We also hope to have illustrated that this is the way to the construction of a theory and that it is the only way to test hypotheses objectively. We have tried to present the required mathematical tools in such a way that quantitatively untrained linguists can apply the formulas to their own data and obtain the required results. Though for most kinds of repetition there are still both few data and few hypotheses, one should begin to introduce mathematical instruments very early, because this brings a number of advantages (cf. Bunge 1967: 474-476) which a mature science cannot dispense with. It is precisely the methodological instrumentation that signals the maturing of a science. The elementary mathematical equipment presented here is not sufficient for constructing an encompassing theory. It merely opens a door and shows a direction. But by taking this way one can hope that the research will not stay at the level of concept formation for dozens of years (as is the case in qualitative text analysis or standard linguistics) but gradually penetrates into deeper layers of text formation and links this discipline with more general ones, e.g. synergetics and systems theory. We hope that the set of methods and problems presented here provides a first informative survey of the extent of this domain and stimulates both the further development of models and further counts and measurements on different texts and languages. The inductive part of this research can be advanced by:
(1) examining as many texts as possible with respect to the validity of the methods and results shown in this volume;
(2) extending the application of the methods discussed here to other units and properties and testing the results in order to obtain new knowledge of the mechanisms behind the observed text phenomena;
(3) enriching a model, if it fails with certain kinds of data, by another parameter, and trying to determine its function or its connection with properties of those data if the modified model works.

Deductive research in this field requires:
(1) theoretically assuming other kinds of repetition and testing whether they exist in reality;
(2) finding quantities and factors which influence the mechanisms of text creation; this would enrich the overall model of the structure and dynamics of text and improve its interpretability;
(3) finding interrelations between the kinds of repetition (a problem which we did not even touch in this volume) and formulating general laws under which several mechanisms of repetition can be subsumed;
(4) deriving new and hopefully more adequate models starting from other assumptions and postulated boundary conditions;
(5) setting up an axiomatic system as the basis of a future text theory which can explain all the observed phenomena and their interrelations. It goes without saying that this step can start only when much more knowledge of textual mechanisms has been acquired.

Nevertheless, we should not hesitate to enter this demanding endeavour. The study of repetitions in texts is a discipline linked from its very beginning with mathematics; it cannot be pursued without mathematics. It represents a philological discipline in which useless, sophistic and fully irrelevant discussions about the "priority" of the search for "qualitative structures" do not even arise. This circumstance will surely contribute to its quicker development.
References

Aitken, A.J.; Bailey, R.W.; Hamilton-Smith, N. (eds.). 1973. The computer and literary studies. Edinburgh: Edinburgh University Press.
Altmann, Gabriel. 1963. "Phonic structure of Malay pantun." In: Archiv orientální 31, 274-286.
Altmann, Gabriel. 1968. "Some phonic features of Malay shaer." In: Asian and African Studies 4, 9-16.
Altmann, Gabriel. 1973. "Mathematische Linguistik." In: Koch, Walter (ed.), Perspektiven der Linguistik I. Stuttgart: Kohlhammer, 208-232.
Altmann, Gabriel. 1978. "Zur Anwendung der Quotienten in der Textanalyse." In: Glottometrika 1, 91-106.
Altmann, Gabriel. 1988. Wiederholungen in Texten. Bochum: Brockmeyer.
Altmann, Gabriel; Burdinski, Violetta. 1982. "Towards a law of word repetitions in text-blocks." In: Glottometrika 4, 146-167.
Altmann, Gabriel; Buttlar, Haro von; Rott, Walther; Strauss, Udo. 1983. "A law of change in language." In: Brainerd, Barron (ed.), Historical Linguistics. Bochum: Brockmeyer, 104-115.
Altmann, Gabriel; Lehfeldt, Werner. 1980. Einführung in die quantitative Phonologie. Bochum: Brockmeyer.
Altmann, Gabriel; Schwibbe, Michael; Kaumanns, Werner; Köhler, Reinhard; Wilde, Joachim. 1989. Das Menzerathsche Gesetz in informationsverarbeitenden Systemen. Hildesheim: Olms.
Altmann, Gabriel; Štukovský, Robert. 1965. "The climax in Malay pantun." In: Asian and African Studies 1, 13-20.
Anderson, Theodore W. 1971. The Statistical Analysis of Time Series. New York: Wiley.
Antosch, Friederike. 1969. "The diagnosis of literary style with the verb-adjective ratio." In: Doležel, Bailey 1969, 57-65.
Arapov, Michail V.; Efimova, E.W.; Šrejder, Ju.A. 1975a. "O smysle rangovych raspredelenij." In: Naučno-techničeskaja informacija, Ser. 2, No. 1, 9-20.
Arapov, Michail V.; Efimova, E.W.; Šrejder, Ju.A. 1975b. "Rangovye raspredelenija v tekste i jazyke." In: Naučno-techničeskaja informacija, Ser. 2, No. 2, 3-7.
Austerlitz, Robert. 1961. "Parallelism." In: Davie et al. 1961, 439-443.
Basharin, Georgij P. 1959. "On a statistical estimate for the entropy of a sequence of independent random variables." In: Theory of Probability and its Applications 4, 333-336.
Bektaev, Kaldybay B.; Luk'janenkov, Kuz'ma F. 1971. "O zakonach raspredelenija edinic pis'mennoj reči." In: Piotrowski, Rajmund G. (ed.), Statistika reči i avtomatičeskij analiz teksta. Leningrad: Nauka, 47-112.
Belonogov, Gerol'd G. 1962. "O nekotorych statističeskich zakonomernostjach v russkoj pis'mennoj reči." In: Voprosy jazykoznanija 11, 100-101.
Berrien, F. Kenneth. 1958. General and Social Systems. New Brunswick, N.J.: Rutgers.
Berry-Rogghe, G.L.M. 1973. "The computation of collocations and their relevance in lexical studies." In: Aitken, Bailey, Hamilton-Smith 1973, 103-113.
Bertalanffy, Ludwig von. 1949. "General system theory." In: Biologia Generalis 1, 114-129.
Bertalanffy, Ludwig von. 1950. "The theory of open systems in physics and biology." In: Science 111, 23-29.
Best, Karl-Heinz; Kohlhase, Jürgen (eds.). 1983. Exakte Sprachwandelforschung. Göttingen: Herodot.
Bock, Hans-Hermann. 1974. Automatische Klassifikation. Göttingen: Vandenhoeck & Ruprecht.
Boder, David P. 1940. "The adjective-verb quotient: a contribution to the psychology of language." In: Psychological Record 3, 309-343.
Bowman, Kelsey O.; Hutcheson, K.; Odum, Eugene P.; Shenton, L.R. 1969. "Comments on the distribution of indices of diversity." In: Patil, Pielou, Waters, 315-359.
Box, George Edward P.; Jenkins, Gwilym M. 1970. Time Series Analysis, Forecasting and Control. San Francisco: Holden-Day.
Bradley, James V. 1968. Distribution-free Statistical Tests. Englewood Cliffs, N.J.: Prentice Hall.
Brainerd, Barron. 1973a. "Article use as an indirect indicator of style among English-language authors." In: Jäger, Siegfried (ed.), Linguistik und Statistik. Braunschweig: Vieweg, 11-32.
Brainerd, Barron. 1973b. "On the relation between types and tokens in literary texts." In: Journal of Applied Probability 9, 507-518.
Brainerd, Barron. 1976. "On the Markov nature of the text." In: Linguistics 176, 6-30.
Breidt, R. 1973. "Lassen sich Perseverationen durch Hirnschädigungen erklären?" In: Psychiatrica clinica 6, 357-369.
Brookes, Bertram C. 1982. "Quantitative analysis in the humanities: The advantage of ranking techniques." In: Guiter, Arapov 1982, 65-115.
Bunge, Mario. 1961. "The weight of simplicity in construction and assaying of scientific theories." In: Philosophy of Science 28, 120-149.
Bunge, Mario. 1967. Scientific Research I. Berlin: Springer.
Busemann, Adolf. 1925. Die Sprache der Jugend als Ausdruck der Entwicklungsrhythmik. Jena: Fischer.
Carroll, John B. 1960. "Vectors of prose style." In: Sebeok, Thomas A. (ed.), Style in Language. Cambridge, Mass.: The M.I.T. Press, 283-292.
Carroll, John B. 1968. "Word-frequency studies and the lognormal distribution." In: Zale, Eric M. (ed.), Proceedings of the Conference on Language and Language Behavior. New York.
Čebanov, Sergej G. 1947. "O podčinenii rečevych ukladov "indoevropejskoj" gruppy zakonu Puassona." In: Doklady Akademii Nauk SSSR, Novaja serija 55/2.
Cochran, W.G. 1950. "The comparison of percentages in matched samples." In: Biometrika 37, 256-266.
Covington, Michael A.; McFall, Joe D. 2010. "Cutting the Gordian Knot: The Moving-Average Type-Token Ratio (MATTR)." In: Journal of Quantitative Linguistics 17(2), 94-100.
Cramer, Phebe. 1968. Word association. New York: Academic Press.
Daneš, František; Viehweger, Dieter (eds.). 1977. Probleme der Textgrammatik II. Berlin: Akademie Verlag.
Dannhauer, Heinz-Martin; Wickmann, Dieter. 1972. "Quantitative Bestimmung semantischer Umgebungsfelder in einer Menge von Einzeltexten." In: Literaturwissenschaft und Linguistik 2, 29-43.
David, F.N. 1950. "Two combinatorial tests of whether a sample has come from a given population." In: Biometrika 37, 97-110.
David, J.; Martin, Robert (eds.). 1977. Etudes de statistique linguistique. Paris: Klincksieck.
Davie, D. et al. (eds.). 1961. Poetics, poetyka, poetika. Warszawa: Państwowe Wydawnictwo Naukowe.
Dijk, Teun A. van. 1980. Textwissenschaft: eine interdisziplinäre Einführung. Tübingen: Niemeyer.
Djadjuli. 1961. "Transkripsi Sjair Tjinta Berahi." In: Bahasa dan Budaja 9, 91-133.
Doležel, Lubomír; Bailey, Richard W. (eds.). 1969. Statistics and style. New York: Elsevier.
Dolphin, C. 1977. "Evaluation probabiliste des cooccurrences." In: David, Martin 1977, 21-34.
Dressler, Wolfgang U.; Beaugrande, Robert A. de. 1981. Introduction to textlinguistics. London: Longman.
Drobisch, Moritz W. 1866. "Ein statistischer Versuch über die Formen des lateinischen Hexameters." In: Berichte über die Verhandlungen der Königlichen Sächsischen Gesellschaft der Wissenschaften zu Leipzig, Philologisch-historische Klasse 15, 79-139.
Estoup, Jean-Baptiste. 1916. Gammes sténographiques. Paris: Institut Sténographique.
Fagen, Robert M.; Goldman, Robert N. 1977. "Behavioral catalogue analysis methods." In: Animal Behaviour 25, 251-274.
Fischer, H. 1969. "Entwicklung und Beurteilung des Stils." In: Kreuzer, Gunzenhäuser 1969, 171-183.
Francis, Ivor S. 1966. "An exposition of a statistical approach to the Federalist dispute." In: Leed 1966, 38-78.
Frumkina, Revekka M. 1962. "O zakonach raspredelenija slov i klassov slov." In: Mološnaja, T.N. (ed.), Strukturno-tipologičeskie issledovanija. Moskva: AN SSSR, 124-133.
Fucks, Wilhelm. 1955. Mathematische Analyse von Sprachelementen, Sprachstil und Sprachen. Köln/Opladen: Westdeutscher Verlag.
Fucks, Wilhelm. 1968. Nach allen Regeln der Kunst. Stuttgart: Deutsche Verlagsanstalt.
Fucks, Wilhelm. 1970. "Über den Gesetzesbegriff einer exakten Literaturwissenschaft, erläutert an Sätzen und Satzfolgen." In: Zeitschrift für Literaturwissenschaft und Linguistik 1, 113-137.
Fucks, Wilhelm. 1971. "Possibilities of exact style analysis." In: Strelka, Joseph (ed.), Patterns of literary style. University Park, Pennsylvania: State University Press, 51-75.
Gani, John. 1975. "Stochastic models for type counts in a literary text." In: Gani, John (ed.), Perspectives in Probability and Statistics. London: Academic Press, 313-323.
Geoffroy, Annie; Lafon, Pierre; Seidel, Gill; Tournier, Maurice. 1973. "Lexicometric analysis of co-occurrences." In: Aitken, Bailey, Hamilton-Smith 1973, 113-133.
Gibbons, Jean D. 1971. Nonparametric statistical inference. New York: McGraw-Hill.
Gonda, Jan. 1959. Stylistic repetition in the Veda. Amsterdam: N.V. Noord-Hollandsche Uitgevers Maatschappij.
Gottman, John M.; Parkhurst, J.T. 1980. "A developmental theory of friendship and acquaintanceship processes." In: Collins, W.A. (ed.), Development of cognition, affect, and social relations. Hillsdale, New Jersey: Erlbaum, 197-253.
Groot, Albert W. de. 1946. Algemene Versleer. Den Haag.
Grotjahn, Rüdiger. 1979. Linguistische und statistische Methoden in Metrik und Textwissenschaft. Bochum: Brockmeyer.
Grotjahn, Rüdiger. 1980. "The theory of runs as an instrument for research in quantitative linguistics." In: Glottometrika 2, 11-43.
Grotjahn, Rüdiger. 1982. "Ein statistisches Modell für die Verteilung der Wortlänge." In: Zeitschrift für Sprachwissenschaft 1, 44-75.
Guiter, Henri; Arapov, Michail V. (eds.). 1982. Studies on Zipf's law. Bochum: Brockmeyer.
Gunzenhäuser, Rul. 1969. "Zur literaturästhetischen Theorie G. D. Birkhoffs." In: Kreuzer, Gunzenhäuser 1969, 295-311.
Haight, Frank A.; Jones, Robert B. 1974. "A probabilistic treatment of qualitative data with special reference to word association tests." In: Journal of Mathematical Psychology 11, 237-244.
Haken, Hermann. 1978. Synergetics. Berlin: Springer.
Halliday, Michael A.K.; Hasan, Ruqaiya. 1976. Cohesion in English. London: Longman.
Harweg, Roland. 1974. “Textlinguistik.” In: Koch, Walter A. (ed.), Perspektiven der Linguistik II. Stuttgart: Kröner, 88-116.
Herdan, Gustav. 1962. The calculus of linguistic observations. The Hague: Mouton.
Herdan, Gustav. 1964. Quantitative linguistics. London: Butterworth.
Herdan, Gustav. 1966. The advanced theory of language as choice and chance. Berlin: Springer.
Herfindahl, Orris C. 1950. Concentration in the steel industry. Diss., New York: Columbia University.
Hooke, Robert; Jeeves, T.A. 1961. “Direct search solution of numerical and statistical problems.” In: Journal of the Association for Computing Machinery 8, 212-229.
Hřebíček, Luděk. 1985. “Text as a unit and co-references.” In: Ballmer, Thomas T. (ed.), Linguistic dynamics. Berlin, New York: de Gruyter, 190-198.
Hřebíček, Luděk. 1986. “Cohesion in Ottoman poetic texts.” In: Archiv orientální 54, 252-256.
Hutcheson, K. 1970. “A test for comparing diversities based on the Shannon formula.” In: Journal of Theoretical Biology 29, 151-154.
Jakobson, Roman. 1972. “Unterbewußte sprachliche Gestaltung in der Dichtung.” In: Zeitschrift für Literaturwissenschaft und Linguistik 1, 101-112.
Johnson, Norman L.; Kotz, Samuel. 1969. Discrete distributions. Boston: Houghton Mifflin.
Kalinin, Valentin M. 1956. “Funkcionaly, svjazannye s raspredeleniem Puassona, i statističeskaja struktura teksta.” In: Trudy Matematičeskogo Instituta imeni V.A. Steklova 79, 182-197.
Kalinin, Valentin M. 1964. “O statistike literaturnogo teksta.” In: Voprosy jazykoznanija 13, No. 1, 122-127.
Katz, Leo. 1965. “Unified treatment of a broad class of discrete probability distributions.” In: Patil 1965, 175-182.
Kemp, C. David; Kemp, Adrienne W. 1956a. “Generalized hypergeometric distributions.” In: Journal of the Royal Statistical Society B 18, 202-211.
Kemp, C. David; Kemp, Adrienne W. 1956b. “The analysis of point quadrat data.” In: Australian Journal of Botany 4, 167-174.
Kendall, M.G.; Stuart, A. 1967. The advanced theory of statistics. London: Griffin.
Koch, Walter A. 1969. Vom Morphem zum Textem. Hildesheim: Olms.
Koch, Walter A. 1971. Taxologie des Englischen. München: Fink.
Koch, Walter A. 1974. “Tendenzen der Linguistik.” In: Koch, W.A. (ed.), Perspektiven der Linguistik II. Stuttgart: Kröner, 190-311.
Kochol, Viktor. 1968. “Syntax a metrum.” In: Levý, Jiří, Palas, Karel (eds.), Teorie verše II. Brno: Universita J.E. Purkyně, 167-178.
Köhler, Reinhard. 1986. Zur linguistischen Synergetik: Struktur und Dynamik der Lexik. Bochum: Brockmeyer.
Köhler, Reinhard; Altmann, Gabriel. 1983. “Systemtheorie und Semiotik.” In: Zeitschrift für Semiotik 5, 424-431.
Králík, Jan. 1977. “An application of exponential distribution law in quantitative linguistics.” In: Prague Studies in Mathematical Linguistics 5, 223-235.
Kreuzer, Helmut; Gunzenhäuser, R. (eds.) 1969. Mathematik und Dichtung. München: Nymphenburger.
Ku, Harry H. 1963. “A note on contingency tables involving zero frequencies and the 2I test.” In: Technometrics 5, 398-400.
Kullback, Solomon; Kupperman, Morton; Ku, Harry H. 1962. “An application of information theory to the analysis of contingency tables, with a table of 2n ln n, n = 1(1)10,000.” In: Journal of Research of the National Bureau of Standards - B. Mathematics and Mathematical Physics 66B, 217-243.
Lánský, Petr; Radil-Weiss, Tomas. 1980. “A generalization of the Yule-Simon model, with special reference to word association tests and neural cell assembly formation.” In: Journal of Mathematical Psychology 21, 53-65.
Leed, Jacob (ed.) 1966. The computer and literary style. Kent, Ohio: Kent State UP.
Maas, Heinz-Dieter. 1972. “Über den Zusammenhang zwischen Wortschatzumfang und Länge des Textes.” In: Zeitschrift für Literaturwissenschaft und Linguistik 8, 73-96.
Mandelbrot, Benoit. 1953. “An information theory of the statistical structure of language.” In: Jackson, W. (ed.), Communication Theory. New York: Academic Press, 503-512.
Mandelbrot, Benoit. 1954a. “Structure formelle des textes et communication.” In: Word 10, 1-27.
Mandelbrot, Benoit. 1954b. “Simple games of strategy occurring in communication through natural languages.” In: IRE Transactions, PGIT 3, 124-137.
Mandelbrot, Benoit. 1954c. “On recurrent noise limiting coding.” In: Information Networks, the Brooklyn Polytechnic Institute Symposium, 205-221.
Mandelbrot, Benoit. 1957. “Linguistique statistique macroscopique.” In: Apostel, L., Mandelbrot, Benoit, Morf, A., Logique, langage et théorie de l'information. Paris: Presses Universitaires de France, 1-78.
Mandelbrot, Benoit. 1961. “On the theory of word frequencies and on related Markovian models of discourse.” In: Jakobson, Roman (ed.), Structure of Language and its Mathematical Aspects. Providence, Rhode Island: American Mathematical Society, 190-219.
Mandelbrot, Benoit. 1966. “Information theory and psycholinguistics: A theory of word frequencies.” In: Lazarsfeld, Paul F., Henry, Neil W. (eds.), Readings in mathematical social science. Chicago: Science Research Associates, 350-368.
Maškina, Ljudmila E. 1968. O statističeskich metodach issledovanija leksiko-grammatičeskoj distribucii. Minsk, Diss.
Masson, David I. 1961. “Sound-repetition terms.” In: Davie et al. 1961, 189-199.
McIntosh, Robert P. 1967. “An index of diversity and the relation of certain concepts to diversity.” In: Ecology 48, 392-404.
McNeil, Donald R. 1973. “Estimating an author's vocabulary.” In: Journal of the American Statistical Association 68, 92-96.
Miller, George A. 1957. “Some effects of intermittent silence.” In: The American Journal of Psychology 70, 311-314.
Miller, George A.; Chomsky, Noam. 1963. “Finitary models of language users.” In: Bush, Robert R., Galanter, Eugene, Luce, R. Duncan (eds.), Handbook of Mathematical Psychology II. New York: Wiley, 419-491.
Miller, George A.; Madow, William G. 1963. “On the maximum likelihood estimate of the Shannon-Wiener measure of information.” In: Luce, R. Duncan, Bush, Robert R., Galanter, Eugene (eds.), Readings in Mathematical Psychology I. New York: Wiley, 448-469.
Mittenecker, Erich. 1953. “Perseveration und Persönlichkeit I, II.” In: Zeitschrift für angewandte Psychologie 1, 5-31, 265-284.
Mood, Alexander M. 1940. “The distribution theory of runs.” In: Annals of Mathematical Statistics 11, 367-392.
Morton, Andrew Q.; Levison, Michael. 1966. “Some indicators of authorship in Greek prose.” In: Leed 1966, 141-179.
Mosteller, Frederick; Wallace, David L. 1964. Inference and disputed authorship: The Federalist. Reading, Mass.: Addison-Wesley.
Müller, Werner. 1971. “Wortschatzumfang und Textlänge. Eine kleine Studie zu einem vielbehandelten Problem.” In: Muttersprache 81, 266-276.
Muller, Charles. 1965. “Du nouveau sur les distributions lexicales: la formule de Waring-Herdan.” In: Cahiers de Lexicologie 1, No. 6, 35-53.
Muller, Charles. 1968. Initiation à la statistique linguistique. Paris: Librairie Larousse.
Muller, Charles. 1977. “Observation, prévision et modèles statistiques.” In: David, Martin 1977, 9-19.
Nelder, John A.; Mead, Roger. 1964. “A simplex method for function minimization.” In: Computer Journal 7, 308-313 (8, 1965, 27).
Nešitoj, V.V. 1975. “Dlina teksta i ob'em slovarja. Pokazateli leksičeskogo bogatstva teksta.” In: Metody izučenija leksiki. Minsk: BGU, 110-118.
Newman, Louis I.; Popper, William. 1918. Studies in Biblical parallelism. Berkeley: UCP.
Nöth, Winfried. 1974. “Kybernetische Regelkreise in Linguistik und Textwissenschaft.” In: Grundlagenstudien aus Kybernetik und Geisteswissenschaft 15, 75-86.
Nöth, Winfried. 1975. “Homeostasis and Equilibrium in Linguistics and Text Analysis.” In: Semiotica 14, 222-244.
Nöth, Winfried. 1977. Dynamik semiotischer Systeme. Stuttgart: Metzler.
Nöth, Winfried. 1978. “Systems Analysis of Old English Literature.” In: Journal for Descriptive Poetics and Theory of Literature (PTL) 3, 117-137.
Nöth, Winfried. 1983. “System theoretical principles of the evolution of the English language and literature.” In: Davenport, Mike, Hansen, Eric, Nielsen, Hans F. (eds.), Current topics in English historical linguistics. Odense: University Press, 103-122.
Oomen, Ursula. 1971. “Systemtheorie der Texte.” In: Folia Linguistica 5, 12-34.
Ord, J. Keith. 1967. “On a system of discrete distributions.” In: Biometrika 54, 649-656.
Ord, J. Keith. 1972. Families of frequency distributions. London: Griffin.
Orlov, Jurij K. 1982. “Linguostatistik: Aufstellung von Sprachnormen oder Analyse des Redeprozesses? (Die Antinomie ‘Sprache-Rede’ in der statistischen Linguistik).” In: Orlov, Boroda, Nadarejšvili 1982, 1-55.
Orlov, Jurij K.; Boroda, Moisej G.; Nadarejšvili, Isabella Š. 1982. Sprache, Text, Kunst. Quantitative Analysen. Bochum: Brockmeyer.
Osgood, Charles E. 1959. “The representational model and relevant research methods.” In: Sola Pool, I. de (ed.), Trends in content analysis. Urbana: University of Illinois Press, 33-88.
Palek, Bohumil; Fischer, Gero. 1977. “Ein Modell der Referenzstruktur des Textes.” In: Studia Grammatica 18, 74-102.
Palermo, David S.; Jenkins, James J. 1964. Word association norms. Minneapolis: University of Minnesota Press.
Pandit, S.M.; Wu, S.M. 1983. Time series and system analysis with applications. New York: Wiley.
Pantoen Melajoe. 1929. Weltevreden, Bandoeng: Visser & Co.
Paškovskij, Vladimir E.; Srebrjanskaja, I.I. 1971. “Statističeskie ocenki pis'mennoj reči bol'nych šizofreniej.” In: Inženernaja lingvistika. Leningrad.
Patil, Ganapati P. (ed.) 1965. Classical and contagious discrete distributions. New York: Pergamon.
Patil, Ganapati P.; Joshi, Sharadchandra W. 1968. A dictionary and bibliography of discrete distributions. Edinburgh: Oliver & Boyd.
Patil, Ganapati P.; Pielou, Evelyn C.; Waters, William E. (eds.) 1971. Statistical ecology 3. University Park: The Pennsylvania State University Press.
Piotrovskaja, Anna A.; Piotrovskij, Rajmond G. 1974. “Matematičeskie modeli v diachronii i tekstoobrazovanii.” In: Statistika reči i avtomatičeskij analiz teksta. Leningrad: Nauka, 361-400.
Piotrowski, Rajmond G. 1984. Text, Computer, Mensch. Bochum: Brockmeyer.
Piotrowski, Rajmond G.; Bektaev, Kaldybay B.; Piotrovskaja, Anna A. 1985. Mathematische Linguistik. Bochum: Brockmeyer.
Průcha, Jan. 1967. “On word-class distribution in Czech utterances.” In: Prague Studies in Mathematical Linguistics 2, 65-76.
Rapoport, Anatol. 1982. “Zipf's law re-visited.” In: Guiter, Arapov 1982, 1-28.
Ratkowsky, David A.; Halstead, W.H.; Hantrais, L. 1980. “Measuring vocabulary richness in literary works: A new proposal and a reassessment of some earlier measures.” In: Glottometrika 2, 125-147.
Rieger, Burghard. 1971. “Wort- und Motivkreise als Konstituenten lyrischer Umgebungsfelder. Eine quantitative Analyse semantisch bestimmter Textelemente.” In: Zeitschrift für Literaturwissenschaft und Linguistik 4, 23-41.
Rieger, Burghard. 1974. “Eine tolerante Lexikonstruktur. Zur Abbildung natürlich-sprachlicher Bedeutung auf unscharfe Mengen in Toleranzräumen.” In: Zeitschrift für Literaturwissenschaft und Linguistik 16, 31-47.
Sachs, Lothar. 1972. Statistische Auswertungsmethoden. Berlin: Springer.
Schlismann, Annemarie. 1948. “Sprach- und Stilanalyse mit einem vereinfachten Aktionsquotienten.” In: Wiener Zeitschrift für Philosophie, Psychologie und Pädagogik 2.
Schlittgen, Rainer; Streitberg, Bernd H.J. 1987. Zeitreihenanalyse. München: Oldenbourg.
Schmidt, Franz. 1972. “Numerische Textkritik: Goethes und Schillers Anteil an der Abfassung des Aufsatzes ‘Die Piccolomini’.” In: Zeitschrift für Literaturwissenschaft und Linguistik 5, 59-70.
Schweizer, Harro. 1979. Sprache und Systemtheorie. Tübingen: Narr.
Sebeok, Thomas A.; Zeps, Valdis J. 1959. “On non-random distribution of initial phonemes in Cheremis verse.” In: Lingua 8, 370-384.
Segal, Dmitrij M. 1961. “Nekotorye utočnenija verojatnostnoj modeli Cipfa.” In: Mašinnyj perevod i prikladnaja lingvistika 5, 51-55.
Sichel, H.S. 1971. “On a family of discrete distributions particularly suited to represent long-tailed frequency data.” In: Laubscher, N.F. (ed.), Proceedings of the Third Symposium on Mathematical Statistics. Pretoria: S.A. C.S.I.R., 51-97.
Sichel, H.S. 1974. “On a distribution representing sentence-length in written prose.” In: Journal of the Royal Statistical Society A 137, 25-34.
Sichel, H.S. 1975. “On a distribution law for word frequencies.” In: Journal of the American Statistical Association 70, 542-547.
Siegel, Sidney. 1956. Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.
Simon, Herbert A. 1955. “On a class of skew distribution functions.” In: Biometrika 42, 425-440.
Simpson, E.H. 1949. “Measurement of diversity.” In: Nature 163, 688.
Skinner, Burrhus F. 1939. “The alliteration in Shakespeare's sonnets: A study in literary behavior.” In: Psychological Record 3, 186-192.
Skinner, Burrhus F. 1941. “A quantitative estimate of certain types of sound-patterning in poetry.” In: The American Journal of Psychology 54, 64-79.
Sommers, Hobart H. 1962. Analyse statistique du style. Louvain, Paris: Nauwelaerts.
Spang-Hanssen, Henning. 1956. “The study of gaps between repetitions.” In: Halle, Morris (ed.), For Roman Jakobson. The Hague: Mouton, 497-502.
Strauss, Udo. 1980. Struktur und Leistung der Vokalsysteme. Bochum: Brockmeyer.
Strauss, Udo; Sappok, Christian; Diller, H.J.; Altmann, Gabriel. 1984. “Zur Theorie der Klumpung von Textentitäten.” In: Glottometrika 7, 73-100.
Štukovský, Robert; Altmann, Gabriel. 1964. “Fonická povaha slovenského rýmu.” In: Litteraria 7, 65-80.
Štukovský, Robert; Altmann, Gabriel. 1965. “Vývoj otvoreného rýmu v slovenskej poézii.” In: Litteraria 8, 156-161.
Štukovský, Robert; Altmann, Gabriel. 1966. “Die Entwicklung des slowakischen Reimes im XIX. und XX. Jahrhundert.” In: Levý, Jiří, Palas, Karel (eds.), Teorie verše I. Brno, 259-261.
Swed, F.S.; Eisenhart, C. 1943. “Tables for testing randomness of grouping in a sequence of alternatives.” In: Annals of Mathematical Statistics 14, 66-87.
Tešítelová, Maria. 1967. “On the role of nouns in lexical statistics.” In: Prague Studies in Mathematical Linguistics 2, 121-131.
Tintner, Gerhard. 1965. Econometrics. New York: Wiley.
Tuldava, Juhan. 1980. “K voprosu ob analitičeskom vyraženii svjazi meždu ob'emom slovarja i ob'emom teksta.” In: Lingvostatistika i kvantitativnye zakonomernosti teksta. Tartu, 113-144.
Uhlířová, Ludmila. 1967. “Statistics of word order of direct object in Czech.” In: Prague Studies in Mathematical Linguistics 2, 37-49.
Wildgen, Wolfgang. 1985. Archetypensemantik. Tübingen: Narr.
Wilkinson, Richard J. 1907. Malay literature. Part I. Kuala Lumpur.
Wilkinson, Richard J.; Winstedt, Richard O. 1914. Pantun Melayu. Singapore.
Woronczak, Jerzy. 1961. “Statistische Methoden in der Verslehre.” In: Davie et al. 1961, 607-624.
Woronczak, Jerzy. 1967. “On an attempt to generalize Mandelbrot's distribution.” In: To honor Roman Jakobson III. The Hague, 2254-2268.
Yngve, Victor. 1956. “Gap analysis and syntax.” In: IRE Transactions PGIT 2, 106-112.
Yule, George U. 1944. The statistical study of literary vocabulary. Cambridge: UP.
Zipf, George K. 1935. The psycho-biology of language. Boston: Houghton Mifflin.
Zipf, George K. 1949. Human behavior and the principle of least effort. Cambridge, Mass.: Addison-Wesley.
Zörnig, Peter. 1984a. “The distribution of the distance between like elements in a sequence I.” In: Glottometrika 6, 1-15.
Zörnig, Peter. 1984b. “The distribution of the distance between like elements in a sequence II.” In: Glottometrika 7, 1-14.
Zörnig, Peter. 1987. “A theory of distances between like elements in a sequence.” In: Glottometrika 8, 1-22.
Index
accent 189
activity 17, 21, 25, 29
Adams 49
adjective 17, 21, 25, 119
adverb 17
Aeneis 136
age 86
aggregation 137, 141ff., 154
–similarity 159
Aktionsquotient 17
algorithm 166
alternation 129
Altmann 4, 9, 13, 17f., 39, 47, 50, 58, 62f., 65, 77, 82, 86, 94, 96, 98f., 111, 123, 127, 137, 139, 142, 159, 163f., 172, 182
Altmann-Fitter 144, 166
ambiguity 67
amplitude 189
analogy 50
anaphora 80
Antosch 17
approximation 133
Arapov 64
article 109, 154, 170
association 5, 86, 111, 113, 116
–phonic 183
association network 121
assonance 177f., 181f.
asymmetry 43
audience 86
auditory 52
Auezov 171
Austerlitz 6
author 81, 85f., 89, 129, 136
autonomy 83
autosemantic 121
average 33
balance 86, 89
Basharin 37
Beaugrande de 1
behaviour 21, 51, 89
–limiting 166
–linguistic 61
–statistical 80, 83, 87
Bektaev 164, 171f.
Belonogov 73
Berrien 86
Berry-Rogghe 111
Bertalanffy, von 9
Best 62, 65, 123
biology 63, 83
Birkhoff 32
block 163, 172
Bock 30, 158
Boder 17
Boehnke 132, 135
Boroda 50ff., 62, 64, 82
Borodovsky 65
Bortz 132, 135
Botto 96, 188
Bowman 37, 39
Box 189
Bradley 132
Brainerd 81, 137, 141, 151, 154, 163
break 82
Breidt 3
Bridges 139, 146, 187
Brookes 64
Bunge 7, 50, 58, 89
Burdinski 163, 172
Busemann 17
Buttlar, von 50
Caesar 40
caesura 178
Captain’s daughter 142
Carroll 50, 73
cataphora 80
category 2, 106
Čebanov 51
Čech 127
Chalupka 95f., 104
change 78, 101, 159
chapter 82, 111
character 65, 87
Cheremis 182
Chi-square 24, 47, 66, 101
Chomsky 64
Clarkson 49
clause 82, 99, 111, 188
climax 104
–exponential 101
–gradual 98
–linear 97
–reduced 99
–word-length 101
cluster 5
cluster analysis 30
clustering 137, 140
coefficient
–binomial 15, 92, 132, 165
–of determination 80, 158
coherence 1
cohesion 1
coincidence 111, 114
–associative 117
–dissociative 117
–neutral 117
communication process 86
communication theory 137
communicative behaviour 9
comparison 25f., 32, 35, 86, 95, 133
compounding 119
computational linguistics 81
concept definition 8
concordance 180
confidence interval 108
confirmation
–degree of 8
connotation 5
consonant 16, 32, 95, 128
construction
–syntactic 65, 87f.
control cycle 82
corpus 61, 65, 87
corpus linguistics 61
correction for continuity 134f.
cosine 189
counting 8
Covington 86
Cramer 67, 111
creativeness 2
criterion
–likelihood ratio 151
Csiszár 49
cyclic 187
dactyl 35, 139, 187
Daneš 1
Dannhauer 111
David 139
Deister 3
Der Teil und das Ganze 188
descriptiveness 17, 21
deterministic 8
Deutschstunde 170
dictionary 61
difference 18, 25, 29, 52, 86, 92, 95, 99, 101, 123
–relative 52
differential equation 78, 82
Diller 137, 139, 142
dimension 82, 116
diphthong 66
dis legomenon 73
discourse marker 87
distance 39, 137ff., 143, 148, 152, 154f., 157, 159
–random 137
distribution
–Beta 165, 174
–Beta-binomial 166
–binomial 26, 143, 164, 166f., 172
–block 164
–Chi-square 108, 152, 179
–exponential 137
–extended displaced geometric 149
–Gamma 51
–generalised hypergeometric 77, 166
–geometric 53, 65, 137, 142f., 148, 152, 154
–hypergeometric 115, 117, 122
–Hyperpascal 58, 60, 63
–lognormal 164
–marginal 165
–mixed Poisson 164
–multinomial 27
–negative binomial 51, 55, 58, 62, 137, 143, 164, 166f., 171
–negative hypergeometric 65, 166f., 169ff.
–normal (Gauss) 45, 93, 96, 133, 164
–optimised negative binomial 144
–partial sums 65
–Poisson 51, 53, 58, 75, 94, 117, 142, 163f., 166f., 169
–probability 33
–rank-frequency 64, 67, 73
–sentence length 54
–t- 38
–tendency-free 137
–uniform 107
–Waring 65, 76f.
–word length 38, 55
–Yule 65, 73, 75ff.
–Zipf-Mandelbrot 65
diversification 10, 52
diversity 34, 38
Djadjuli 159
documentation sciences 63
Dolphin 111
dominance 83
Dressler 1
Drobisch 35
Đuraš 62
economics 63
effect
–aesthetic 52
–eulexical 17
–euphonic 12, 15
–hearer's 54
–random 26
–speaker's 54
–universal 52
effort 64
Efimova 64
Einaudi 156
Eisenhart 132
empirical generalisation 7
English 88
entropy 33
entry 82
epos 159
Epstein 137, 142
equilibrium 18ff., 86, 189
equivalence principle 1
Erlkönig 17, 34, 38, 66f., 81, 83, 87, 91, 106, 128, 134
estimation 11, 14, 22, 27, 33, 61, 81, 87, 98, 104, 143
estimator 21f., 41, 144, 149
–maximum likelihood 153
Estoup 63
ethology 83
event 15
–Poisson 141
excess 43
expectation 133
–mathematical 111, 129
experience 86
extrapolation 50
factor
–extra-linguistic 51
–extra-textual 86
factor analysis 49
factorial 15
Fagen 83
falsification 89, 166
field
–semantic 3, 111
Fischer 17, 29, 77
formal linguistics 9
formula
–recurrence 93, 116, 139
Fourier analysis 187, 189
French 111
frequency 2, 11, 33, 64, 91, 137
–class 11, 73
–expected 4, 67, 107
–statistically significant 11
–unexpected 11, 13
frequency spectrum 65, 73, 77
Frumkina 163, 169
Frumkina law 164, 166, 172
Fucks 51, 53, 127f.
function
–Altmann's ranking 65
–continuous 142
–discourse 3
–exponential 102, 104
–Gamma 165
–grammatical 3, 87
–likelihood 151
–linear 54, 98
–non-linear 101
–power 84
–probability mass 142, 167
Galle 86
Gani 81
gap 138
Geffroy 111
gender 86
generalisation
–empirical 50
generative linguistics 63
geography 63
German 29, 36, 44, 66, 81, 87f., 91, 109, 170
Gibbons 130
glide 128
Goethe 17, 38, 40, 55, 91, 134, 152
Goldman 83
Gonda 6
Good 65
goodness-of-fit 75, 80
Gottman 1
grammar 3, 61, 67, 106, 142, 156, 163
–phrase structure 88
graph
–acyclic 121
–minimal 121f.
grapheme 87
Grotjahn 34f., 40, 46, 51, 55, 66, 127f., 130, 189
Grzybek 62, 65
Guiter 64
Gunzenhäuser 32
Gusein-Zade 65
Haight 75, 81
Haken 9, 82
half strophe 114, 177
half verse 99, 101, 191
Halliday 77
Halstead 82
Hantrais 82
hapax legomenon 73, 87
Harweg 77
Hasan 77
hearer 82, 85
hearer/reader 52, 85
Heisenberg 188
Hellinger distance 49
Herdan 39, 76, 78, 82f., 137, 141f.
Herfindahl 38
Herodot 58, 63
hexameter 128, 139, 187
homogeneity 36
homonymy 119
Hooke 166
Horace 38
Hřebíček 1, 7, 77ff.
Hřebíček’s Reference Law 77
humanities 63
Hungarian 66
Hutcheson 37, 39
Hviezdoslav 96
hyper-unit 177
hypothesis 7, 19, 77, 89, 107, 129, 131, 140, 178
–alternative 13, 91
–confirmed 7
–deductive 7
–inductive 7
–null 13, 19f., 27, 45, 91, 96
–Orlov's 52
–plausible 7
–Skinner's 137, 146, 154, 158
index
–activity 17f.
indicator 17, 21, 25, 29
–association 111
–Birkhoff's 32
–global 33, 40
–Herfindahl's 38
–of concentration 38
–of entropy 33
–Popescu's 33
–Schmidt's 32
–similarity 161
–Yule's 33
Indonesian 13, 16
induction
–statistical 61
inflection 91
information
–fragmentary 61
information flow 3, 52, 64f., 82, 85
information input 86
information statistics 107
inhomogeneity 62
integrative pressure 83
intensity 190
intention 3
interpretation
–linguistic 29, 32
inventory 2, 87f., 159
–lexical 87
Isocrates 172
Italian 156, 158
iteration 130, 137
Jakobson 1, 3
Jeeves 166
Jenkins 123, 189
Johnson 77
Jones 75, 81
Joshi 60
justification 65, 67
Kalinin 73
Katz 53
Kaumanns 82
Kazakh 171
Kelih 65
Kemp 77, 166
Koch 1, 9
Kochol 188, 190
Köhler 9, 62, 65, 82, 86ff., 164
Kohlhase 123
Kostra 96
Kotz 77
Kráľ 96
Králík 137, 141
Krasko 96
Krčméry 96
Ku 24, 34, 48
Kubát 86
Kullback 24, 34, 49
Kullback-Leibler Divergence 49
Kupperman 24, 34
Kusenberg 121
Lafon 111
language “as a whole” 61
Lánský 75, 81
Latin 36, 44
Laura 117
Laux 3
law 50
–of references 7
–of text production 85
learning automaton 175
Lehfeldt 39
Leibler 49
lemma 65
lemmatizer 81
Lenz 170
Levison 58, 172
lexeme 65, 67, 80
lexical richness 7
lexicogram 111
Li 64, 73
Lienert 132, 135
Liese 49
likelihood ratio 21, 154
limiting case 58
linguistic interpretation 53, 77f., 139, 154
litany 177
Lucrece 38
Lukjanenkov 164
Maas 82
Mačutek 47, 49, 63
Madow 37
magic formula 177
maintenance input 86
Malay 97ff., 159, 161, 177, 182ff.
Mandelbrot 64, 67, 73
Markov chain 137, 146ff., 150f., 153ff., 187
Martindale 65
Maškina 164
Masson 6
mathematics 63, 193
MATTR 86
maximum likelihood 21, 61
McFall 86
McIntosh 40
McKenzie 65
McNeil 81
Mead 66, 166
mean 41, 45, 49, 74, 99, 102, 111, 128
–geometric 28
measure 19, 33, 86, 158
measurement 8
mechanism 51
median 128
memory 85
Menzerath Law 67
metaphor 177
method
–Hooke-Jeeves 166
–iterative 61, 103, 166
–moment 61
–multivariate 50
–Nelder-Mead 66, 166
–of minimal squares 61, 79, 98, 190
–optimisation 61, 166
–trial and error 191
metrical foot 5, 106
metrical pattern 5
Milička 86
Miller 37, 64f., 67, 72
mirror image 177
Mittenecker 3
modal verb 119
model
–continuous 63, 142, 155
–mathematical 13, 50f., 81
–Miller's 72
–Popescu-Altmann 65
–representational 50
–Simon-Herdan 73
–statistical 32
–type-token ratio 80
Möller 3
moment
–central 42
–initial 42
–of a distribution 42
monkey typing 68
monotony 36
Mood 130
Morho 104
morph 65, 87
morpheme 65, 164
Morton 58, 172
Mosteller 164
motif 65
Muller 76
Müller 82
musical note 87
'musical quality' 32
musicology 50, 52, 63
Nadarejšvili 50, 52, 62, 64, 82
Naumann 65
Nelder 66, 166
Nešitoj 82
Newman 177
nominal tendency 108
non-homogeneity 108
non-uniformity 33
normalisation 120
Nöth 9
noun 106, 108f., 119, 171
occurrence
–common 111
Odum 37, 39
Oomen 9
optimisation 75
optimisation technique 22
Ord 53
order parameter 82
Ord's system 58
Orlov 14, 50ff., 61f., 64, 75, 82, 87, 165
oscillation 105, 187f.
Osgood 111
Palek 77
Palermo 123
Pandit 189
pantun 97, 99, 177ff., 182, 184
pantuns 181
paragraph 82, 111
parallelism
–phonic 183
Parkhurst 1
part of speech 65, 142
partition 138, 163
Paškovskij 164
passage 163f., 172
Patil 60
pattern
–euphonic 15
–non-random 155
–rhythmic 128, 139
–vocalic 183
–vowel 182
period 191
permutation 14
phoneme 65f., 87
phrase 164
physics 63
Piotrovskaja 50, 171f.
Piotrowski 50, 64, 164, 171f., 175
Piotrowski law 123
plastic control 89
Plávka 96
Pollard 49
polysemy 52, 119
Popescu 47, 54, 65, 111, 123, 127
Popper 177
population 14, 61f., 165f.
position 5, 12, 81, 91, 98, 102, 106ff., 147, 178f., 181f., 187, 189ff.
–final 91
–stress 6
positional pair 178
predicate type 32
prediction 50
preposition 142, 169
principal component analysis 50
principle
–scientific 58
probability 11, 14, 33, 51f., 64, 68, 73, 92, 108, 138, 163f.
–conditional 147
process
–equilibration 52
–Poisson 141f.
pronoun 80, 154
property 1f.
proportion 19, 26, 83, 92, 95, 106, 155
proverb 177
Průcha 91, 106
psychic disease 175
psycholinguistics 137
psychology 8, 111
Pushkin 142
Puškin 169
Puť Abaja 171
quatrain 97
R packages 166
R² 80
Radil-Weiss 75, 81
random variable 51
range 18
rank 40, 64
Rapoport 64
ratio 21, 25
–likelihood 151
Ratkowsky 82
readership 83
reference 7, 77f., 80
reference type 87
regression
–linear 98f., 102
regularity 94
–deterministic 189
–stochastic 189
relation 2
–distributional 91
–functional 91
reliability 50
repeat rate 33, 38
repetition 1ff., 86
–absolute 4
–aggregative 5, 137
–associative 5, 111
–cyclic 6
–in blocks 5
–iterative 5, 127
–parallel 6, 177
–positional 5
–shapeless 11, 16
rhyme 4, 95, 184
–end 177
–inner 184
–open 95
rhymeless 96
rhythm 104, 106, 189
richness
–vocabulary 78, 80
Rieger 111
risk of a mistake 12
Roestam Effendi 13
Rott 50
run 127, 129, 133ff., 137
Russian 163f., 169
Sachs 46, 99
Sallust 40
Samota 102
sample 61, 133
Sappok 137, 139, 142
Saussure de 9
scene 82
Schiller 40, 117
Schlismann 17
Schlittgen 189
Schmidt 32, 62
Schweizer 9
Schwibbe 82
Sebeok 182
Segal 73
Seidel 111
self-organisation 89, 94
self-regulation 82, 89
seme 17
semi-vowels 128
sentence 54, 78, 82, 91, 106, 111
sentence length 57, 61, 172, 188
sequence 68, 80, 127, 129f., 137, 146, 164, 180, 188f.
–binary 128
–geometric 53
shair 161
Shair Cinta Berahi 159
Shenton 37, 39
Shields 49
Sichel 51, 75
Siegel 26, 132, 178
signal input 86
significance 92, 95, 98, 101, 191
–of differences 29
significance level 19f., 120
similarity 158f.
–phonetic 159
Simon 73, 75, 81
simplicity 58
Simpson 38
sine 189
size 62, 78, 83, 86ff., 101, 163, 172
skewness 43
Skinner 5
Sládkovič 96, 102
Slovak 94ff., 101, 188f.
Smrť Jánošíkova 188
Sommers 50
sound 12, 16, 32, 65, 87, 91f., 159, 184
sound pair 160
space
–multi-dimensional 82
Spang-Hanssen 137, 142
speaker 82
speaker/writer 52
spondee 35, 139
spontaneity 159
Srebrjanskaja 164
Šrejder 64
Stacho 96
Stadlober 65
stage play 82
standard deviation 42
stationary 189
stem 178f.
stereotypy 36, 91
stochastic process 73, 81
story teller 159
strategy 50
Strauss 9, 50, 137, 139, 142
Streitberg 189
stress 188, 190
strophe 111
structuralism 9, 63
structure
–assonance 177
–grammatical 177
–parallel 177
–phonetic 177
Student variable 45, 100
Štukovský 94, 96, 98f.
style 83
'style ratio' 32
sub-system 83
succession 127
suppletivism 67
Swed 132
syllable 32, 52, 87, 98, 128, 137, 189ff.
–stressed 187
–unstressed 187
synergetic linguistics 9f.
synergetics 9, 85
synonymy 119
syntactic phenomenon 164
system 83, 89
–code 9
–cognitive 86
–communication 9
–living 86
–semiotic 9
systems theory 9, 63
Taylor series 24
tendency 16, 91f., 96, 116, 128f.
Tešítelová 76
test
–binomial 20, 92, 108
–Chi-square 36, 66f., 140, 169, 182
–Cochran's Q- 178
–E- 103
–F- 46, 80, 83, 161, 180
–Fisher's exact 26
–for significance of change 101
–McNemar's 101
–statistical 26
–t- 37, 45, 96, 99, 161, 180
text characteristic 5
text collection 65
text economy 77
text fragment 64
text law 5, 11
text length 64, 82f., 86
text mixture 64
text processing 85
text segmentation 6
text sort 52, 54, 77, 80, 83, 86, 111, 189
text typology 80
theory 6
–construction 7
–information 33
threshold 13, 15, 18, 128
time series 187, 189
token 85
tokenizer 81
Tournier 111
total variation distance 49
Totentanz 38, 44, 134
transformation 18
transition 148
trend 98, 103
TTR 81ff., 85ff.
Tuldava 65, 82f., 86
Turčány 96
Turkish 79
Tuzzi 111, 123, 156, 158
type 85
type-token 85f., 88
Uhlířová 137
unification 10, 52
unified theory 9, 77
uniformity 36, 39, 107
uniqueness 91
unit
–textual 1
urn 138f., 142f.
Vajda 49
validity 50, 64
variability 180
variable
–Student 100
variance 41, 45, 103, 133, 190f.
–analysis of 179
variation interval 34
verb 17, 21, 25, 119
verb prefix 66
Vergil 35
verse 91, 99, 106, 111, 113, 128f., 159, 190
verse length 106
verse types 35
vertex 122
Viehweger 1
visualisation 120
vividness 17
vocabulary 77f., 81, 85
–children's 85
vowel 14, 32, 94ff., 128, 178f., 182
Vulanović 164
Wallace 164
wavelets 187
Welch 45
Wickmann 111
Wilde 82
Wildgen 9
Wilkinson 98, 177f.
Wimmer 9, 49, 58, 65, 77, 86, 127
window 86
–moving 86
Winstedt 98
word 82, 86, 137, 154, 163f.
–content 3
–function 3
–rhyme 92, 94, 96
word length 34, 54, 61, 98
word order 106
word pair 111
word token 79, 83
word type 78, 83
word-form 65ff., 72, 80
word-formation 91
Woronczak 73, 127f.
writer 155
Wu 189
Yngve 137, 142
Yule 65, 73
Žáry 96
Zeps 182
Zipf 10, 52, 63ff., 67, 137
'Zipf Number' 52
Zipf's Law 65
Zipf-Mandelbrot law 40, 50, 51f., 63f., 65
Zipf-Orlov Length 64
Zipf's force 10, 52
Zörnig 77, 137, 155