
Arjuna Tuzzi, Martina Benešová and Ján Mačutek (Eds.) Recent Contributions to Quantitative Linguistics

Quantitative Linguistics

Editors: Reinhard Köhler, Gabriel Altmann, Peter Grzybek

Volume 70

Recent Contributions to Quantitative Linguistics

Edited by Arjuna Tuzzi, Martina Benešová and Ján Mačutek

ISBN 978-3-11-041987-0
e-ISBN (PDF) 978-3-11-042029-6
e-ISBN (EPUB) 978-3-11-042035-7
ISSN 1861-4302

Library of Congress Cataloging-in-Publication Data
A CIP catalog record for this book has been applied for at the Library of Congress.

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

© 2015 Walter de Gruyter GmbH, Berlin/Boston
Printing and binding: CPI books GmbH, Leck
♾ Printed on acid-free paper
Printed in Germany
www.degruyter.com

Editors’ Foreword

During the last decades, research in linguistics has experienced a rapid change in its methodology. This process has fostered a renewed interest in topics related to quantitative approaches. Both the amount of data available in electronic format and the number of methods for extracting information from large databases have increased. As a consequence, many approaches today combine linguistic concepts, mathematics, statistics, and information technologies to analyze language and text.

Quantitative Linguistics (QL) is a relatively young but rapidly developing discipline covering more and more areas of linguistic and textological research. As in many other disciplines, researchers in QL deal with a large number of objectives, data, and methods. The number of methods that can be exploited for analyzing language and text is potentially unlimited, and their applicability heavily depends on the aims of the studies; as the number of fields of research increases, the number of problems increases as well, and both new methodological perspectives and new theoretical approaches are needed. Theories and methods reflect different approaches defining their goals and capabilities as well as their limits and constraints.

This book represents an overview of the state of the art in current QL, its interdisciplinary character, and its scope and reach. It provides an opportunity to discuss the strengths, weaknesses, advantages, and drawbacks of different approaches and to highlight their connections to other disciplines. The book collects a selected range of refereed contributions. The volume includes twenty contributions written by thirty-two authors from fifteen different countries (Australia, Brazil, Canada, China, Czech Republic, Germany, Greece, Italy, Japan, Poland, Russian Federation, Slovakia, Switzerland, Ukraine, United States of America). At first glance these chapters may appear fragmentary in content, but as a whole they provide a broad overview of current research topics in QL and related fields. Moreover, they are joined together by a common interest in quantitative aspects of the analysis of language and text.

Although it is difficult to cluster all contributions according to a classification of their contents (which is why we decided to list them in simple alphabetical order), the volume contains a substantial group of studies that belong to the very core of the discipline, since they concern fundamental linguistic laws such as Zipf’s laws (Andrij Rovenchak; Yanru Wang and Xinying Chen), the Menzerath-Altmann law (Martina Benešová and Denis Birjukov; Martina Benešová, Dan Faltýnek and Lukáš Zámečník), and the Piotrowski-Altmann law (Arjuna Tuzzi and Reinhard Köhler), as well as two further contributions that apply traditional and less traditional measures to scrutinize the diversity of language and its properties (Ján Mačutek; Aris Xanthos and Guillaume Guex). Moreover, the book includes contributions devoted to the syntactic features of languages (Sven Naumann; Haruko Sanada; Relja Vulanović and Tatjana Hrubik-Vulanović; Makoto Yamazaki) and to language change from both the geographical and the diachronic perspective (Sheila Embleton, Dorin Uritescu and Eric S. Wheeler; Vasiliy Poddubnyy and Anatoliy Polikarpov; Betsy Sneller; Petra Steiner). Finally, a group of contributions can be placed under the general topic of stylometrics, with particular reference to authorship attribution and text classification (Sergey Andreev; Łukasz Dębowski; George Mikros and Kostas Perifanos; Adriana Pagano, Giacomo Figueredo and Annabelle Lukin; Jacques Savoy).

We would like to thank all authors for their worthwhile contributions and also the referees for their invaluable effort. Martina Benešová and Ján Mačutek would like to acknowledge the research grant ESF OPVK 2.3 – Linguistic and lexicostatistic analysis in cooperation of linguistics, mathematics, biology and psychology (CZ.1.07/2.3.00/20.0161), which supported them during the preparation of this volume. Special thanks go to the members of and students at the Department of General Linguistics, Faculty of Arts, Palacký University Olomouc, Czech Republic, who helped us with the preparation of this volume.

The Editors:
Arjuna Tuzzi
Martina Benešová
Ján Mačutek

Table of Contents

Editors’ Foreword  V
Sergey Andreev: Quantitative Analysis of Poetic Space: Discrimination of Loci in Eugene Onegin by A. S. Pushkin  1
Martina Benešová – Denis Birjukov: Application of the Menzerath-Altmann Law to Contemporary Written Japanese in the Short Story Style  13
Martina Benešová – Dan Faltýnek – Lukáš Hadwiger Zámečník: Menzerath-Altmann Law in Differently Segmented Texts  27
Łukasz Dębowski: A New Universal Code Helps to Distinguish Natural Language from Random Texts  41
Sheila Embleton – Dorin Uritescu – Eric S. Wheeler: The Advantages of Quantitative Studies for Dialectology  51
Ján Mačutek: Type-token relation for word length motifs in Ukrainian texts  63
George K. Mikros – Kostas Perifanos: Gender Identification in Modern Greek Tweets  75
Sven Naumann: Syntactic Complexity in Quantitative Linguistics  89
Adriana S. Pagano – Giacomo P. Figueredo – Annabelle Lukin: Measuring Proximity Between Source and Target Texts: an Exploratory Study  103
Vasiliy Poddubnyy – Anatoly Polikarpov: Evolutionary Derivation of Laws for Polysemic and Age-Polysemic Distributions of Language Sign Ensembles  115
Andrij Rovenchak: Quantitative Studies in the Corpus of Nko Periodicals  125
Haruko Sanada: The Co-occurrence and Order of Valency in Japanese Sentences  139
Jacques Savoy: Authorship Attribution Using Political Speeches  153
Betsy Sneller: Using Rates of Change as a Diagnostic of Vowel Phonologization  165
Petra C. Steiner: Diversification in the Noun Inflection of Old English  181
Arjuna Tuzzi – Reinhard Köhler: Tracing the History of Words  203
Relja Vulanović – Tatjana Hrubik-Vulanović: Grammar Efficiency and the Idealization of Parts-of-speech Systems  215
Yanru Wang – Xinying Chen: Structural Complexity of Simplified Chinese Characters  229
Aris Xanthos – Guillaume Guex: On the Robust Measurement of Inflectional Diversity  241
Makoto Yamazaki: The Influence of Word Unit and Sentence Length on the Ratio of Parts of Speech in Japanese Texts  255
References  267
Index of Names  269
Subject Index  273
Authors’ Addresses  281

Sergey Andreev

Quantitative Analysis of Poetic Space: Discrimination of Loci in Eugene Onegin by A. S. Pushkin

Abstract: The paper is devoted to a quantitative comparison of different loci in the novel in verse Eugene Onegin by Pushkin. A discriminant model is built which reveals the opposition of formal features of different loci in the novel.

Keywords: poetic space, poetic time, locus, discriminant analysis, Pushkin

Smolensk State University; [email protected]

1 Introduction

As is well known, the style of one and the same author may vary considerably due to genre (Rudman 2003: 27), purpose of writing (McMenamin 2002), life circumstances (Pennebaker & Lay 2002), time of creation (Can & Patton 2004; Hoover 2006; Juola 2007), and other factors, such as the style in more popular vs. less known works (Andreev 2014). Style variation within one and the same literary work has been studied from the point of view of the dynamics of style changes (Martynenko 2004; Naumann et al. 2012) and in relation to the speech of different characters depicted in a literary work (Burrows 1987; Hota et al. 2006; Piotrowski 2006; Stewart 2003). The purpose of this study is to see if style variation is observed in the description of different poetic (fictional) space and time (the chronotope).

After Mikhail Bakhtin formulated the notion of the chronotope (Bakhtin 2000), which he understood as a unity of poetic time and space coordinates, it became very popular and was widely used in numerous philological studies. Though time and space in the chronotope are considered to be closely integrated, they are very often studied separately. Bakhtin himself thought that time is the leading component of the chronotope, but at present the attention of researchers seems to have shifted to the other component, poetic space. This study will also be focused mostly on the space component of the chronotope, usually designated either as “topos” or “locus”. The latter term will be used in this paper.

By locus we understand a specific locality of the poetic world, the place of action, whose borders are difficult to cross (Lotman 1968). They can be crossed only by the main characters, and very often not because of their own wish, but as a result of some urgent necessity, due to dramatic events, or in some cases as a reward for their virtues (Lotman 2000: 301). Loci in prose and verse have been analyzed from literary, cultural, and semiotic points of view. The purpose of this study is to carry out a quantitative comparison of different loci based on their formal features. It should be noted that in verse formal features acquire semantic properties.

2 Database

The database of this study is the novel in verse Eugene Onegin (EO) by Alexander Pushkin. This work is considered to be one of the best (if not the best) achievements of Russian literature. In order to see the relationships between the loci of the novel, their types, and the poetic time, it seems reasonable to give a very brief account of the plot of EO.1

Eugene Onegin, the main character of the novel, lives in Saint Petersburg and belongs to high society. He has mastered the art of winning women’s hearts, and he practices it constantly and successfully without any real feeling. Eventually he becomes tired of this superficial life and falls into depression. Soon Onegin is informed that his uncle has died, leaving him a country estate. Onegin leaves Saint Petersburg and goes to the country. There he makes friends with his neighbor, a landowner named Vladimir Lensky, who is a young romantic poet. Vladimir is madly in love with a local landowner’s daughter, the beautiful but rather light-minded Olga Larina. Olga has an older sister, Tatyana, an idealistic, romantic, and shy girl. Tatyana falls in love with Onegin. She writes a love letter to Onegin, speaking of her feelings for him. Onegin is touched by her sincerity, and explains to Tatyana that by accepting her love he would only ruin her. He warns Tatyana never to open her feelings to men, because inexperience leads to danger. Soon Tatyana has a nightmare which is a sort of prophecy. In this dream Onegin commands the forces of Hell, declares that Tatyana is his woman, and then kills his friend Lensky. Some things from the prophecy soon become real. Lensky becomes jealous when Onegin courts his fiancée Olga at a provincial ball and challenges Onegin to a duel. Onegin accepts it and shoots his friend dead.

1 A remark is necessary that the novel is characterized by a complex plot structure in which different levels of narration are interwoven. Here we take into account only the surface layer of the plot.

Horrified, Onegin leaves his country estate forever. He travels around the world, trying to forget what happened. Olga, after brief mourning, marries a military officer and leaves her home with him. Tatyana, still in love with Onegin, rejects all offers of marriage, and her mother decides to take her to Moscow, where she hopes Tatyana will meet someone to interest her. Several years pass, Onegin returns to Saint Petersburg and at a ball meets Tatyana, who is married to an aged prince. Now this former modest and shy provincial girl whose love he ignored has become a brilliant woman, admired by high society. He is astonished and falls deeply in love with her. Onegin fails to attract the attention of Tatyana and in despair writes a letter to her. Receiving no answer, he risks going to her home, where he finds Tatyana crying over his letter. Tatyana tells Onegin that she loves him and will love him forever, but she is married and will never betray her husband. She asks Onegin to leave her. Onegin is completely disconsolate.

On the basis of the plot, nine loci may be singled out. They are Saint Petersburg (SPb), Moscow (Msc), Village (Vlg), Road, Tatyana’s Dream (Dream), Duel, the Letters by Tatyana and Onegin, and Traveling (Fig. 1).

Fig. 1: The loci in EO

Since Pushkin did not include the chapter about Onegin’s travels in the final version of the novel, it was not analyzed in this study. The other loci may be divided into two groups. The first group (primary loci) includes Saint Petersburg, Moscow, and Village. They form the main structure of the poetic space in the novel. The other five loci may be called secondary. Duel and Road, though they are important turning points of the plot and are usually singled out by critics, nevertheless take up a rather limited amount of the text. Tatyana’s Dream, Tatyana’s Letter, and the Letter by Onegin represent the poetic world created by the fictional characters themselves. These three loci may be conventionally called “second-order” loci.


Saint Petersburg is present in the novel in the first and last chapters, which are distant from each other not only structurally, but are also separated by poetic time. The time difference coincides with the changes that happen to the main character. In the first chapter Onegin is not capable of deep emotions. In the last chapter he is deeply in love, suffering and ill because now he cannot be with Tatyana. This is why we distinguish between two loci, namely SPb-1 (Chapter 1) and SPb-2 (Chapter 8). By doing this we actually follow the chronotope theory of time and space unity. The same consideration applies to Village, which is represented starting from the end of Chapter 1 until the middle of Chapter 7. In this study, only two chapters representing this locus were analyzed: the second and the seventh. The events described in these two chapters are also distant in poetic time and rather different in their contents. In Chapter 2 the action is centered around Onegin, whereas in Chapter 7 Onegin is absent altogether and Tatyana becomes the main character. That is also why we differentiate between Village-1 and Village-2.

The stanzas of EO in primary loci were grouped according to their main themes. Each primary locus is thus a class of groups of stanzas. Tab. 1 shows these stanza groups and the number of lines in which the loci are represented.

Tab. 1: Groups of stanzas of primary loci

Locus       Chapter   Number of lines   Theme
SPb-1 (1)   1         132               Childhood
SPb-1 (2)   1         224               High society
SPb-1 (3)   1         140               Depression
Vlg-1 (1)   2         70                Onegin in his village
Vlg-1 (2)   2         98                Lensky
Vlg-1 (3)   2         70                Onegin and Lensky
Vlg-1 (4)   2         84                Lensky and Olga
Vlg-1 (5)   2         84                Tatyana
Vlg-1 (6)   2         112               The Larins (father and mother)
Vlg-2 (1)   7         42                Grave of Lensky
Vlg-2 (2)   7         126               Tatyana in Onegin’s house
Vlg-2 (3)   7         84                Preparations for departure
Msc (1)     7         56                Arrival in Moscow
Msc (2)     7         42                At cousin Alina’s house
Msc (3)     7         98                Visits to relatives
Msc (4)     7         42                Ball
SPb-2 (1)   8         140               Onegin meets Tatyana
SPb-2 (2)   8         232               Onegin is in love with Tatyana
SPb-2 (3)   8         126               The last meeting

The stanzas of the secondary loci were not divided into subgroups because the texts in which they are reflected are much shorter (Tab. 2).

Tab. 2: Stanzas of secondary loci

Locus      Chapter   Number of lines
Letter-T   3         79
Dream      5         146
Duel       6         154
Road       7         70
Letter-O   8         60

3 Features

Formal features used in this study include parts of speech and syntactic characteristics. The list of these features is given below. Examples are taken from the translation of the novel into English by Ch. Johnston (Johnston 1977). The frequencies of the features were normalized to the length of the texts.

3.1 Syntactic Features

Some of these features mark certain deviations from “the norm” in poetry. They include enjambment of different types and syntactic pauses. Enjambment is observed when a syntactic pause is removed from the end of a verse line to some place in the middle of that line (or to some place in the middle of the next line). Depending on where the pause is moved, three types of enjambment are traditionally distinguished. Rejet takes place when the syntactic structure is longer than the line and ends in the second line (“We all meandered through our schooling / haphazard; so, to God be thanks”). In case the first line is broken by the beginning of a syntactic structure that is continued up to the end of the second line, the type of enjambment is called contre-rejet (“He saw that, as in cities, here / boredom has just as sure an entry”). Double-rejet takes place if the syntactic structure begins in the middle of the first line and ends in the middle of the second line (“That, like proud Byron, I can draw / self-portraits only – furthermore”).

A syntactic pause in the verse line occurs when two syntactic structures are present in the same line (“How often, when the sky was glowing”). An emphatic line is observed in case it ends with such formal markers as exclamation points, question marks, or periods (“Yet sweeter far, at such a time, / The strain of Tasso’s octave-rhyme!”). In addition, full inversion, coordinate and subordinate sentences, and direct speech were counted.

3.2 Part-of-speech Features

These features include three main parts of speech: nouns, verbs, and adjectives. In the verse line three positions were distinguished: initial (the first metrically strong), final (the last metrically strong), and intermediate (all other places in the line between the first and the last metrically strong positions). The count of nouns, verbs, and adjectives was done separately for each position (the approach with localized parts of speech), but the total number of every part of speech irrespective of their localization was also used at the second stage of the analysis. Part-of-speech analysis was carried out manually.
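The paper states only that feature frequencies were normalized to text length, without giving the normalization constant. The following minimal Python sketch shows one common convention (rates per 100 verse lines); the feature names, counts, and the constant 100 are our assumptions, not the author's:

```python
from collections import Counter

def normalized_frequencies(feature_counts, n_lines, per=100):
    """Turn raw feature counts for one stanza group into rates per `per` verse
    lines, so that groups of different length (cf. Tab. 1) become comparable."""
    return {feature: per * count / n_lines
            for feature, count in feature_counts.items()}

# Hypothetical counts for the 132-line group SPb-1 (1) from Tab. 1.
counts = Counter(double_rejet=5, syntactic_pause=21, noun_initial=33)
print(normalized_frequencies(counts, n_lines=132))
```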

4 Analysis and Results

To see if there is any difference between the five main loci (SPb-1, SPb-2, Vlg-1, Vlg-2, Msc), multivariate discriminant analysis was used (Klecka 1980; Mikros 2009). The discriminant model includes the following features which differentiate the classes: double-rejet, subordinate clauses, inversion, syntactic pauses, nouns and adjectives in the beginning and middle of the line, and verbs in the initial position. The first two discriminant functions discriminate well among all five main loci (Fig. 2). The first function discriminates between the loci with the exception of SPb-1 and SPb-2. The second discriminant function discriminates between SPb-1 and SPb-2 and between Msc and Vlg-2.


Fig. 2: Discrimination of five primary loci by two discriminant functions
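The paper does not name the software used for the discriminant analysis. As a sketch of how the same computation could be set up with scikit-learn: the feature matrix X and the labels y below are placeholders standing in for the normalized feature values of the 19 stanza groups of Tab. 1, which are not published in the paper.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder feature matrix: one row per stanza group (19 groups, Tab. 1),
# one column per normalized feature (double-rejet, subordinate clauses,
# inversion, syntactic pauses, positional noun/verb/adjective rates, ...).
rng = np.random.default_rng(0)
X = rng.random((19, 9))
y = ["SPb-1"] * 3 + ["Vlg-1"] * 6 + ["Vlg-2"] * 3 + ["Msc"] * 4 + ["SPb-2"] * 3

lda = LinearDiscriminantAnalysis().fit(X, y)
scores = lda.transform(X)      # projections onto the discriminant functions
print(scores[:, :2])           # the first two functions, as plotted in Fig. 2

# The post-hoc test of Tab. 3 reclassifies the groups and cross-tabulates
# the predicted classes against the natural ones.
accuracy = np.mean(lda.predict(X) == np.array(y))
print(accuracy)
```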

A post-hoc test gave a rather unexpected result, shown in Tab. 3, in which the rows are the classes of loci (Tab. 1) and the columns are the result of classification on the basis of the discriminant model features. The former may be called natural classes, the latter automatically formed ones. This test showed 100% correctness of the automatic classification as compared to the natural classes.

Tab. 3: Post hoc test (localized parts of speech)

        Percent correct   SPb-1   SPb-2   Msc   Vlg-1   Vlg-2
SPb-1   100               3       0       0     0       0
SPb-2   100               0       3       0     0       0
Msc     100               0       0       4     0       0
Vlg-1   100               0       0       0     6       0
Vlg-2   100               0       0       0     0       3
Total   100               3       3       4     6       3

It should be noted that if the localization of parts of speech in the verse line is ignored, the discrimination becomes slightly worse. A post-hoc test in this case shows only 89% correctness (Tab. 4). This is still not a bad result, but it is worse than the previous one with localized parts of speech, and it confirms the importance of the localization of elements in the verse line.


Tab. 4: Post hoc test (unlocalized parts of speech)

        Percent correct   SPb-1   SPb-2   Msc   Vlg-1   Vlg-2
SPb-1   100               3       0       0     0       0
SPb-2   100               0       3       0     0       0
Msc     75                0       0       3     0       1
Vlg-1   100               0       0       0     6       0
Vlg-2   66.67             0       0       1     0       2
Total   89.47             3       3       4     6       3

The next step of the analysis was to see if the poetic time, which was used to divide Saint Petersburg and Village into two loci each, is relevant. To do this, the members of the pairs SPb-1/SPb-2 and Vlg-1/Vlg-2 were joined together, constituting the correspondingly bigger loci SPB and VLG. Discriminant analysis was carried out for three loci: Saint Petersburg, Village, and Moscow. The latter remained the same as before since it did not change over time in EO. As a result, it was found that a highly pronounced difference between these loci remains irrespective of poetic time. The discriminant model is practically the same as when poetic time is taken into account (Tab. 5). The main features of the model are subordinate clauses, inversion, syntactic pauses, and nouns and adjectives in the beginning and middle of the line. Tab. 5 shows the squared Mahalanobis distances between these loci.

Tab. 5: Squared Mahalanobis distances between the centroids of three loci

       SPB     VLG     Msc
SPB    –       84.55   48.76
VLG    84.55   –       16.68
Msc    48.76   16.68   –

As seen in this table, Moscow and Village are much closer to each other than either is to Saint Petersburg. The opposition between town and country is less vivid than the opposition between Saint Petersburg and the rest of Russia. This result, generally speaking, coincides with the opinions of many critics who think that the image of Moscow in EO describes good old Russia and its glory as opposed to heartless aristocratic Saint Petersburg. But the fact that this opposition between the two towns partially neutralizes the opposition of town vs. village is less expected.


Not only is the poetic time in the loci different, the actual or “real” time of writing the chapters of the novel was different as well. As has been mentioned above, the style of an author can undergo noticeable changes over time. The dates of writing the chapters containing the loci are as follows:
– Chapter 1 – 1823 (Locus SPb-1)
– Chapter 2 – 1823 (Locus Vlg-1)
– Chapter 7 – 1827-1828 (Loci Vlg-2, Msc)
– Chapter 8 – 1829-1830 (Locus SPb-2)
The obtained opposition of Town and Village, whose loci were depicted during very distant periods of time, may be taken as an argument in favour of recognizing that neither poetic time nor “real” time is relevant for the given chronotopes. But this conclusion seems rather speculative, and the issue deserves further study, which should include among other things a comparison of the formal features of the chapters taken as a whole (Chapters 1-2 vs. Chapters 7-8).

Fig. 3: Discrimination of three loci (Saint Petersburg, Moscow, Village) by two discriminant functions

Fig. 3 shows how the two functions discriminate between the three classes (loci) in the multidimensional space. The first (strongest) function discriminates mostly between Saint Petersburg and the two other loci (Moscow and Village). The second function separates the locus of Moscow from the other classes.

It is also possible to determine which of the primary loci each secondary locus resembles. The classes of Saint Petersburg, Moscow, and Village in such a case may be called criteria. To carry out this classification, group centroids of the criterion classes were calculated and the Mahalanobis distances of each secondary locus to these centroids were established (Tab. 6). A secondary locus may be considered as belonging to the criterion class to whose centroid it is the closest. The shortest distance in each row of the table is marked with an asterisk.

Tab. 6: Mahalanobis distances to the centroids of the criterion loci

                   Saint Petersburg   Moscow    Village
Tatyana’s Letter   288.48             158.92    84.55*
Dream              61.85              12.32     10.11*
Duel               17.88*             20.06     44.30
Road               48.46              8.30*     16.31
Onegin’s Letter    18.38*             64.11     89.25
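For illustration, a minimal sketch of the nearest-centroid assignment just described, assuming precomputed class centroids and a pooled covariance matrix (both are placeholders here; the paper publishes neither, and does not name the software it used). The scipy call returns the unsquared Mahalanobis distance, so it is squared to match the values of Tab. 5:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

def nearest_criterion_class(v, centroids, pooled_cov):
    """Assign a secondary locus (feature vector v) to the criterion class
    whose centroid is closest in squared Mahalanobis distance (cf. Tab. 6)."""
    VI = np.linalg.inv(pooled_cov)  # scipy expects the inverse covariance
    d2 = {name: mahalanobis(v, c, VI) ** 2 for name, c in centroids.items()}
    return min(d2, key=d2.get), d2

# Placeholder data: three criterion centroids in a 5-feature space.
rng = np.random.default_rng(1)
centroids = {"SPB": rng.random(5), "Msc": rng.random(5), "VLG": rng.random(5)}
pooled_cov = np.eye(5)          # stand-in for the pooled covariance matrix
label, distances = nearest_criterion_class(rng.random(5), centroids, pooled_cov)
print(label, distances)
```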

According to the results obtained, both Tatyana’s Dream and her Letter belong to the locus of Village. Duel and Onegin’s Letter are related to Saint Petersburg, and Road belongs to the locus of Moscow. Though this classification is based on purely formal features, it nevertheless can be given a meaningful interpretation. The fact that the Letter of Onegin is included in the locus of Saint Petersburg seems very reasonable, since Onegin belongs to Saint Petersburg high society. Even under stress he follows the norms and rules of expression typical of his social status. The inclusion of Duel in the locus of Saint Petersburg may also be considered logically motivated, actually for the same reasons: dueling certainly belonged to the standards and norms of “high society” circles.

A possible explanation for why Road falls into the class of Moscow is as follows. Road is traditionally viewed as a metaphorical representation of changes in the life of a character, and in EO this locus represents the passage of Tatyana from her former provincial life to a new status as a married woman in a completely new social sphere. But Moscow is the place where this radical change is actually realized. So both loci are similar in this respect, showing changes in the life of Tatyana. If Road metaphorically represents the changes as movement in space, Moscow gives a more static spatial description.

In literary criticism, Tatyana’s Dream is often considered a certain reflection of country life, reorganized by folk traditions and possessing only highly negative connotations. In other words, Village and Dream are traditionally viewed as two visions (real and fantastic) of the same poetic space. This interconnection is intensified by the similarity of their formal features. The Letter of Tatyana was actually written in the country, and this might serve as an explanation for why it falls into the Village class. But this explanation seems somewhat simplified. Both the contents of the letter and the manner of presentation certainly differ from those themes which are usually associated with rural life. A better explanation of the grouping of this locus into the class of Village seems to be connected with the feeling of sincerity and harmony which is found in the description of village life and permeates the letter.

5 Conclusions

On the whole it is possible to conclude that formal features allow differentiation of loci very effectively, sometimes confirming critics’ statements and in a number of cases revealing new facts. The existence of an opposition between town and village, considered very important by literary critics, is usually based in their works on subjective text interpretation. The results obtained here specify the difference and demonstrate the scale of divergence of the loci. Speaking of the chronotope components, it is also possible to conclude that poetic (fictional) time in this novel is of less importance for building a chronotope taxonomy than poetic space. Further steps in the analysis can be directed at the study of how real space (the places where Pushkin wrote the novel) and real time (when the chapters of the novel were written) are correlated with poetic space and time.

Acknowledgements

The project was carried out with the financial support of the Russian Foundation for Humanities, project N 15-04-00371.


References

Andreev, S. (2014). Representations of Tutchev’s Style: One Poet or Two? In G. Altmann, R. Čech, J. Mačutek & L. Uhlířová (Eds.), Studies in Quantitative Linguistics: Empirical Approaches to Text and Language Analysis. Dedicated to Ludek Hrebicek on the Occasion of his 80th Birthday (Vol. 17, pp. 14–28). RAM-Verlag.
Bakhtin, M. (2000). Forms of Time and Chronotope in the Novel [Formi vremeni i khronotopa v romane]. In S. G. Bocharov (Ed.), Mikhail Bakhtin. Epos i roman (pp. 9–193). Saint-Petersburg: Azbuka.
Burrows, J. F. (1987). Computation into Criticism: A Study of Jane Austen’s Novels. Oxford: Clarendon Press.
Can, F., & Patton, J. M. (2004). Change of Writing Style with Time. Computers and the Humanities, 38, 61–82.
Hoover, D. L. (2006). Stylometry, Chronology and the Styles of Henry James. In Proceedings of Digital Humanities 2006 (pp. 78–80). Paris.
Hota, S. R., Argamon, S., Koppel, M., & Zigdon, I. (2006). Performing Gender: Automatic Stylistic Analysis of Shakespeare’s Characters. In Proceedings of Digital Humanities 2006 (pp. 100–104). Paris.
Johnston, C. H. (1977). A. S. Pushkin. Eugene Onegin. Harmondsworth, Middlesex, England: Penguin Books Ltd.
Klecka, W. R. (1980). Discriminant Analysis. Quantitative Applications in the Social Sciences, 19. SAGE Publications, Inc.
Lotman, Y. M. (1968). Problems of Poetic Space in the Prose of Gogol [Problemi hudozhestvennogo prostranstva v proze Gogolya]. Trudi po russkoj i slavyanskoj filologii, 11, 5–50.
Lotman, Y. M. (2000). Symbolic Spaces [Symvolicheskije prostranstva]. In Y. M. Lotman (Ed.), Yu. M. Lotman. Semiosphera (pp. 297–335). Saint-Petersburg.
Martynenko, G. Y. (2004). Rhythmic-Sense Dynamics of the Russian Classical Sonnet [Ritmiko-smislovaja dynamika russkogo klassicheskogo soneta]. Sankt-Peterburgskij Universitet.
McMenamin, G. R., with contributions by Choi, D. (2002). Forensic Linguistics – Advances in Forensic Stylistics. Boca Raton / London / New York / Washington D.C.: CRC Press LLC.
Mikros, G. K. (2009). Content Words in Authorship Attribution: An Evaluation of Stylometric Features in a Literary Corpus. In R. Köhler (Ed.), Studies in Quantitative Linguistics 5: Issues in Quantitative Linguistics (pp. 61–75). RAM-Verlag.
Naumann, S., Popescu, I.-I., & Altmann, G. (2012). Aspects of Nominal Style. Glottometrics, 23, 23–55.
Pennebaker, J. W., & Lay, T. C. (2002). Language Use and Personality During Crises: Analyses of Mayor Rudolph Giuliani’s Press Conferences. Journal of Research in Personality, 36, 271–282.
Piotrowski, R. G. (2006). Linguistic Synergy: Basic Assumptions, First Results, Perspectives [Lingvisticheskaja synergetika: iskhodnije polozhenija, pervije resul’tati, perspektivi]. Sankt-Peterburgskij Universitet.
Rudman, J. (2003). Cherry Picking in Nontraditional Authorship Attribution Studies. Chance, 16(2), 26–32.
Stewart, L. (2003). Charles Brockden Brown: Quantitative Analysis and Literary Interpretation. Literary and Linguistic Computing, 18, 129–138.

Martina Benešová1 – Denis Birjukov2

Application of the Menzerath-Altmann Law to Contemporary Written Japanese in the Short Story Style

Abstract: The main objective of this experiment is verification of the applicability of the Menzerath-Altmann law to contemporary written Japanese. The sample text style chosen for the analysis is a short story by a Japanese author. The methodology, i.e., the sample text segmentation methods used in this experiment, partially stems from an earlier experiment on contemporary written Chinese, particularly the proposed language unit of the component (island). Simultaneously, new language units and processes of segmentation of Japanese texts are also proposed and tested in this experiment, as the Japanese writing system otherwise differs considerably from Chinese in many aspects. The analysis results are presented and interpreted in the paper.

Keywords: Menzerath-Altmann Law, written Japanese, segmentation, short story

The short story Shinkon-san (新婚さん) by Banana Yoshimoto3 was chosen as a sample text for initiating research aimed at testing the Menzerath-Altmann Law (MAL) as applied to the contemporary written Japanese language. The chosen edition of the story was first published in 1993. The following text requirement criteria were applied in the sample selection process:
1. an author with relative popularity among a wide (domestic) audience;
2. a short story style;
3. an uninterrupted and continuous sequence of language units on all language levels (…), and clear distinguishability of the beginning and end of the chosen text, as highlighted e.g. by Hřebíček (2002: 43);
4. a relatively contemporary (non-obsolete) language style.

1 Supported by the project CZ.1.07/2.3.00/30.0004 POSTUP.
2 Supported by the project IGA_FF_2014_083.
3 吉本 ばなな, born July 24, 1964.
1 Palacky University; [email protected]
2 Palacky University; [email protected]


The length of the sample text is 6,210 characters (according to the definition of “character” introduced below). The graphic principle of segmentation was chosen for this experiment due to the specific graphic layout of Japanese texts, and follows previous experiments with Chinese texts, see e.g. Motalová & Spáčilová (2014).4 A peculiarity of the Japanese writing system is its combined usage of originally Chinese kanji characters (now in a Japanese form),5 along with two Japanese-created syllabic kana scripts, hiragana and katakana, not only within a single sentence, but even within a single word. Additionally, Arabic numerals as well as Roman letters can also be used. In this regard, the Japanese script differs significantly from the Chinese script. Moreover, all Japanese words containing kanji characters can also be written using solely kana scripts, and although some rules and habitual practice for their combined usage apply, in many cases the method of usage depends on the author.

1 Segmentation of the Sample Text

For the sample text we chose to employ seven linguistic levels:
1. Surparagraph
2. Paragraph
3. Sentence
4. Intercomma
5. Character
6. Component
7. Stroke

Thus the following language level pairs to be tested in our MAL experiment were formed:
U0: Surparagraphs measured in the paragraph count – Paragraphs measured in the average sentence count
U1: Paragraphs measured in the sentence count – Sentences measured in the average intercomma count

4 We are aware of theories claiming that the Menzerath-Altmann Law reflects semanticity to a certain degree, see e.g. Hřebíček (2007), Andres et al. (2011); in the next stages of this research semanticity is planned to be included and explored.
5 See Jōyō Kanji Hyō (2010) for the current official jōyō kanji list proposed by the Ministry of Education.


U2: Sentences measured in the intercomma count – Intercommas measured in the average character count
U3: Intercommas measured in the character count – Characters measured in the average component count
U4: Characters measured in the component count – Components measured in the average stroke count

The above-mentioned selected language units are defined as follows.

1.1 Surparagraph

The surparagraph is a unit which represents thematic pieces of text more distinctively than normal paragraphs. Graphically, a surparagraph is highlighted by offsetting a piece of text with one or two empty rows and starting a new paragraph. This unit is not very frequent, not commonly cited or commented upon, and it was chosen solely on the basis of graphic distinctiveness. Due to the very low frequency of surparagraphs in the sample text (only 4 were present), the surparagraph-paragraph language level pair was excluded here. However, in future research on a longer sample containing more surparagraphs, its existence should not be ignored.

1.2 Paragraph

A paragraph always starts on a new line or column of text with an indentation, represented by one empty graphical field, at the start of the line or column. An exception to this rule is when a new paragraph starts with the quotation marks ‘「’ or ‘『’ (also ‘﹁’ or ‘﹃’). Although paragraphs do not have a grammatical or orthographical function, their usage is regulated by certain (not binding) rules. However, as a significant number of paragraphs in the sample text were created only to differentiate direct speech, it can be assumed that the MAL in the paragraph-sentence level pair might not be manifested to a very high degree. We assume that the MAL will be better manifested in styles such as academic papers or essays.6

6 See Bunshō no Kakikata, Totonoekata: 5, Danraku no Tsukurikata (2014) for more information about paragraph usage.


1.3 Sentence

The sentence is terminated by a dot ‘。’ (kuten), a question mark ‘?’ (gimonfu), or an exclamation mark ‘!’ (kantanfu). Cases exist where the sentence is terminated in another way, or is not terminated despite one of the punctuation marks being used. Yet in the current sample, these three (graphic) marks count as sentence termination markers for this experiment. With the exception of two ellipses ‘……’ (rīdā), no other potentially problematic marks appeared in the sample text and no semantic criteria had to be implemented. Interruption of a statement by an ellipsis was not considered a termination of the sentence.

1.4 Intercomma

In the Japanese language, the definition of a word is relatively problematic. There are several influential theories of grammar which vary significantly in their approach to the concept of a “word”. Thus, instead of a vague and semantically specified word unit, the experiment used the easily graphically segmentable unit of the intercomma, placed between the sentence and character units.7 The commas ‘、’ or ‘,’ (tōten) in Japanese have a mostly auxiliary function, helping readers read fluently and comprehend the written text more easily. However, their usage is not always bound by strict rules. The commas’ usage can be ambiguous, and it is often up to the author when and where to use the comma in questionable instances.8 However, if the comma appears in a text, it provides a clue about where to make a brief pause during reading. We thus assume that the Japanese comma always has the additional role of a graphical marker signalling where to slow down during reading. The intercomma segment is delimited by the beginning of a new sentence or a comma at one end, and by a comma or a sentence termination marker at the other end. This method allows us to identify the intercomma segment easily on a graphical basis.9

If the validity of the MAL is not confirmed on the strictly graphical intercomma-character level, it can consequently be tested using the semantically defined long-word (chōtan’i) and short-word (tantan’i) units based on the National Institute for Japanese Language and Linguistics corpus instead of, or along with, the intercomma unit.

7 In future research, we plan to define the word more precisely by employing the MAL.
8 The results of the analysis might to a certain extent reflect the subjective style of the author or editor of a sample text. See also Nitsū (2004: 25) for an example of auxiliary usage of the Japanese comma.
9 As this language unit is relatively artificial (we assume that few people perceive Japanese texts on the comma-to-comma level), it would not be very surprising if the MAL did not apply to at least the levels U3 and U4. Hiragana characters, which are usually less complicated regarding the component and stroke count than kanji characters, appear frequently, especially as verb affixes or for other grammatical functions, and due to the agglutinative nature of the Japanese language, they extend the form of a verb significantly (i.e. the number of uncomplicated characters rises). It may be assumed that the longer the sentence, the more conjunctions, verbs, etc., in which hiragana characters are plentiful, will be contained within the sentence.
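For illustration, a minimal Python sketch of the sentence and intercomma segmentation defined in sections 1.3-1.4, using purely graphical criteria; it ignores the ellipsis and quotation-mark special cases the authors describe, and the example paragraph is ours, not from the sample text:

```python
import re

TERMINATORS = "。？！"   # kuten, gimonfu, kantanfu (section 1.3)
COMMAS = "、，"          # tōten (section 1.4)

def sentences(paragraph):
    """Split a paragraph on 。？！, keeping the terminator with its sentence."""
    parts = re.split(f"([{TERMINATORS}])", paragraph)
    return [body + term for body, term in zip(parts[0::2], parts[1::2])]

def intercomma_segments(sentence):
    """Split one sentence into intercomma segments on the graphical commas;
    terminators stay inside the last segment (punctuation marks included)."""
    return [seg for seg in re.split(f"[{COMMAS}]", sentence) if seg]

paragraph = "彼は笑った。しかし、誰も、何も言わなかった！"
for s in sentences(paragraph):
    print(s, intercomma_segments(s))
```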

1.5 Character

The character is a basic graphic unit of the Japanese and Chinese scripts. Every character should occupy one equally sized graphic field regardless of the stroke count. The graphic field is shaped as a square or a rectangle whose height is not much larger than its width, see Švarný (1967: 31). Individual words and characters are usually not separated by spaces.

Segmentation into individual characters strictly adhered to the square-shaped graphic field criterion. Everything within the boundaries of the imaginary square/rectangle, and nothing more, was considered a single character. Roman letters in the text were treated in the same way as Japanese characters. All punctuation marks were also considered characters. The decision to include punctuation marks in the character unit is supported by the criterion of the graphic field, the existence of silent sentences,10 and finally by the strong influence of punctuation marks on the modality of a statement.11

10 Koike provides the following dialogue as an example of a silent sentence:
Father: 「おまえ、大学へ行くのか。」 (Listen, you will go to a university, right?)
Son: 「......。」
Father: 「行かないのか。」 (Or you won’t?)
Son: 「......。」
Father: 「黙ってないで、なんとか言いなさい!」 (Why are you silent, say something!)
Son: 「......。」
Father: 「考えが、まあだ、まとまってないのか!」 (You still haven’t decided?!)
Son: 「......。」
In case of omission of these silent sentences, the dialogue would lack meaning. The silent sentence ‘「......。」’ would be segmented into 4 characters, 9 components, and 11 strokes. See Koike (2010: 65).
11 See Monbushō Kyōkashokyoku Chōsaka Kokugo Chōsashitsu (1946) for more information about Japanese punctuation marks.


Because of the graphic-field criterion, combined moras (yōon) were treated as two single characters. Full stops in the sample text were treated as single independent characters consisting of one component and one stroke and occupying a discrete graphic field, with the following exception. Since the full stop is usually written together with the terminating quotation mark in a shared graphic field, these marks were (only in this case) treated as a single character consisting of two components.12

1.6 Component

Existing methods of segmenting the individual components (parts) of a Japanese character are not completely uniform and often stem from semantic viewpoints, which is not very suitable for an entirely graphic analysis of the sample text.13 Thus, the criterion of ‘islands’, which Motalová & Spáčilová (2014) used in their Chinese text analysis, was used in the segmentation of Japanese characters into components:

[This] segmentation method…divides the characters into components according to the contacts of strokes and thus so-called ‘islands’. On the basis of this conception, we regard the component…as a separate part of the character which is composed of one stroke or a group of strokes connected to one another and obviously separated from other parts (i.e. components) of the character (Motalová & Spáčilová 2014: 40).14

Although this method of segmentation differs considerably from the usually implemented methods of dividing characters into components, it is based on strictly graphic principles of segmentation. This method of segmentation of characters into components will be affected by the font type used, as discussed by Motalová & Spáčilová (2014: 40). The Mincho (明朝体) font was used for the analysis of the Japanese sample text. If the island segmentation method proves unfit, a different method of identifying character components, based on more semantic criteria, will be used in later research.

12 Various sectors of the square graphic field, e.g. Spahn (1989), could instead take the role of new stand-alone (graphic) language units in a future experiment. 13 Other approaches will be taken into account in future research addressing semantics. 14 Thus, e.g., the character ‘ 一 ’ consists of one component and the character ‘ 感 ’ of 8 components.
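The island criterion lends itself to a simple computational check: rasterize a character in the chosen font and count the connected components of the ink pixels. A sketch under stated assumptions: the font file path is a placeholder, anti-aliased rendering may merge nearly touching strokes, and, as the authors themselves note, the result depends on the font used:

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont
from scipy.ndimage import label

def island_count(char, font_path, size=64):
    """Count the 'islands' of one character: rasterize it in the given font
    and label the 8-connected components of the ink pixels."""
    img = Image.new("L", (size, size), 0)
    ImageDraw.Draw(img).text((4, 4), char,
                             fill=255,
                             font=ImageFont.truetype(font_path, size - 8))
    ink = np.asarray(img) > 0
    _, n_islands = label(ink, structure=np.ones((3, 3)))  # 8-connectivity
    return n_islands

# With a Mincho-style font file, island_count("一", "mincho.ttf") should give 1
# and island_count("感", "mincho.ttf") 8 (cf. footnote 14); "mincho.ttf" is a
# placeholder path.
```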


1.7 Stroke

The stroke is the smallest graphic unit of a Japanese text. Every Japanese character (in standard writing) is made up of a certain number and order of strokes, of which there is a limited variety. The stroke is a line (even a curved one) that is written uninterrupted, without lifting the writing implement during handwriting, assuming the standard stroke order is strictly observed.15

15 E.g., the character ‘一’ usually consists of one stroke and the character ‘及’ of three strokes. In this experiment, a stroke count conforming to the Shin Kangorin character dictionary (Kamada 2004) was used.
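The results below report, for each level pair, fitted values of A and b together with the coefficient of determination R². The formula itself is not printed in this excerpt; the reported parameter pairs are consistent with the truncated form of the MAL, shown here for reference together with the complete form mentioned by the authors elsewhere (Andres et al. 2014):

```latex
% Truncated MAL, consistent with the (A, b) pairs reported below:
%   x ... construct size (in constituents), y ... mean constituent size
y(x) = A \, x^{-b}
% Complete MAL, with an additional exponential factor:
y(x) = A \, x^{-b} \, e^{-cx}
```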

2 Results

2.1 Paragraph-Sentence Level Pair

A = 1.6999, b = 0.03067, R² = 0.02906

x: 1      2      3      4      5      6      7      9      15     24     25
z: 47     14     10     6      4      4      3      1      1      1      1
y: 1.2340 1.7143 1.8667 2.0000 1.5000 1.9583 2.0000 1.5556 1.2667 1.5000 1.3600

Fig. 1: Paragraph-sentence level pair
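A sketch of how parameters like those above can be estimated from the tabulated data, fitting the truncated MAL by ordinary least squares in log-log space. The chapter does not state its fitting procedure, so this need not reproduce the printed values exactly; the frequencies z are ignored here, which is precisely what the weighted method mentioned below (Andres et al. 2014) would change:

```python
import numpy as np

# Data of Fig. 1: x = paragraph length in sentences, y = mean sentence length
# in intercommas; the frequencies z of each x are not used by this plain fit.
x = np.array([1, 2, 3, 4, 5, 6, 7, 9, 15, 24, 25], dtype=float)
y = np.array([1.2340, 1.7143, 1.8667, 2.0000, 1.5000, 1.9583,
              2.0000, 1.5556, 1.2667, 1.5000, 1.3600])

# Fit the truncated MAL y = A * x**(-b) as a line in log-log space.
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
A, b = np.exp(intercept), -slope

# Coefficient of determination of the power-law model on the original scale.
residuals = y - A * x ** (-b)
R2 = 1 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
print(A, b, R2)
```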

The parameter b is a positive number, so the function is descending and convex. One of the assumptions for the MAL manifestation has been met, but R² turns out to be a mere 0.029. Thus, it is evident that the MAL is not a very adequate model on this language level for our sample text. One of the possible reasons may be that the text is relatively dialogue-heavy, and every termination of direct speech was accompanied by the termination of a paragraph. However, whether this really matters at all is yet another question.

Another reason may be the introduction of strictly graphically segmented intercomma language units, which omit other (semantic) aspects of a sentence.

2.2 Sentence-Intercomma Level Pair

Punctuation marks included: A = 14.2678, b = −0.0483, R² = 0.0178

x: 1      2      3      4      5      6
z: 152    78     27     8      1      1
y: 14.30  14.14  14.20  15.59  23.20  11.17

Fig. 2: Sentence-intercomma level pair (punctuation marks included)

Punctuation marks excluded: A = 13.2033, b = −0.0508, R² = 0.01735

x: 1      2      3      4      5      6
z: 152    78     27     8      1      1
y: 13.19  13.17  13.14  14.38  22.20  10.17

Fig. 3: Sentence-intercomma level pair (punctuation marks excluded)

The parameter b is a negative number, so the assumption for the MAL manifestation has not been met. However, there are not very many x-value points present in this figure, so the statistical relevance might not be very high. Moreover, the fifth and the sixth xs are extremes with a quantity of merely 1. If we exclude them, a slightly ascending trend is present, although it is still irrelevant for the MAL.16

16 A new method of calculating MAL parameters is being developed, and it will include the observation point frequencies as weights in the MAL parameter calculations (Andres et al. 2014).


As the inclusion of punctuation marks into the character unit is not a commonly used practice, we decided to make additional calculations with punctuation marks excluded from the character unit for comparison. As is evident from Fig. 3, however, it had minimal impact on this language level.

2.3 Intercomma-Character Level Pair

Punctuation marks included: A = 2.0153, b = 0.0054, R² = 0.0028

x: 2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18   19   20
z: 1    5    15   8    34   31   25   19   23   29   25   30   12   21   14   15   8    17   9
y: 2.50 2.07 1.75 1.58 2.22 2.03 1.88 1.94 1.89 1.98 2.00 2.07 1.99 2.05 1.87 2.11 2.08 1.87 1.87

x: 21   22   23   24   25   26   27   28   29   30   31   32   33   34   36   37   38   46   49
z: 11   8    7    12   8    8    7    4    5    2    1    1    5    4    2    1    3    1    1
y: 2.00 1.91 2.12 1.87 2.01 1.88 2.02 2.03 2.01 1.92 2.26 2.03 2.11 1.90 2.11 1.76 1.88 1.98 2.14

Fig. 4: Intercomma-character level pair (punctuation marks included)

Punctuation marks excluded: A = 2.5883, b = 0.0797, R² = 0.2843

x: 1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18   19
z: 1    7    14   9    38   29   25   20   22   31   25   27   12   19   17   14   8    14   9
y: 4.00 2.50 1.86 1.78 2.42 2.22 2.02 2.04 2.01 2.09 2.08 2.20 2.03 2.14 2.00 2.18 2.01 1.96 1.91

x: 20   21   22   23   24   25   26   27   28   29   30   31   32   33   35   36   37   43   48
z: 14   6    8    10   8    8    8    3    5    2    2    1    4    4    2    1    3    1    1
y: 2.04 1.98 2.16 1.87 2.05 1.91 2.09 1.99 2.05 1.95 2.40 2.06 2.05 1.93 2.14 1.75 1.90 2.05 2.17

Fig. 5: Intercomma-character level pair (punctuation marks excluded)

On the intercomma-character level, a more or less constant shape of the curve is observable. The parameter b is a positive number, but R² shows a goodness of fit of just under 0.003. Moreover, the ever so slightly descending shape of the curve is probably caused just by the first x-value point, with a quantity of 1. The other points are spread out rather haphazardly.

The graph in Fig. 5 is more interesting. After excluding punctuation marks from the character unit, the tendency has changed considerably. The parameter b is still a positive number, and R² has increased from almost 0 to 0.2843, even though it may be co-influenced by the first data point, with a quantity of 1. As punctuation marks have a substantial influence on the length of intercommas, this may be one of the reasons for the noticeable change, although in this way the graphic principle is disrupted. A new method of calculation is in development, incorporating the frequencies of occurrence of the x values, which the current method does not take into consideration. It would be interesting to see the shape of the curve which takes even the x-value point quantities into consideration on this language level.

2.4 Character-Component Level Pair

Punctuation marks included: A = 1.9312, b = 0.0305, R² = 0.0435

x: 1    2    3    4    5    6    7    8    9    12
z: 3242 1272 958  344  223  101  25   29   14   1
y: 2.03 1.78 1.70 1.94 1.61 2.23 2.06 1.64 1.87 1.67

Fig. 6: Character-component level pair (punctuation marks included)

Punctuation marks excluded: A = 2.0291, b = 0.0565, R² = 0.1271

x: 1    2    3    4    5    6    7    8    9    12
z: 2793 1224 954  344  223  101  25   29   14   1
y: 2.20 1.81 1.70 1.94 1.61 2.23 2.06 1.64 1.87 1.67

Fig. 7: Character-component level pair (punctuation marks excluded)


Taking into account our lowest language level pair, the results are similar. The parameter b is a positive number, but R² turned out to be only about 0.04, and the x values are scattered quite unevenly. At first we assumed that one of the reasons might have been the interference caused by our inclusion of punctuation marks in the character unit. But, as is evident from Fig. 7, after their subsequent exclusion only the first two observation data points changed, which led to only a subtle change in the curvature. Another possible reason might be the Japanese usage of syllabic kana scripts, which evolved from more complex kanji characters but which nowadays differ in many aspects. But as they follow the very same rules concerning the widely recognized graphic field criterion, it would be illogical to treat them any differently than kanji in graphic segmentation.

3 Conclusion

The overall result is that none of the language-level pairs of our analyzed text comply with the MAL under the method of graphic segmentation, but it is too early to conclude that the MAL does not hold for contemporary written Japanese in the short story style in general. As this was the first such experiment, other literary styles, and possibly additional proposed language units (such as long and short words), along with other currently tested MAL formulas, will be employed in future experiments.


References

Andres, J., Benešová, M., Chvosteková, M., & Fišerová, E. (2014). Optimization of Parameters in Menzerath-Altmann Law. Acta Mathematica, 53(1), 3–23. Olomouc: Univ. Palacki.
Andres, J., Benešová, M., Kubáček, L., & Vrbková, J. (2011). Methodological Note on the Fractal Analysis of Texts. Journal of Quantitative Linguistics, 18(4), 337–367.
Bunshō no Kakikata, Totonoekata: 5, Danraku no Tsukurikata [文章の書き方・ととのえ方: 5, 段落の作り方]. (2014). In Sanseidō Web Dictionary. Retrieved January 4, 2014, from http://www.sanseido.net/main/words/hyakka/howto/05.aspx
Habein, Y. S. (2000). Decoding Kanji: A Practical Approach to Learning Look-Alike Characters (1st ed.). Tōkyō: Kodansha International.
Hřebíček, L. (2002). Vyprávění o lingvistických experimentech s textem. Praha: Academia.
Hřebíček, L. (2007). Text in Semantics. The Principle of Compositeness. Prague: Oriental Institute.
Jōyō Kanji Hyō [常用漢字表]. (2010). Retrieved from http://www.bunka.go.jp/kokugo_nihongo/pdf/jouyoukanjihyou_h22.pdf
Kamada, T., & Yoneyama, T. (2004). Shin Kangorin Yōrei Purasu [新漢語林用例プラス]. Tōkyō: Taishūkan Shoten. (electronic dictionary)
Koike, S. (2010). Úvod do gramatiky moderní japonštiny (1st ed.). Brno: Tribun EU.
Monbushō Kyōkashokyoku Chōsaka Kokugo Chōsashitsu [文部省教科書局調査課国語調査室]. (1946). Kugiri Fugō no Tsukahikata [くぎり符号の使ひ方]. Retrieved from http://www.bunka.go.jp/kokugo_nihongo/joho/kijun/sanko/pdf/kugiri.pdf
Motalová, T., & Spáčilová, L. (2013). Aplikace Menzerath-Altmannova zákona na současnou psanou čínštinu (Mgr. thesis). Olomouc: Univerzita Palackého v Olomouci, Filozofická fakulta.
Motalová, T., & Spáčilová, L. (2014). Application of the Menzerath-Altmann Law on Current Written Chinese. Olomouc: VUP. (in print)
Nitsū, N., & Satō, F. (2004). Ryūgakusei no Tame no Ronritekina Bunshō no Kakikata [留学生のための論理的な文章の書き方]. Tōkyō: 3A Corporation.
Spahn, M., Hadamitzky, W., & Fujie-Winter, K. (1989). Kan-Ei Jukugo Ribāsu Jiten [漢英熟語リバース辞典]. Tōkyō: Hatsubai Kinokuniya Shoten. (Edition: Shohan)
Švarný, O. et al. (1967). Úvod do hovorové čínštiny: Příručka pro vys. šk. 2. Praha: SPN.
Yoshimoto, B. (2000). Tokage [とかげ]. Tōkyō: Shinchōsha.

Martina Benešová1 – Dan Faltýnek2 – Lukáš Hadwiger Zámečník3

Menzerath-Altmann Law in Differently Segmented Texts

Abstract: The aim of our paper is to discuss one of the fundamental assumptions of synergetic linguistics, namely that a language system tends to balance itself because it possesses self-regulating and self-organizing mechanisms, and to ask when this happens, if it ever does. We chose the well-known and often quoted Menzerath-Altmann law (MAL) as a model whose assumed validity would confirm that it does. The original verbal formulation of the MAL was extended to become a relationship between the generalized units of the construct and the constituent. A unit of a particular linguistic level therefore acts as a construct toward the unit of the immediately lower neighbouring level and as a constituent toward the unit of the immediately higher neighbouring level. This inevitably leads to the need to define both precisely on every linguistic level. Hence, the first pitfall to be discussed and commented on is the choice of suitable linguistic units and the rules of segmentation of a linguistic sample. One text was chosen and additionally tested using different segmentation criteria and units. The results are compared and interpreted.

Keywords: Menzerath-Altmann law, segmentation of linguistic sample, construct, constituent, synergetic linguistics

1 Supported by the project POSTUP Podpora vytváření excelentních výzkumných týmů a intersektorální mobility na Univerzitě Palackého v Olomouci, grant no. CZ.1.07/2.3.00/30.0004, financed by the European Social Fund and the National Budget of the Czech Republic.
2 Supported by the project Linguistic and Lexicostatistic Analysis in Cooperation with Linguistics, Mathematics, Biology and Psychology, grant no. CZ.1.07/2.3.00/20.0161, financed by the European Social Fund and the National Budget of the Czech Republic.
3 Supported by the project Linguistic and Lexicostatistic Analysis in Cooperation with Linguistics, Mathematics, Biology and Psychology, grant no. CZ.1.07/2.3.00/20.0161, financed by the European Social Fund and the National Budget of the Czech Republic.
1 Palacky University; [email protected]
2 Palacky University; [email protected]
3 Palacky University; [email protected]


1 Experiment Methodology

At present, most scientific experiments require the collaboration of a team of people specialized in different fields, and quantitative linguistic experiments are no exception. Teamwork, in turn, requires a common experiment methodology, whether the experiment is quantitative or not. In Fig. 1 we present a very simplified and general flow chart visualizing the basic steps of a quantitative linguistic experiment, which outlines the complexity of the problem. Naturally, at every step of the experiment one asks whether the requirements or expectations have been met, reaches a potential decision-making fork, etc.; for the sake of simplification, these delicate matters are not included in the chart. Every step of the experiment flow chart represents an individual, yet interlinked, piece of "sub-research" in which the research team members have to make decisions and set up rules.4 We soon came to the conclusion that every single stage of the algorithm calls for its own methodology. Additionally, we wanted to study each stage, which would have to be done one by one because of the interconnectedness of the steps. Nonetheless, even this target has proved infeasible, because some of the experimental stages are so bound up with one another that they cannot be studied separately. This is also the case with the technical steps of our research that we would like to put into the spotlight in this paper, namely determining the units and the consequent segmenting of the text (highlighted in grey in Fig. 1). Even so, we cannot avoid including and commenting on the other steps of the methodology. Neither of the two highlighted problems has so far been discussed independently and in depth, with attention to the after-effects of the different methods. Last but not least, we must mention that every change of, or within, an individual step may cause a flood of changes and shifts in the whole system. This is another reason to study the steps individually.

4 For better understanding, we therefore supply some examples of the problems and decisions a researcher has to make in research of our type.
Problem 1: Setting up the text units is inseparably dependent on the text chosen: whether it is an orally produced or a written text, which literary style it represents, etc. Therefore, we cannot avoid including the reasons for the text choice in our paper.
Problem 2: Choosing a particular form of a quantitative linguistic law (e.g., in the case of the Menzerath-Altmann Law, whether we employ the truncated or the complete formula; for more information see Andres et al. 2014), and using these formulas in pre-specified situations.
Problem 3: Quantified text characteristics can have different frequencies, and thus they may be differently significant for the researcher, which can be expressed, e.g., by including weights in the analysis (Andres et al. 2014).
Problem 4: For research verification, various statistical methods can be utilized.
Problem 5: In interpreting the research output, the researcher can prefer verbal, numerical, graphical, or other methods.


Fig. 1: A simplified flow chart of a quantitative linguistic experiment (using the MAL) with the steps in the spotlight

2 Problem to Consider

As already mentioned, we chose to study two steps of the experiment methodology above all: determining the text units and the consequent segmenting of the text. The question which then arose was whether alterations in setting up the units, and consequently in the text segmentation, would bring any change in the outputs of the experiment employing a chosen quantitative linguistic tool.5


In other words, if we regard the text as a dynamic system based on causal interactions among its elements (i.e., on the relations between the set of particular causes and the set of effects caused by them), do differences in the initial/boundary conditions yield divergent outcomes? If they do, how large are the differences? Can we predict the outcomes at all?

As mentioned in Footnote 5, the Menzerath-Altmann Law (MAL) was chosen as the quantitative linguistic tool. One motivation is Cramer (2005), which does not study the same problem but provides an overview of what had been learnt up to that time about the MAL. Different segmentation methods as well as different units were collected and presented in that study; those discussed or mentioned in the paper are listed in Tab. 1. However, they were applied to different texts, so they are not suitable for the planned comparison. Cramer (2005) also studies and discusses the step of the algorithm which we call in Fig. 1 "Finding parameters a_i, b_i, c_i, for i = 1, 2, 3, …"; that is, using mathematical and statistical instruments, she recalculates the results of the other authors mentioned in her paper, employing different formulas of the MAL.

Tab. 1: Language levels and units employed in MAL experiments by different researchers, as collected in Cramer (2005)6

Grégoire (1899), Menzerath (1928), Menzerath (1954): words (in syllables) – syllables (in sounds); words (in syllables) – syllables (in phonemes); words (in syllables) – syllables (in msec); words (in tones) – tones (in msec)
Gerlach (1982): words (in morphemes) – morphemes (in phonemes)
Heups & Köhler (1983): sentences (in clauses) – clauses (in words); sentences (in words) – words (in syllables)
Hřebíček (1989), …: text aggregates (in sentences) – sentences (in words)
Rothe (1983): correlation between word in graphemes and meanings; correlation between word in syllables and meanings; correlation between word in morphemes and meanings
(Altmann 1980): language constructs (in constituents) – constituents (in their constituents)

5 We have decided to employ the instrument of the Menzerath-Altmann Law firstly due to its suitability for such an experiment and secondly due to our experience with applying it, cf. e.g. Andres & Benešová (2011, 2012).
6 If mentioned in Cramer (2005). For bibliographical details of the works in Tab. 1, cf. Cramer (2005).
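For reference, the law mentioned in Footnote 4 (Problem 2) exists in two mathematical forms; the parametrization and signs vary across the literature, and the form given here, with positive b, is chosen to match the positive b values reported later in this chapter. The complete MAL relates the mean constituent length y to the construct length x as

$y = a\,x^{-b}\,e^{-cx},$

while the truncated formula drops the exponential factor:

$y = a\,x^{-b}.$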

In our case we found it useful to use one and the same text, so that comparing the outputs is justified. For our analysis we chose a text from the Olomouc Speech Corpus (currently under construction): a 23-minute spontaneous dialogue of four speakers. In transcribed form it is 26 pages long, i.e. about 21,000 signs (the length differs across segmentations and segmentators because of the notation and revision of the particular constructs and constituents). In the following section, three example paragraphs of the transcription of the chosen dialogue are shown. Each of those paragraphs represents the same utterance from our recording in the Olomouc Speech Corpus, transcribed in three different ways. We use the paragraphs as (a) illustrations of different text notations and segmentations, and (b) illustrations of the impact of the segmentation on text characteristics.

Segmentation 1 (S1) is a phonetic transcript.7 The units of Segmentation 1 are the speech sound, the syllable, and the stress unit; in sum, these units are products of the acoustic and articulatory structures of the text.

7 We did not use a full phonetic transcription. This was a practical decision without any influence on our experiment. For the Menzerath-Altmann Law analysis, the number of constructs and constituents is crucial, and from this point of view there is no big quantitative difference between the Czech graphical system and the International Phonetic Alphabet. But there are exceptions, for example the Czech diphthong ou. In the first segmentation, ou is notated as a single unit, whereas in the orthographical notation of a Czech text, ou consists of two units, so there is a difference in the number of constituents for the MAL analysis. For such cases (like ou) we use symbols from the International Phonetic Alphabet or our own ad hoc notation. The phonetic transcription includes many phenomena which cannot be found in an orthographical text: for example, phonetic reductions in which a speech sound is realized only as an acoustic feature of an adjacent speech sound; other examples are emphases, hesitation sounds, palatalization, or the epenthetic j between two vowels. These phonetic phenomena have a place in a phonetic transcription but not in an orthographically notated text, and we expect this to influence the MAL analysis. These examples show that the notation (and consequently the segmentation) of a text is the first and foremost aspect that has an impact on the shape of the object of analysis. The difference between the numbers of lowest-level units in Segmentations 1 and 2 is only five units in our example paragraphs, but over the whole text the difference is expected to be much larger: the numbers of lowest-level units in S1 and S2 differ by 1.7 % (15,059 : 15,320), though this figure includes insertions and elisions of units.


From this point of view, the syllable is the smallest articulatory unit, and syllables always combine into stress units.

Segmentation 2 (S2) is different. The Czech orthographical system is used for its notation; however, the notation takes the oral character of our text into consideration and maintains its features as far as possible. The lowest unit of Segmentation 2 is the grapheme; the higher units are the syllable and the graphical word. One can now see the big difference between Segmentation 1 and Segmentation 2 on the highest level: the counts are 38 to 71, i.e., Segmentation 1 contains significantly fewer of its highest units, the stress units, than Segmentation 2 contains words. This segmentation represents a practical approach to text segmentation, in which the researcher respects the graphical boundaries of letter sets, i.e. graphical words, and the level in between is segmented into syllables.

Now let us focus on Segmentation 3 (S3). Its lowest unit is the grapheme, its middle unit is the morpheme, and its highest unit is the graphical word. In comparison to Segmentations 1 and 2, there is a big difference on the middle level: there are 127 morphemes, compared to 106 syllables (in S1) and 111 syllables (in S2). We should highlight that we have not used the concept of the zero morpheme, which would significantly increase the number of morphemes on this level. We thus have three segmentations of one utterance, and we can see that segmentation has a very strong effect on the appearance or shape of the text; we can say that segmentation produces different samples for quantitative analysis.

Tab. 2: Differences in Segmentations 1, 2 and 3 illustrated in one paragraph (the first segment is underlined for better orientation)

Segmentation 1 (S1)
ne to-já-sem fpr-vá-ku to-sme-šli do-mel do-bel-mon-da mňe-vi-táh-li hol-ki a bi-la-tam a-ji spo-lu-žač-ka zgim-plu a-bi-la-tam sňá-kω par-tω no a-pros-ťe ti-jo ten-kluk-mi ňe-ko-ho pri-po-mí-ná je-to-on ne-ňí-to-on bil-to-on ja-ko-že pros-ťe je-ho-sem-fakt fo-lo-mω-ci ne-če-ka-la tak-sem-na-ňej u-pl-ňe ja-ko-že hm on-ta-ki tak sto-ho-bil ta-ko-vej

Segmentation 2 (S2)
ne to já jsem v pr-vá-ku to jsme šli do mel do bel-mon-da mě vy-táh-ly hol-ky a by-la tam i spo-lu-žač-ka z gym-plu a by-la tam s ně-ja-kou par-tou no a pros-tě ty jo ten kluk mi ně-ko-ho při-po-mí-ná je to on ne-ní to on byl to on ja-ko-že pros-tě je-ho jsem fakt v o-lo-mou-ci ne-če-ka-la tak jsem na něj ú-pl-ně ja-ko-že hm on ta-ky tak z to-ho byl ta-ko-vý

Segmentation 3 (S3)
ne t-o já j-s-em v prv-ák-u t-o j-s-me šl-i do mel do belmond-a m-ě vy-táh-l-y holk-y a by-l-a tam i spolu-žačk-a z gympl-u a by-l-a tam s ně-jak-ou part-ou no a prostě t-y jo t-en kluk m-i ně-k-o-ho při-pomín-á je t-o on ne-ní t-o on by-l t-o on jak-o-že prostě je-ho j-s-em fakt v olomouc-i ne-ček-a-l-a tak j-s-em na n-ěj úplně jak-o-že hm on taky tak z to-ho by-l tak-ov-ý


Accordingly, we wanted to test our conjecture that the outputs of an experiment analyzing a linguistic sample with the Menzerath-Altmann Law will differ under different initial segmentations, first from a micro point of view. That is, here we analyzed only the paragraph whose three different segmentations are shown in Tab. 2 and described above. To repeat, our motivation is to illustrate the influence of different segmentations on the manifestation of the MAL. The following list shows the quantities of constructs, constituents, and subconstituents of the chosen paragraph for each segmentation:
– S1 phonetic (38 stress units – 106 syllables – 245 sounds)
– S2 orthographical (70 graphical words – 111 syllables – 250 graphemes)
– S3 morphological (70 graphical words – 127 morphemes – 250 graphemes)

There are three embedded linguistic levels, and there are obviously deep differences among the quantities quoted above. The biggest differences appear on the highest of the three levels, especially when we confront Segmentation 1 with Segmentations 2 and 3, i.e. 38 stress units vs 70 graphical words. Another big difference arises when comparing the quantities of syllables vs. morphemes, especially Segmentation 1 vs Segmentation 3, i.e. 106 syllables vs 127 morphemes. The smallest difference is found between sounds and graphemes, i.e. 245 sounds in Segmentation 1 and 250 graphemes in Segmentations 2 and 3. Particular data for each segmentation are given in Tab. 3, where x represents the length of the construct measured in its constituents (e.g., in the case of Segmentation 1 it is the length of stress units in syllables), z stands for the frequency of the construct, and y represents the average length of the constituents in their subconstituents (e.g., in the case of Segmentation 1 it is the average length of syllables in sounds).

Tab. 3: A quantified example paragraph of the text showing the differences arising from different segmentations: x the length of the construct in its constituents (S1: the stress unit in syllables, S2: the graphical word in syllables, S3: the graphical word in morphemes), z its frequency, and y the length of its constituents in their subconstituents (S1: the syllables in sounds, S2: the syllables in graphemes, S3: the morphemes in graphemes).

x    Segmentation 1 (z, y)    Segmentation 2 (z, y)    Segmentation 3 (z, y)
1    5, 2.0000                48, 2.2826               33, 2.3939
2    8, 2.5000                10, 2.4091               21, 1.8095
3    15, 2.2667               9, 2.1111                13, 1.8974
4    10, 2.3250               4, 2.1875                2, 1.6250
5    –, –                     –, –                     1, 1.6000
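The step from a segmented transcript to the (x, z, y) triples of Tab. 3 is purely mechanical. The following minimal sketch is an illustration of the procedure, not the scripts actually used for this study; it assumes a Tab. 2-style input in which constructs are separated by spaces and constituents by hyphens, and the function name is ours.

from collections import defaultdict

def mal_triples(segmented):
    """Derive (x, z, y) from a hyphen-segmented transcript:
    x = construct length in constituents, z = number of constructs
    of length x, y = mean constituent length in subconstituents."""
    lengths = defaultdict(list)   # x -> lengths of all its constituents
    for construct in segmented.split():
        parts = construct.split("-")
        lengths[len(parts)] += [len(p) for p in parts]
    return [(x, len(v) // x, round(sum(v) / len(v), 4))
            for x, v in sorted(lengths.items())]

# the beginning of the Segmentation 2 line of Tab. 2
print(mal_triples("ne to já jsem v pr-vá-ku to jsme šli do mel do bel-mon-da"))

Run on this fragment, the sketch prints [(1, 11, 2.4545), (3, 2, 2.3333)]; applied to a whole segmentation line it yields the corresponding column of Tab. 3.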


Here we would like to point out the first difference, namely the difference among the distributions of the frequencies of constructs (z) of particular lengths (the z columns of Tab. 3). This quality cannot be illustrated in a graph depicting the relationship between the constructs and their constituents. Fig. 2 shows the graphs of the relationships between the constructs and their constituents quantified in Tab. 3. We would like to highlight that this is a graphical illustration of one quantified utterance, serving solely to show the differences arising from different segmentations.

Fig. 2: The graphical representations of the outputs of quantification shown in Tab. 3


The tendencies of the observation points are totally different from one another. S1 shows, across its four points, a non-monotonous yet overall slightly ascending tendency; S2 a markedly fluctuating tendency; and S3 again a non-monotonous yet overall descending tendency. Although we have presented just the data collected by quantifying one utterance, with only four to five data points, the graphs differ at first sight.

In the more extensive Tab. 4, we return to the original 23-minute dialogue with its 26-page transcription and present the data gained by processing the whole of our originally chosen text. The whole text was segmented in two ways: the first is a phonetic segmentation (corresponding to S1 above) and the other is an orthographic one (corresponding to S2). The criteria for these segmentations are the same as those stated for the segmentations of our short example paragraph.

Tab. 4: Construct lengths and frequencies and constituent lengths for the whole of the original text, showing the differences arising from different segmentations: x the length of the construct in its constituents (S1: the stress unit in syllables, S2: the graphical word in syllables), z its frequency, and y the length of its constituents in their subconstituents (S1: the syllables in sounds, S2: the syllables in graphemes).

x    Segmentation 1 (z, y)    Segmentation 2 (z, y)
1    403, 2.3151              2,149, 2.3527
2    1,074, 2.4074            1,220, 2.4008
3    778, 2.2935              456, 2.2968
4    255, 2.2863              95, 2.2000
5    78, 2.2026               35, 2.1257
6    25, 2.2067               5, 1.8667
7    5, 2.0571                –, –

At first sight there are again obvious differences in the frequency distributions (the z columns). The tendencies of the constituent lengths in relation to the construct lengths are again best presented graphically, in Fig. 3.
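For the record, the "classical" truncated-MAL fit reported below is an ordinary least-squares regression on the log-log scale, and it can be reproduced from Tab. 4 directly. The following sketch is ours (the authors used the R software, see Footnote 8); run on the S1 columns it recovers b ≈ 0.0549, the value reported below. The frequency-weighted variant is our illustration of the extension suggested in Footnote 9, and the choice of square-root weights is our assumption.

import numpy as np

# Tab. 4, Segmentation 1: construct length x, frequency z, mean constituent length y
x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
z = np.array([403, 1074, 778, 255, 78, 25, 5], dtype=float)
y = np.array([2.3151, 2.4074, 2.2935, 2.2863, 2.2026, 2.2067, 2.0571])

# Truncated MAL, y = a * x**(-b), is linear in log-log coordinates:
# log y = log a - b * log x
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
print("unweighted: a = %.4f  b = %.4f" % (np.exp(intercept), -slope))

# Frequency-weighted variant (our assumption: weight each point by
# sqrt(z), so that rare, long constructs influence the fit less)
wslope, wintercept = np.polyfit(np.log(x), np.log(y), 1, w=np.sqrt(z))
print("weighted:   a = %.4f  b = %.4f" % (np.exp(wintercept), -wslope))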


Fig. 3: The graphical representation of the quantification of the whole of the text of interest; the left graph for S1, the right one for S2. The descending regression curves show the visualization of the tendency given by the Menzerath-Altmann Law.

The layout of the observation points is not as dramatically different as in the case of our paragraph visualizations (Fig. 2); the tendencies plotted in both graphs are now generally decreasing, yet the two graphs still differ slightly at first glance. The regression curve shows a descending and convex tendency, which indicates compliance with the MAL. The goodness-of-fit of the regression curves is R2 = 0.5710 for S1 (the left graph in Fig. 3) and R2 = 0.6253 for S2 (the right graph in Fig. 3). However, if we follow particular points, their tendency does not fully satisfy the MAL's assumptions. In both graphs in Fig. 3, the juncture of the first and second observation points (the first point visualizing the shortest construct, x = 1, highlighted in Fig. 3 by grey circles) is ascending, which does not comply with MAL expectations. In the left graph, the downward-sloping tendency from the second to the seventh observation point is not smooth and not always absolutely convex either. In the right graph, the tendency from the second to the fifth point is downward-sloping and convex; nevertheless the last, sixth point slips down too far and breaks this downward-sloping, convex tendency (the same as in the left graph). Moreover, the importance of the observation points differs dramatically (as is obvious from the z columns of Tab. 4). The first observation point in Segmentation 1 is almost four times less frequent than the one in Segmentation 2, the third is almost two times more frequent, the fourth almost three times more frequent, the fifth two times more frequent, and the sixth five times more frequent. What is of even higher significance is that the frequencies of particular observation points within one segmentation analysis differ dramatically. This quality is, nonetheless, not visibly reflected in the following data gained using the "classical" MAL formula, which does not reflect the "significance" (i.e., the frequency) of each particular data point (b is one of the MAL parameters, R2 is the coefficient of determination, and the third piece of information is the 95% confidence interval8).9, 10


Segmentation 1: b = 0.0549, R2 = 0.5710, b ∈ 〈0.0002; 0.1097〉
Segmentation 2: b = 0.1095, R2 = 0.6253, b ∈ 〈−0.0082; 0.2273〉

Therefore, as a preliminary conclusion, we can say that by quantifying one text segmented in two different ways, we obtained two quite different sets of data, which nevertheless lead to visibly almost identical models gained by linear regression. Questions and remarks left for further consideration and experiments are:
– How is it possible that the data gained from our one paragraph differed so dramatically, while the overall data models came much closer to each other?
– Do the differences in the outputs gained by different segmentation methods disappear when the data set is extended? Always? If they do, how extensive do the data sets have to be to erase the differences completely?
– The importance, i.e. the frequency, of each observation point has not been fully included in the analyses' methodology, and we believe it should be, e.g. in the form of weights of each data point; see Footnote 9.

8 The data was gained by means of the freely available statistical software R.
9 To get rid of this weakness, we suggest involving the frequencies of the data points in the MAL and the other formulas considered; for more information, see Andres et al. (2014).
10 Even if it is not the main focus of the text, the discussion is supplied here with the calculated parameters b, the coefficients of determination, and the confidence intervals gained using one of the MAL formulas, the truncated one. Nevertheless, the authors do not wish to discuss here how satisfying the fit is or which formula is the best (this is discussed elsewhere, e.g. in Andres et al. 2014), but instead to highlight that none of them reflects and incorporates the frequencies of individual data points.


3 Analysis of the Laws of Quantitative Linguistics

It is a common feeling in the quantitative linguistics community that it is highly desirable to construct a valid model of functional explanation on the basis of the laws of quantitative linguistics. For this task it is necessary to evaluate the nature of the mathematical laws of quantitative linguistics, such as the MAL. There are several possibilities for what the MAL could be: a mere description, based on an appropriate mathematical formalism; a simple mathematical model that properly depicts the relevant linguistic systems; or an empirical generalization, confirmed by extensive empirical testing. We think that if a truly functional model of explanation is to be constructed, it certainly cannot be based on a law as a mere description or empirical generalization. If an explanation is constructed with the MAL as a model, then it needs to disclose the principles on which the model of the MAL was created. The current state of the development of quantitative linguistics suggests two variants of such principles: the MAL may be a manifestation either of economization principles or of conservation principles.

The system of economization principles is potentially the most fruitful aspect of the field of quantitative, or more concretely synergetic, linguistics. But, as in physics, the economization principles of linguistics are only useful heuristic tools.11 Köhler, for one, shows in several places how these principles help to depict the functioning of language systems. The basic economization principles at the level of phonemes and words are threefold: the principle of minimizing production effort (MinP), the principle of minimizing memory (MinM), and the principle of minimizing decoding effort (MinD).12

11 However, mathematical physics increasingly seeks to reduce such principles to a fundamental level (Fermat's principle), to reveal their statistical nature (increasing entropy), or to substitute for them conservation and spontaneous symmetry breaking principles (minimization of energy).
12 The principle of minimizing production effort (MinP) expresses the need to minimize the effort of movement and muscle coordination in speech. As a side effect, the principle brings an increasing similarity between the words produced. The principle of minimizing memory (MinM) functions in conjunction with the principle of minimizing production effort (MinP) and expresses an economization requirement to minimize the vocabulary used by the communicator. The principle of minimizing decoding effort (MinD) works against the previous two, because on the decoding side there is a requirement for a lesser degree of similarity between the decoded sounds and for a larger dictionary (see Köhler 2008: 761–763).


These principles form the basis of the language system, which can be modeled as a control circuit in which a balance among the effects of the mentioned requirements is established. Such a control circuit is highly stable and can assimilate the fluctuations necessarily arising in the processes considered. Transient equilibria are constituted through the assimilation of mutations of the language system via the selection procedures of speakers and listeners. The language system so defined then appears to be functionally explained. As in biology, the question arises as to the nature of functional explanation.

A great challenge for quantitative linguistics is to study the revealed mathematical regularities as conservation laws. From physics, the source discipline from which we transfer these conceptual inferences, we know that conservation laws are always borne by certain mathematical symmetries.13 For quantitative linguistics we can theoretically determine several potential conservation principles: a conservation principle (CP) of information complexity, a CP of lexical stability, a CP of grammar, a CP of syntactic structure, and a CP of semantic saturation. Each conservation principle would have to be borne by some spatial symmetry. For example, a CP of information complexity would be borne by the symmetry of computing a cluster of network units (of a language subject, Freeman 1999); a CP of lexical stability would be borne by the symmetry of lexical space (of a language subject, Hřebíček 2002). There may also be options for testing the validity of these principles in the fields of neurolinguistics and psycholinguistics, as in studies of aphasia and its influence on the transformation of the parameters of the MAL. Perhaps it is also feasible to observe the transformation of the parameters of the MAL under a shift of information complexity.

13 Emmy Noether, for example, was the first to elaborate this comprehensively. In classical mechanics, the conservation of momentum is borne by the symmetry of Euclidean space, which is homogeneous and isotropic. Similarly, the conservation of energy is borne by the symmetry of time.


References

Andres, J., & Benešová, M. (2011). Fractal Analysis of Poe's Raven. Glottometrics, 21, 73–100.
Andres, J., & Benešová, M. (2012). Fractal Analysis of Poe's Raven II. Journal of Quantitative Linguistics, 19(4).
Andres, J., Benešová, M., Chvosteková, M., & Fišerová, E. (2014). Optimization of Parameters in the Menzerath-Altmann Law II. Acta Univ. Palacki. Olomouc., Fac. rer. nat., Mathematica, 53(1), 3–23.
Cramer, I. M. (2005). The Parameters of the Menzerath-Altmann Law. Journal of Quantitative Linguistics, 1, 41–52.
Freeman, W. (1999). How Brains Make up Their Minds. London.
Hřebíček, L. (2002). Vyprávění o lingvistických experimentech s textem. Praha: Academia.
Köhler, R. et al. (2008). Quantitative Linguistics. Berlin.
Zámečník, L. (2014). The Nature of Law in Synergetic Linguistics. Glottotheory, 5(1), 101–119.

Łukasz Dębowski

A New Universal Code Helps to Distinguish Natural Language from Random Texts

Abstract: Using a new universal distribution called the switch distribution, we reveal a prominent statistical difference between a text in natural language and its unigram version. For the text in natural language, the cross mutual information grows as a power law, whereas for the unigram text, it grows logarithmically. In this way, we corroborate Hilberg's conjecture and disprove an alternative hypothesis that texts in natural language are generated by the unigram model.

Keywords: universal coding, switch distribution, Hilberg's conjecture, unigram texts

Polish Academy of Sciences; [email protected]

1 Introduction

G. K. Zipf supposed that texts in natural language differ statistically from the output of producing characters at random (Zipf 1965: 187), or, as we would say in modern terms, from unigram texts. With the advent of computational tools, this difference can be revealed using various experimental methods. For example, although both texts in natural language and unigram texts obey some versions of Zipf's law (Mandelbrot 1954; Miller 1957), some difference between texts in natural language and unigram texts can be discerned by merely investigating the rank-frequency distribution of words (Ferrer-i-Cancho & Elvevåg 2010). In this paper we wish to demonstrate a more prominent difference between random texts and texts in natural language which can be detected by means of universal coding. This experimental setup is closely related to Hilberg's conjecture (Hilberg 1990), an important hypothesis concerning the entropy of natural language. According to this hypothesis, natural language production forms a stationary stochastic process $(X_i)_{i=-\infty}^{\infty}$ on some probability space $(\Omega, \mathcal{J}, P)$, with blocks of consecutive symbols denoted $X_1^n = (X_1, X_2, \ldots, X_n)$, whereas the pointwise entropy of text blocks of length n, denoted $H_P(n) = -\log P(X_1^n)$, satisfies

$H_P(n) \approx B n^{\beta} + hn$,  (1)


where β ≈ 0.5 and h ≈ 0 (Hilberg 1990). This property may distinguish texts in natural language from unigram texts. Assuming that the distribution of characters is a vector of unknown random parameters of a unigram text, we obtain for this text that

$H_P(n) \approx B \log n + hn$,  (2)

where h > 0 and B is proportional to the number of parameters, cf. Grünwald (2007). In both expressions (1) and (2) the value of entropy is asymptotically dominated by the linear term hn. To make the difference between (1) and (2) more prominent, we may consider mutual information between adjacent blocks, $I_P(n) = 2H_P(n) - H_P(2n)$. In this way we obtain

$I_P(n) \propto n^{\beta}$  (3)

if Hilberg's conjecture is satisfied, whereas

$I_P(n) \propto \log n$  (4)

for unigram texts. Relationship (3), which we call the relaxed Hilberg conjecture, was initially investigated by physicists interested in complex systems (Ebeling & Nicolis 1991, 1992; Ebeling & Pöschel 1994; Bialek et al. 2001a, 2001b; Crutchfield & Feldman 2003), but later interesting linguistic interpretations were provided by Dębowski (2006, 2011). In particular, there are mathematical connections between the relaxed Hilberg conjecture (3) and various forms of Zipf's and Herdan's laws (Herdan 1964).

Testing Hilberg's conjecture is difficult. What we need are estimates of the entropy $H_P(n)$ or the mutual information $I_P(n)$ for block lengths n varying over a large range. It is quite costly or impossible to obtain such estimates using the guessing method (Shannon 1951) or the gambling method (Cover & King 1978). As an alternative, we may consider universal coding or universal distributions. For another probability distribution Q, let us denote the pointwise cross entropy $H_Q(n) = -\log Q(X_1^n)$ and the pointwise cross mutual information $I_Q(n) = 2H_Q(n) - H_Q(2n)$. By Barron's "no hypercompression" inequality (Barron 1985, Theorem 3.1; Grünwald 2007: 103), the cross entropy is greater than the entropy minus a logarithmic term,

$H_Q(n) \ge H_P(n) - 2 \log n$  (5)

for almost all n almost surely. Moreover, the defining property of a universal distribution Q is that the compression rate $H_Q(n)/n$ tends to the entropy rate as the text length tends to infinity,

$\lim_{n \to \infty} H_Q(n)/n = \lim_{n \to \infty} H_P(n)/n$  (6)

(Cover & Thomas 1991). Combining (5) and (6), we obtain that the cross mutual information is greater than the true mutual information minus a logarithmic term,

$I_Q(n) \ge I_P(n) - 2 \log n$  (7)

for infinitely many n (Dębowski 2011, Lemma 1). In this way, the code-length difference $I_Q(n)$ of a universal code is an estimate of the mutual information $I_P(n)$. In principle this estimate might be used for testing Hilberg's conjecture, but the problem is that we do not know how large the difference $I_Q(n) - I_P(n) + 2 \log n$ is.

Dębowski (2013a) tried to test Hilberg's conjecture using the Lempel-Ziv code (Ziv & Lempel 1977), the oldest known example of a universal code. In fact, the length of the Lempel-Ziv code for texts in natural language satisfies

$I_Q(n) \propto n^{\beta}$  (8)

with β ≈ 0.9 for text lengths in the range of n ∈ (10^3, 10^7) characters. The problem is, however, that the code-length difference $I_Q(n)$ of the Lempel-Ziv code for a unigram text is very similar, as we will show in the experimental part of this paper; cf. a theoretical result by Louchard & Szpankowski (1997). Hence the Lempel-Ziv code cannot be used for discriminating between texts in natural language and unigram texts. This does not mean that we cannot discriminate between natural language and unigram texts by universal coding in principle, but rather that another universal code should be used for that purpose.

Actually, we have recently introduced a prospective universal code called the switch distribution (Dębowski 2013b). The switch distribution is a development of an idea by van Erven et al. (2007). It is a generic probability distribution for data prediction. Formally, it is a mixture of infinitely many Markov chains of all orders, but it is effectively computable and has been proved to be a universal code. For the exact formula, which is a bit complicated, we refer to Dębowski (2013b). As we will show in the experimental part of this paper, for unigram texts the length of the switch distribution satisfies

$I_Q(n) \propto \log n$,  (9)

unlike the Lempel-Ziv code, whereas for texts in natural language the length of the switch distribution satisfies (8), like for the Lempel-Ziv code. In this way, we can discriminate between natural language and unigram texts using the switch distribution. Our observation also makes Hilberg's conjecture more likely.

The remaining part of the paper consists of two sections. In Section 2, we present the experimental data, whereas in Section 3 we offer concluding remarks.
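Neither the Lempel-Ziv implementation nor the switch distribution used in the experiments below is reproduced here. Purely to make the quantities $H_Q(n)$ and $I_Q(n)$ concrete, the following sketch builds the unigram version of a text and estimates code lengths with zlib's DEFLATE compressor, an LZ77-family method standing in for a universal code; the helper names and the input file name are illustrative assumptions, not the setup of this chapter.

import random
import zlib
from collections import Counter

def unigram_version(text, seed=0):
    """Sample characters i.i.d. from the character frequencies of the text."""
    freq = Counter(text)
    chars, weights = zip(*freq.items())
    rng = random.Random(seed)
    return "".join(rng.choices(chars, weights=weights, k=len(text)))

def code_length(s):
    """Code length H_Q(n) in bits, approximated by the DEFLATE output size."""
    return 8 * len(zlib.compress(s.encode("utf-8"), 9))

def cross_mi(text, n):
    """Cross mutual information I_Q(n) = 2 H_Q(n) - H_Q(2n) on initial blocks."""
    return 2 * code_length(text[:n]) - code_length(text[:2 * n])

# hypothetical input file containing the text of the novel
text = open("verne_20000_leagues.txt", encoding="utf-8").read()
fake = unigram_version(text)
for n in (2 ** k for k in range(10, 18)):
    print(n, cross_mi(text, n), cross_mi(fake, n))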


2 Experimental Data

In this section we experimentally investigate the length of the Lempel-Ziv code and of the switch distribution for a text in natural language and a unigram text. The considered text in natural language is "20,000 Leagues under the Sea" by Jules Verne, whereas the other text is the unigram model of this novel. The experimental data with regression functions are given in Figs. 1, 2, 3, and 4.

Visually, in the case of the Lempel-Ziv code, we observe no substantial difference between the two considered texts. For both texts, the compression rate $H_Q(n)/n$ decreases as a power law, viz. Fig. 1, whereas the code-length difference $I_Q(n)$ grows like a power law, viz. Fig. 2. If we apply the plain switch distribution, however, a huge difference arises. In Fig. 3, the compression rate for the text in natural language decreases as a power law, whereas the compression rate for the unigram text stabilizes. Moreover, in Fig. 4, the code-length difference for the text in natural language grows like a power law, whereas for the unigram text it grows logarithmically.

The data points can be approximated by the following least-squares regression functions, where the values after the '±' sign are standard errors. For the Lempel-Ziv code and the text in natural language, we have

$I_Q(n) = (0.64 \pm 0.17)\, n^{(0.936 \pm 0.002)}$,  (10)

whereas for the Lempel-Ziv code and the unigram text, we obtain

$I_Q(n) = (1.31 \pm 0.05)\, n^{(0.832 \pm 0.003)}$,  (11)

viz. Fig. 2. In contrast, for the switch distribution and the text in natural language, we obtain

$I_Q(n) = (0.67 \pm 0.10)\, n^{(0.898 \pm 0.010)}$,  (12)

whereas for the switch distribution and the unigram text, we have

$I_Q(n) = (74 \pm 4)\, \log\bigl((0.027 \pm 0.006)\, n\bigr)$,  (13)

except for the three last data points, viz. Fig. 4.
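Regression functions of the forms (10)-(13) can be obtained by nonlinear least squares. The sketch below is ours, with invented example measurements standing in for the real (n, I_Q(n)) data; it fits the competing power-law and logarithmic models and reports the standard errors of the kind quoted after the '±' signs above.

import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, beta):          # model family of (8), (10)-(12)
    return a * n ** beta

def logarithmic(n, a, b):           # model family of (9), in the form of (13)
    return a * np.log(b * n)

# invented example measurements: block lengths n and I_Q(n)
n = np.array([1024., 4096., 16384., 65536., 262144.])
iq = np.array([430., 1510., 5600., 20100., 74000.])

for name, model, p0 in (("power law", power_law, (1.0, 0.9)),
                        ("logarithmic", logarithmic, (100.0, 0.03))):
    params, cov = curve_fit(model, n, iq, p0=p0, bounds=(1e-9, np.inf))
    stderr = np.sqrt(np.diag(cov))  # standard errors of the fitted parameters
    print(name, params, stderr)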


Fig. 1: Compression rate HQ (n)⁄n vs. block length n for the Lempel-Ziv code.

Fig. 2: Pointwise cross mutual information IQ (n) vs. block length n for the Lempel-Ziv code. The lines are the regression functions.


Fig. 3: Compression rate HQ (n)⁄n vs. block length n for the switch distribution.

Fig. 4: Pointwise cross mutual information IQ (n) vs. block length n for the switch distribution. The lines are the regression functions.


The three last data points for the switch distribution and the unigram text are probably outliers, caused by some unknown numerical errors, since mathematical theory predicts that the cross mutual information for the switch distribution on a unigram text follows relationship (9) asymptotically. This theoretical result stems from the fact that the switch distribution is a mixture of Markov chains of all orders, whereas already for the mixture of Markov chains of zeroth order we obtain the scaling (9) (Grünwald 2007).

In spite of the mentioned numerical error, using the switch distribution we can still distinguish the text in natural language from the unigram text. As we have stated in the introduction, this possibility supports Hilberg's conjecture. Another interesting result is that we can reject, with an extremely low p-value, the hypothesis that the text in natural language was generated by the unigram model. For that purpose we use a stronger form of Barron's "no hypercompression" inequality. Suppose we have two probability distributions P and Q. The stronger form of Barron's inequality states that the probability of sufficiently long random data $X_1^n$ distributed according to P is very low if we can compress the data better using an alternative distribution Q. Namely, we have

$P\bigl(H_Q(n) \le H_P(n) - m\bigr) \le 2^{-m}$  (14)

(Barron 1985, Theorem 3.1; Grünwald 2007: 103). In our case P is the unigram model, Q is the switch distribution, and $X_1^n$ is the text of "20,000 Leagues under the Sea" with length n = 524,288 characters. For the unigram model we obtain the compression rate $H_P(n)/n = 4.4481$ bpc (bits per character), whereas for the switch distribution we have $H_Q(n)/n = 2.3018$ bpc. Hence the probability that the text could be compressed so well by the switch distribution if it were generated by the unigram model is less than

$2^{-524{,}288\,(4.4481 - 2.3018)} \le 2^{-1{,}000{,}000}$.  (15)
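The exponent in (15) follows by simple arithmetic:

$524{,}288 \times (4.4481 - 2.3018) = 524{,}288 \times 2.1463 \approx 1{,}125{,}279 \ge 1{,}000{,}000.$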

Thus the unigram model should be rejected. We think that Barron’s inequality can be used in a similar fashion for disproving other simple probabilistic hypotheses about natural language.
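The zeroth-order scaling invoked above can also be illustrated concretely: the Bayesian mixture of all order-0 (unigram) models is the Krichevsky-Trofimov (KT) estimator, whose code length is known to exceed the empirical entropy of the data by roughly ((K − 1)/2) log2 n bits for an alphabet of K characters, which yields exactly the logarithmic growth of (9). The following minimal sketch is our illustration, not part of the original experiments.

import math
from collections import Counter

def kt_code_length(text):
    """Code length (bits) of the Krichevsky-Trofimov mixture over all
    order-0 models: character a is predicted with probability
    (count(a) + 1/2) / (t + K/2) after t characters have been seen."""
    K = len(set(text))        # alphabet size, estimated from the text itself
    counts = Counter()
    bits = 0.0
    for t, ch in enumerate(text):
        bits -= math.log2((counts[ch] + 0.5) / (t + K / 2))
        counts[ch] += 1
    return bits

For a unigram text, 2 * kt_code_length(text[:n]) - kt_code_length(text[:2 * n]) then grows like ((K − 1)/2) log2 n, i.e. logarithmically in n.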

3 Conclusion

In this paper, using a new universal distribution called the switch distribution, we have shown a prominent statistical difference between a text in natural language and its unigram version. The difference consists in a different growth rate of the cross mutual information: namely, for the text in natural language, the cross mutual information grows as a power law, whereas for the unigram text, it grows logarithmically. This observation corroborates Hilberg's conjecture, an important hypothesis concerning natural language, and disproves the alternative hypothesis that texts in natural language can be generated by the unigram model. Further investigation is needed to illuminate why for the text in natural language we observe the power-law exponent β ≈ 0.9, whereas the data analyzed by Hilberg (1990) suggest β ≈ 0.5.

References

Barron, A. R. (1985). Logically Smooth Density Estimation (Ph.D. thesis). Stanford University.
Bialek, W., Nemenman, I., & Tishby, N. (2001a). Complexity through Nonextensivity. Physica A, 302, 89–99.
Bialek, W., Nemenman, I., & Tishby, N. (2001b). Predictability, Complexity and Learning. Neural Computation, 13, 2409.
Cover, T. M., & King, R. C. (1978). A Convergent Gambling Estimate of the Entropy of English. IEEE Transactions on Information Theory, 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. New York: John Wiley.
Crutchfield, J. P., & Feldman, D. P. (2003). Regularities Unseen, Randomness Observed: The Entropy Convergence Hierarchy. Chaos, 15, 25–54.
Dębowski, Ł. (2006). On Hilberg's Law and its Links with Guiraud's Law. Journal of Quantitative Linguistics, 13, 81–109.
Dębowski, Ł. (2011). On the Vocabulary of Grammar-based Codes and the Logical Consistency of Texts. IEEE Transactions on Information Theory, 57, 4589–4599.
Dębowski, Ł. (2013a). Empirical Evidence for Hilberg's Conjecture in Single-Author Texts. In I. Obradović, E. Kelih & R. Köhler (Eds.), Methods and Applications of Quantitative Linguistics: Selected Papers of the 8th International Conference on Quantitative Linguistics (QUALICO) (pp. 143–151). Belgrade: Academic Mind.
Dębowski, Ł. (2013b). A Preadapted Universal Switch Distribution for Testing Hilberg's Conjecture. Retrieved from http://arxiv.org/abs/1310.8511
Ebeling, W., & Nicolis, G. (1991). Entropy of Symbolic Sequences: the Role of Correlations. Europhysics Letters, 14, 191–196.
Ebeling, W., & Nicolis, G. (1992). Word Frequency and Entropy of Symbolic Sequences: a Dynamical Perspective. Chaos, Solitons and Fractals, 2, 635–650.
Ebeling, W., & Pöschel, T. (1994). Entropy and Long-range Correlations in Literary English. Europhysics Letters, 26, 241–246.
Ferrer-i-Cancho, R., & Elvevåg, B. (2010). Random Texts do not Exhibit the Real Zipf's-Law-Like Rank Distribution. PLoS ONE, 5(3), e9411.
Grünwald, P. D. (2007). The Minimum Description Length Principle. Cambridge, MA: The MIT Press.
Herdan, G. (1964). Quantitative Linguistics. London: Butterworths.
Hilberg, W. (1990). Der bekannte Grenzwert der redundanzfreien Information in Texten – eine Fehlinterpretation der Shannonschen Experimente? Frequenz, 44, 243–248.


Louchard, G., & Szpankowski, W. (1997). On the Average Redundancy Rate of the Lempel-Ziv Code. IEEE Transactions on Information Theory, 43, 2–8.
Mandelbrot, B. (1954). Structure formelle des textes et communication. Word, 10, 1–27.
Miller, G. A. (1957). Some Effects of Intermittent Silence. American Journal of Psychology, 70, 311–314.
Shannon, C. (1951). Prediction and Entropy of Printed English. Bell System Technical Journal, 30, 50–64.
van Erven, T., Grünwald, P., & de Rooij, S. (2007). Catching up Faster in Bayesian Model Selection and Model Averaging. In Advances in Neural Information Processing Systems 20 (NIPS 2007).
Zipf, G. K. (1965). The Psycho-Biology of Language: An Introduction to Dynamic Philology (2nd ed.). Cambridge, MA: The MIT Press.
Ziv, J., & Lempel, A. (1977). A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory, 23, 337–343.

Sheila Embleton – Dorin Uritescu – Eric S. Wheeler

The Advantages of Quantitative Studies for Dialectology

Abstract: Quantitative dialectology is able to make much stronger claims than traditional dialectology because it can count all the occurrences of a phenomenon over a wider range of data; it can disprove claims that are based only on some or a few examples; and it can uncover relationships that are not obvious until one looks at large amounts of data. Even the definition of dialect changes when it is expressed in terms of quantified gradations of change.

Keywords: dialectology, quantification, Romanian, Finnish, natural lenition, Crişana, multidimensional scaling

York University; [email protected], [email protected]

1 Introduction

Over the last century or more, it has been common to study language by looking at the pattern of its forms, from the inventory of phonemes to morphological paradigms, syntactic arrangements, and on to the structure of running discourse. Quantitative studies have kept pace with this activity by counting these elements and modelling them, in recent times with increasing sophistication. The observations of Zipf (1947) about word frequency and the pioneering expositions of Brainerd (1974) have grown into an extensive body of observations about many different linguistic patterns, using a variety of statistical models (surveyed in Grzybek 2006). Likewise, language has been studied as the product of human interaction and social structure, and the success of variationist linguistics in identifying critical social factors such as age, class, and gender is all the more convincing because it is supported extensively by quantitative studies (see Chambers 2003).

The study of language as it varies with geography (dialectology) has always been data intensive, but perhaps because of this dependence on many data points spread over a geographic region, it has only recently become possible, through the use of modern information technology, to do dialectology in a more quantitative way. In our work (for example, Embleton, Uritescu & Wheeler 2008a, 2008b), we have been able to count over multiple data points at each location and to make visual presentations of the quantities, something that requires software capable of doing that work. Furthermore, we have used advanced statistical techniques (such as multidimensional scaling) to represent the relationship between language variation and geography. This work is parallel to the quantitative work done for language-as-form and language-as-social-variant, but it is based on a different view of language. Combined with a study of what happens when language variants come into contact, it is a view of language that is concerned with what is different about your language and mine, and how that difference came to be.

Quantitative dialectology is able to make much stronger claims than traditional dialectology because it can count all the occurrences of a phenomenon over a wider range of data; it can disprove or reinforce claims that are based only on some or a few examples; and it can uncover relationships that are not obvious until one looks at large amounts of data. It can prevent bias on the part of the researcher, because all the relevant data is present and all of it requires consideration, avoiding the situation where the researcher only sees evidence in support of a favoured hypothesis. Even the definition of "dialect" changes when it is expressed in terms of quantified gradations of change. As the examples we cite below show, the use of quantitative methods in dialectology has room to grow, but it offers great promise for what it can tell us about the nature of language.

2 The Quantified Phenomenon

Using the Romanian Online Dialect Atlas, RODA (Embleton, Uritescu & Wheeler 2009), it is possible to search the equivalent of several hundred hard-copy pages of a dialect atlas for the north-western region of Romania, known as Crişana, and count the occurrence of specific patterns anywhere in the recorded data. For example, in Crişana, with the exception of the northern corner (a mountainous area called Oaş), dentals are palatalized before front vowels. If the dentals are actually being restructured as palatals and the palatalization process is no longer productive, it is possible that there would be dentals that do not palatalize. In Fig. 1, we see examples of non-palatal /t/ before /e/ and /i/ in almost all locations, not only in Oaş. Indeed, there are many examples of /ti/ (vertical bars) and some of /te/ (horizontal bars) at almost all locations. We have excluded /st/, which we expect to result in non-palatalization (Uritescu 2007: 169–183; 2011: 61–62), and we scale the values by 3 to make them more visible, using one of the built-in search-management functions in RODA. The RODA Toolbench allows us to see both the frequency and location of these examples and (not shown here) the supporting data. The quantitative approach confirms the pre-quantitative view (that Oaş is exceptional) and supports it with measurable results, but it also highlights the limits of and exceptions to this view.


Fig. 1: /t/ not palatalized before /e/ and /i/
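RODA's own query language and data format are not reproduced here. Purely as an illustration of the kind of exhaustive pattern count behind Fig. 1, the following sketch tallies non-palatalized /t/ before /e/ and /i/ while excluding the /st/ context and scaling by 3; the miniature data set and the word forms in it are invented for the example.

import re

# invented miniature data set: location id -> transcribed forms
locations = {
    "137": "frate tinere munte stinge",
    "158": "fraće ćinere munće stinge",
}

# /t/ immediately before /e/ or /i/, except in the /st/ context
pattern = re.compile(r"(?<!s)t(?=[ei])")

for loc, transcript in sorted(locations.items()):
    hits = len(pattern.findall(transcript))
    print(loc, 3 * hits)   # scaled by 3 for display, as in Fig. 1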

The quantified view, however, also makes life more complicated for the dialectologist because it is no longer enough to make a categorical statement (“palatalization happens everywhere except in Oaş”); at the very least, it must be hedged with other considerations and analyses (“non-palatalization of dentals occurs not only in Oaş, but also in the other parts of Crişana before recent front vowels, in recent borrowings, in literary words…”). It invites the researcher to say why this difference occurs, what its consequence is on the system of the two sets of dialects, and whether or not there is a natural threshold for this phenomenon. These are all interesting questions that fall out directly from making a (small) quantitative study. Quantification can drive the direction of research.

3 The Hidden (Non)-Relationship

In standard Romanian and most dialects, certain words derived from Latin (such as Latin canto 'I sing', oculum 'eye') have lost any trace of the final high vowel ([u] in Balkan Romance and hence in Modern Romanian): hence Romanian cânt, ochi. In most parts of Crişana, these words preserved final /u/, either syllabic or non-syllabic (cântu [syllabic], cântu [non-syllabic]). A count over all the relevant data shows the distribution in Fig. 2. (In some other northern dialects of Daco-Romanian, only the non-syllabic final /u/ was retained.)


Fig. 2: Syllabic and non-syllabic word final /u/

A manual review of the supporting data allows us to eliminate certain non-relevant examples involving non-syllabic /u/, the definite article, and /u/ following certain consonant clusters (Fig. 3). The final pattern shows that the syllabic form is predominantly preserved in a few locations in the central area, as might be expected of a conservative feature that has been replaced elsewhere through other phenomena.

Fig. 3: Syllabic /u/ without certain items such as the definite article


Another study looked at word-final raised /e/ and schwa. These vowels are represented by more than one RODA symbol (e.g. raised /e/ is i1, e3, and e7) but individual searches can be added together with search management functions in RODA. In Fig. 4, we compare raised /e/ and raised schwa: both are widespread, but there are clear examples where they do not always coincide (see locations 158 vs 159, or 163 vs 173, 175, and 177). They mostly coincide in the central and southern area which we can interpret as the outcome of a lenition process for final mid-vowels. Where they do not coincide can be explained as the vowels having different weights in the lenition process. In most but not all areas, /e/ seems weaker than schwa.

Fig. 4: Raised /e/ and raised schwa

When we compare the weakening of mid vowels to retained /u/ in Fig. 5, we see that the two patterns are not the same. There are locations (e.g. 137, 141, 146) where both the conservation of one vowel and the weakening of another vowel occur. (Fig. 5 shows retained /u/ vs. raised /e/ in a central region; RODA allows one to zoom in on selected parts of any map.)


Fig. 5: Word final /u/ versus raised /e/ in a central region

We are led to the conclusion that the raising of final mid-vowels and the weakening of final high vowels are distinct natural lenition processes. It would be understandable for an analyst working qualitatively to want to express these changes (and possibly others) as a common lenition process, and (as we hypothesized here for /e/ and schwa), it may be that some of the differences can be attributed to different parametric values in a common process. It is harder to justify both the /u/ and the /e/+schwa changes that way when one sees the full count of data. The quantificational perspective leads us to conclude that the commonality is not there. With quantity, we not only can see what is hard to see qualitatively, but we can identify what is really not there as well.

4 The Correlation of Dialect and Geography

The General Online Dialect Atlas (GODA) is a reworking of RODA, intended to allow more scope to deal with other languages. In its first application, it is being developed with Robert Sanders (University of Auckland) and the authors of an extensive hard-copy atlas of Chinese dialects (see Atlas; and Sanders & Wheeler 2014); at this point, we have only a small sample of the digitized data to work with, so there are no valid observations to make about Chinese. Nonetheless, we have evolved some tools to help study the impact of geography and language contact on the dialects of Chinese.


One of the proposed studies is to find a correlation between geography and dialect. In particular, if we map locations in linguistic space (defined by the dialect distance), we can compare that value to their geographic location. In an earlier effort (Embleton, Uritescu & Wheeler 2012), we looked at several ways of comparing geographic (G) and linguistic (L) distance, and argued that an MDS (multidimensional scaling) view of G-L gave a view that “readily shows the locations of interest, where the bulk of the discrepancy occurs, and the degree to which those locations vary from all the other locations.” See Fig. 6 for a sample map using our dialect data for Finnish. In this map, regions from the south-west, north, and east are shown (originally in different colours). For the most part, each region is compactly displayed, indicating that the relationship between G and L is similar for all those locations. The outliers are mostly represented by “other” points (points where the underlying data may have been missing) and a few points that may be from more remote locations. The interactive nature of the map on the RODA Toolbench itself would allow the researcher to identify individual locations and investigate the relationships.

Fig. 6: An MDS projection of the variance between linguistic and geographic distances on Finnish dialects.

However intuitive and useful that approach is, it does not give a quantitative measure of the correlation. The Mantel test is a method of testing the correlation between two distance matrices, and several packages implementing it are available for the R statistics system (see www.r-project.org/). We have run a Mantel test on the Finnish data and found a correlation coefficient of -0.675431 with a p-value of 0.000999001. We interpret these numbers as meaning that the linguistic measure is strongly correlated with distance, and that the certainty of this result is high. This, of course, is what one would expect in dialectology (the geography of language variation), but there is also the opportunity to look at where the correlation is not strong or not as expected, and to seek influences (such as language contact) that cause the simple relationship to deviate from expectations.
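A minimal permutation version of the Mantel test, written to make the logic explicit, is sketched below; it is our illustration, not the R implementation used for the result above.

import numpy as np

def mantel(G, L, permutations=1000, seed=0):
    """Mantel test: Pearson correlation between two distance matrices,
    with a permutation p-value obtained by relabelling the locations."""
    iu = np.triu_indices_from(G, k=1)       # upper triangle, diagonal excluded
    g = G[iu]
    r_obs = np.corrcoef(g, L[iu])[0, 1]
    rng = np.random.default_rng(seed)
    exceed = 0
    for _ in range(permutations):
        p = rng.permutation(len(L))         # permute rows and columns jointly
        if abs(np.corrcoef(g, L[p][:, p][iu])[0, 1]) >= abs(r_obs):
            exceed += 1
    # include the observed statistic in the null sample
    return r_obs, (exceed + 1) / (permutations + 1)

Incidentally, a p-value of 0.000999001 equals 1/1001, which is exactly what this counting scheme returns for 1,000 permutations when no permuted correlation reaches the observed one.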

5 Impact on Theory

Specific quantificational studies have specific results, as in the examples above, but the impact of quantificational study goes beyond that to the conceptual view of a subject. In dialectology, it has been a mainstay of the field to identify dialects. It has been recognized that this can be difficult sometimes, and that there are, for example, possibilities of graduated change in dialects ("dialect continua"), but the fundamental concept remains: each dialect is an identifiable region with an identifiable language variant.

However, when one looks at the field as a subject of quantificational study, the fundamental concept starts to break down. For example, with RODA it is possible to search for and count a seemingly unlimited number of patterns. As in the examples above, we could look for /e/ before or after a given string; in word initial, medial, or final position; with certain flags or not; and then repeat the search with any of the nearly 100 symbols, or any string of them. Each search result gives a geographic pattern, each of which is in general different from the others. Which pattern (or combination of patterns) defines the dialect? Where do you draw the isogloss? In earlier expositions of Indo-European dialects, the satem/centum isogloss (for which the word for "hundred" began with either [s] or [k]) was a major dividing line. But why should this item rather than any other get priority (see Lehmann 1962: 27)?


In recognition of the existence of multiple isoglosses, the concept of an isogloss bundle has been put forward. An isogloss bundle is a set of isoglosses that coincide, at least for part of their length (Lehmann 1962: 127). But in practice, the isogloss bundles are limited to whatever isoglosses were studied, i.e., they were not “all” the possible isoglosses. Furthermore, for any given pattern (such as Fig. 2 or Fig. 4 above), there is in general no clean split between a feature being present or not present. Rather, there is a discrete set of measures that vary over an interval. We can make the situation black and white by picking a threshold, but the choice of threshold could still be quite arbitrary, with different thresholds giving different results. Perhaps there is a natural method for dividing some of the situations (e.g. in Fig. 1, there seems to be a clear division between those points that have the feature strongly, and those that have it only a little) but that may not always be the case (as in Fig. 2 or 4).

Fig. 7: A view of dialect as a quantified, multidimensional object, each “slice” of which is a potential dialect map.


Our interpretation of such challenges is to question the whole concept of dialect (Embleton, Uritescu and Wheeler 2008b, see Fig. 4). We have tried to suggest that geographic variation for language is a many-dimensional object, any slice of which can be seen as a pattern (see Fig. 7). The deep valleys (the differences that persist over many different searches) become dividing lines of interest, and raise questions about why they are where they are, and whether or not they will persist or are merely the accidents of history. The notions of dialect, dialect continuum, and gradual change become secondary concepts, defined on the more fundamental, quantified object: language in variation.

6 Summary

Quantificational studies have brought much value to the study of language. When applied to dialectology, with the use of large, digitized data sets and the appropriate technology for managing the sets, they reveal phenomena that are not otherwise obvious, and in similar fashion, allow us to show that some things are not present. With the appropriate use of sophisticated techniques, we can not only demonstrate relationships but also give some measure of their reliability. But more basic than any of these, quantificational studies force us to think about fundamental concepts in new ways. In dialectology, there is still a great opportunity to grow our understanding of the subject, and quantificational methods provide a promising means of doing that.


References

Atlas. Linguistic Atlas of Chinese Dialects [汉语方言地图集]. The Commercial Press [商业印书馆].
Brainerd, B. (1974). Weighing Evidence in Language and Literature: A Statistical Approach. Mathematical Expositions 19. Toronto: University of Toronto Press.
Chambers, J. K. (2003). Sociolinguistic Theory (2nd ed.). Oxford: Blackwell Publishing.
Embleton, S., Uritescu, D., & Wheeler, E. S. (2008a). Digitalized Dialect Studies: North-Western Romanian. Bucharest: Romanian Academy Press.
Embleton, S., Uritescu, D., & Wheeler, E. S. (2008b). Identifying Dialect Regions: Specific Features vs. Overall Measures Using the Romanian Online Dialect Atlas and Multidimensional Scaling. Leeds, UK: Methods XIII Conference, August 2008. In B. Heselwood & C. Upton (Eds.). (2009), Proceedings of Methods XIII. Papers from the Thirteenth International Conference on Methods in Dialectology, 2008 (pp. 79–90). Frankfurt am Main: Peter Lang.
Embleton, S., Uritescu, D., & Wheeler, E. S. (2009). RODA 2.15.1. Retrieved from archive at York Space. Dialectology: http://yorkspace.library.yorku.ca/xmlui/handle/10315/2803
Embleton, S., Uritescu, D., & Wheeler, E. S. (2012). Effective Comparisons of Geographic and Linguistic Distances. In G. Altmann, P. Grzybek, S. Naumann & R. Vulanović (Eds.), Synergetic Linguistics. Text and Language as Dynamic Systems (pp. 225–232). Vienna: Praesens Verlag.
Grzybek, P. (2006). History and Methodology of Word Length Studies. In P. Grzybek (Ed.), Contributions to the Science of Text and Language. Word Length Studies and Related Issues. Text, Speech and Language Technology (Vol. 31). The Netherlands: Springer.
Guillot, G., & Rousset, F. (2011). On the Use of the Simple and Partial Mantel Tests in Presence of Spatial Auto-Correlation. Retrieved March 20, 2014, from http://orbit.dtu.dk/fedora/objects/orbit:40911/datastreams/file_6334983/content
Lehmann, W. P. (1962). Historical Linguistics: An Introduction. Holt, Rinehart & Winston.
Sanders, R., & Wheeler, E. S. (2014). Creating the Chinese Online Dialect Atlas. Presented at Methods in Dialectology XV, Groningen, NL, August 2014.
Uritescu, D. (2007). Sincronie şi diacronie. Fonetismul unor graiuri din nordul Banatului (Ediţia a doua revăzută şi adăugită). Cluj-Napoca: Clusium.
Uritescu, D. (2011). Formel et naturel dans l'évolution phonologique et morphophonologique: Essais de linguistique générale et romane. University, Mississippi: Romance Monographs.
Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Reading, MA. (Reprinted 2012 by Martino Publishing)

Ján Mačutek

Type-token relation for word length motifs in Ukrainian texts

Abstract: A word length motif is a sequence of non-decreasing values of word lengths. Properties of motifs analysed so far (e.g., frequency distribution and motif length) indicate that the mathematical models for word length motifs are the same as the ones for words. This paper studies another of their properties, the type-token relation. An analysis of 70 Ukrainian texts shows that in this case, too, the mathematical model for the word type-token relation fits the motifs. Relations among the parameter of the model, the total number of types, and the total number of tokens in the texts are investigated.

Keywords: length motifs, type-token relation, Ukrainian

1 Introduction

A motif is defined in linguistics as a continuous sequence of non-decreasing (i.e., equal or increasing) values of some other, "lower" units. For example, the word length sequence 2-1-1-3-3-1-2 is segmented into the motifs (2), (1-1-3-3), and (1-2). One of the advantages of motifs is that they take into account the sequential structure of a text, which is inevitably lost if, e.g., only frequencies of units are recorded.

Motifs are young linguistic units, inspired by certain analogies between musical compositions (Boroda 1982) and texts. Originally, Boroda (1982) used them to investigate the sequential structure of note duration. In linguistics, they were mentioned for the first time by Köhler (2006), where a sequence of word lengths (measured in syllables) was segmented in this way. However, motifs are neither limited to word length as their constitutive property, nor to syllables as the unit in which word length is measured (e.g., Köhler and Naumann 2009 discuss length motifs built from sentence lengths measured in clauses). There are also papers presenting frequency motifs (Köhler and Naumann 2008, 2010; Altmann 2014), or motifs based on argumentative structures in texts (Beliankou et al. 2013); other possibilities of creating motifs (Köhler 2015) await investigation. It is to be noted that even the definition of the motif itself is open to some variation; e.g., Sanada (2010) uses non-increasing (as opposed to non-decreasing) sequences of word length in an analysis of Japanese texts (she argues that her definition is more appropriate specifically for Japanese, which is a postpositional language).


It is supposed, and corroborated to some degree, that motifs have properties analogous to those of their basic units. Word frequency and motif frequency, e.g., can be modelled with probability distributions from the same family; this is also true for motif length (in words) and word length (in syllables). In addition to the papers mentioned above, further analyses and results can be found in Köhler (2008a,b), Köhler and Naumann (2009), Mačutek (2009), and Milička (2015). Mačutek and Mikros (2015) report that word length motifs abide by the Menzerath-Altmann law, i.e., they display the same behaviour pattern as other, more "traditional" language units.

This paper is focused solely on word length motifs, i.e., motifs of other types and motifs constructed from other basic units will not be considered here. Henceforward, word length motifs will be denoted simply as motifs. Our main aim is to investigate the type-token relation (which describes the development of the relation between the number of all units, i.e., tokens, and the number of different units, i.e., types, observed; TTR henceforward) for motifs. This topic was addressed at least three times in previous motif studies, by Köhler (2006, 2008b) and Köhler and Naumann (2008). In all three cases, the TTR curve is claimed to be of the same type as the one for words, but its parameters seem to attain values different from those typical for words. According to Köhler and Naumann (2008), the parameter values seem to carry a remarkable amount of information on the text genre and the author, although the authors state that they are not likely to discriminate these factors sufficiently when taken alone.

We present the first systematic study of some properties of the TTR for motifs. The study was conducted on material consisting of Ukrainian texts from seven genres (70 texts altogether).

2 Language material

Ukrainian texts were chosen as the language material for this research for two reasons. First, the material was readily available. The texts were taken from the database created within the project "Designing and Constructing a Typologically Balanced Ukrainian Text Database" (see Kelih et al. 2009). The database is divided into several subcorpora according to text genres. In this paper, seven of them were used, namely, belletristic prose, blogs, dramas, scientific papers in the humanities, scientific papers in physics, sermons, and sport reports. Ten texts from each genre were analysed, making a total of 70 texts.


Second, in Ukrainian it is very simple to determine word length in syllables, which is necessary for creating motifs. As this language has no syllabic consonants and no diphthongs, the number of syllables in a word is determined by the number of vowels the word contains. In addition, most of the texts used were already pre-processed and used for other analyses in a paper by Mačutek and Rovenchak (2011), in which canonical word forms were studied. The texts were converted from the grapheme to the phoneme level, and then modified so that every consonant was replaced with C and every vowel with V (see Mačutek and Rovenchak 2011 for a more detailed description of the procedure, including some peculiarities of the grapheme-phoneme relation in the Ukrainian language). In the modified texts, the number of Vs in a word is equal to the number of syllables.
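A minimal sketch of the counting step, assuming the input words have already been recoded to C/V strings as described above (the recoding itself, with its grapheme-phoneme peculiarities, is not reproduced here):

    # Word length in syllables = number of Vs in the recoded word form.
    words <- c("CCVCV", "CVC", "CVCVCV")          # hypothetical recoded word forms
    sapply(strsplit(words, ""), function(w) sum(w == "V"))
    # returns 2 1 3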

3 Results

There are several standard mathematical models for the TTR; Wimmer (2005) presents an overview. We will use the simplest of them here, namely f(x) = x^b, where f(x) is the number of different motifs among the first x motifs in a text and b is a parameter. Notice that it is the value of the parameter b which, according to Köhler and Naumann (2008), could contribute to author and genre discrimination. The goodness of fit between the model and our data was evaluated in terms of the determination coefficient R². In linguistics, the fit is usually considered satisfactory if R² ≥ 0.9. This inequality is true for 55 out of 70 texts; moreover, the determination coefficient is only slightly lower for a further 10 texts. There are only five texts for which R² < 0.87. Complete results together with some basic descriptive statistics can be found in Tab. 1-4 below. For all 70 texts, the following characteristics are presented: the number of words W, the number of all motifs M, the number of different motifs DM, the value of parameter b, and the determination coefficient R²; we emphasise that M is the number of tokens and DM the number of types. Computations were performed with NLREG and R software.
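The original NLREG and R scripts are not reproduced here, but the whole pipeline is short enough to sketch in R: segment a word length sequence into motifs, compute the type-token curve, and fit f(x) = x^b by non-linear least squares. The word length distribution below is synthetic.

    # Minimal sketch: motifs, type-token curve, and the fit of f(x) = x^b.
    set.seed(3)
    lengths <- sample(1:5, 2000, replace = TRUE,
                      prob = c(0.35, 0.30, 0.20, 0.10, 0.05))

    ends   <- c(which(diff(lengths) < 0), length(lengths))   # motif boundaries
    starts <- c(1, head(ends, -1) + 1)
    motifs <- mapply(function(s, e) paste(lengths[s:e], collapse = "-"),
                     starts, ends)

    ttr <- data.frame(x = seq_along(motifs),                 # tokens so far
                      y = cumsum(!duplicated(motifs)))       # types so far

    fit <- nls(y ~ x^b, data = ttr, start = list(b = 0.8))
    coef(fit)                                                # estimate of b
    1 - sum(residuals(fit)^2) / sum((ttr$y - mean(ttr$y))^2) # determination coefficient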


Tab. 1: Descriptive statistics and TTR for belletristic prose and blogs

Belletristic prose
        W     M   DM      b      R²
 1   4451  1631  247  0.751  0.9268
 2   3760  1418  219  0.749  0.9587
 3   5089  1911  263  0.751  0.9343
 4   6175  2243  275  0.740  0.9525
 5   5744  1989  226  0.730  0.7865
 6   6025  2190  277  0.746  0.8778
 7   5959  2078  250  0.738  0.8419
 8   3826  1411  217  0.760  0.9190
 9   4993  1752  208  0.727  0.9300
10   5177  1808  220  0.736  0.8035

Blogs
        W    M   DM      b      R²
 1    524  204   66  0.797  0.9512
 2    456  174   71  0.841  0.9844
 3    630  225   76  0.816  0.9600
 4    455  160   71  0.864  0.9290
 5    436  164   65  0.849  0.9350
 6    456  176   62  0.820  0.9763
 7    552  199   74  0.833  0.9218
 8    518  205   66  0.807  0.9656
 9    577  224   65  0.801  0.9001
10    627  243   69  0.785  0.9026

Tab. 2: Descriptive statistics and TTR for dramas and scientific papers in the humanities

Drama
        W     M   DM      b      R²
 1   1067   285   95  0.812  0.9665
 2   1041   369  113  0.818  0.9477
 3    988   358   96  0.785  0.9292
 4   1899   566  130  0.779  0.9357
 5   1215   442  100  0.779  0.9036
 6   1365   476  119  0.785  0.9815
 7   5201  1885  206  0.714  0.9446
 8   1009   213   78  0.827  0.9608
 9    930   335   99  0.813  0.9009
10   1072   383   94  0.782  0.9572

Humanities
        W     M   DM      b      R²
 1   2969  1188  213  0.768  0.9179
 2   3953  1574  236  0.760  0.8894
 3   2170   876  163  0.759  0.9384
 4   3481  1399  224  0.757  0.9631
 5   3314  1367  229  0.764  0.9614
 6   2568  1450  233  0.762  0.9472
 7   2568  1068  233  0.792  0.9707
 8   2099   850  196  0.801  0.9370
 9   2486  1017  193  0.772  0.8735
10   2145   861  185  0.788  0.9200

Tab. 3: Descriptive statistics and TTR for scientific papers in physics and sermons

Physics
        W     M   DM      b      R²
 1   1856   777  151  0.775  0.9017
 2   2396   940  183  0.774  0.9245
 3   1609   678  158  0.784  0.9787
 4   3322  1391  198  0.742  0.7994
 5   4569  1915  244  0.741  0.8772
 6   2435  1005  185  0.772  0.9255
 7   2046   871  182  0.781  0.8819
 8   3141  1293  223  0.769  0.8935
 9   2451  1036  183  0.765  0.8886
10   1869   756  175  0.799  0.9413

Sermons
        W    M   DM      b      R²
 1   2050  767  118  0.736  0.9302
 2   1674  599  124  0.761  0.9123
 3   1764  671  138  0.771  0.8779
 4   1506  573  121  0.772  0.9058
 5   1235  477  116  0.782  0.9449
 6   1377  502  105  0.753  0.9910
 7   1385  551  128  0.789  0.9318
 8   1358  506  110  0.768  0.9434
 9   1428  516  118  0.776  0.9461
10   1340  507  116  0.776  0.9360

Tab. 4: Descriptive statistics and TTR for sport reports

Sport
        W    M   DM      b      R²
 1    924  351   88  0.783  0.8934
 2    665  268   72  0.791  0.7817
 3    737  293   84  0.798  0.9597
 4   1450  538  122  0.781  0.9488
 5   1008  378   97  0.787  0.9454
 6    718  291   81  0.795  0.9031
 7   1012  388   99  0.780  0.9073
 8   1115  431  104  0.785  0.8858
 9    718  262   85  0.805  0.9592
10    908  337   97  0.796  0.9684

As can be seen in Fig. 1, the parameter values correspond to the findings of Köhler and Naumann (2008). Each genre seems to have its characteristic range; however, the ranges overlap. Thus, the parameter of the TTR carries some information about genre, but other text characteristics will also be needed if one wants to achieve a reasonable discrimination precision.


Fig. 1: Values of parameter b (1. belletristic prose, 2. blogs, 3. dramas, 4. scientific papers in the humanities, 5. scientific papers in physics, 6. sermons, 7. sport reports)

Apart from possible applications in automatic genre (and perhaps also author) discrimination, the results presented in Tab. 1-4 also provide insight into interrelations among text properties which have an impact on motifs. Pearson correlation coefficients between the pairs of text characteristics are very high (Tab. 5).

Tab. 5: Pearson correlation coefficients

          W       M      DM
M     0.988
DM    0.916   0.951
b    -0.794  -0.798  -0.716

The correlation coefficients from Tab. 5 indicate several facts about the text characteristics considered. The longer the texts are, the lower the values of parameter b in the TTR are. As a text gets longer, new motifs appear more and more rarely, which is reflected in a flatter TTR curve (or, in other words, in a lower value of its parameter b). The relation between the text length in words and the value of b is clearly non-linear (Fig. 2). It is, however, questionable whether text length alone can explain the parameter value; some other factors are likely to be at play as well. We postpone modelling the dependence of parameter b on other text properties until more data are available.
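The coefficients in Tab. 5 are plain Pearson correlations over the 70 per-text summaries; with the values of Tab. 1-4 collected in a data frame, the whole matrix is one call in R (only the first three belletristic texts are shown here, as a stand-in for the full data):

    texts <- data.frame(W  = c(4451, 3760, 5089),
                        M  = c(1631, 1418, 1911),
                        DM = c(247, 219, 263),
                        b  = c(0.751, 0.749, 0.751))
    round(cor(texts, method = "pearson"), 3)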

Fig. 2: Relation between the number of words and the value of parameter b

Text length in words and the number of motifs in the text are very strongly linearly correlated, see Fig. 3. It is self-evident that longer texts tend to contain more motifs; the linearity, however, is far more interesting. What is even more intriguing is that the same relation (i.e., linear, but with a different slope) can be observed in five quite long Modern Greek texts (Mačutek and Mikros 2015, Table 1). Admittedly, we have a very modest sample, both in terms of the number of languages and texts, but tentatively we allow ourselves to formulate two hypotheses. First, the relation will be valid for all languages, and, second, the parameters of the linear function will be language specific and inversely proportional to the mean word length.


Fig. 3: Relation between the number of words and the number of motifs

The longer the texts are (measured in words or in motifs), the more different motifs they contain. This statement is again obvious, but the very good fit of the linear function is surprising (Fig. 4). In addition, characteristics of Modern Greek texts provided by Mačutek and Mikros (2015) suggest that the fit remains very good even for texts containing tens of thousands of words. Nevertheless, the relation will most probably be non-linear (a power function fits the data even better), with the non-linearity (almost) unnoticeable for the time being; see Mačutek and Rovenchak (2011) for a similar debate on the (non-)linearity of the relation between the phonemic and syllabic length of canonical word forms.


Fig. 4: Relation between the number of all motifs and the number of different motifs in all texts

To sum up, the parameter of the TTR for motifs and some other motif and text characteristics influence each other in a similar way as properties of other language units do (see Köhler 2005). Text length is one of the most important factors.

4 Conclusion

The TTR for word length motifs can be modelled with the same function as that for words, thus once more confirming that motifs behave analogously to their basic units. The parameter of the model could be used as one of the characteristics of genre, possibly also of author.

We would also like to pose several new questions related to the results achieved here. It seems that the relation between text length and the number of different motifs which occur in the text is almost linear. An inspection of potential parallels between the TTR and motif richness (a measure of motif exploitation rate, Mačutek 2009) might provide an additional point of view on the phenomenon. Further investigations of interrelations among properties of motifs and other language units can lead to embedding motifs into the synergetic model of language (Köhler 2005).


It has recently become popular to compare real texts with random ones, see, e.g., Benešová and Čech (2015), and specifically for some motif properties Mačutek and Mikros (2015) and Milička (2015). For the time being, it remains an open question whether TTRs for word length (and other) motifs in real and random texts differ or not.

Acknowledgement

Supported by the research grant ESF OPVK 2.3 – Linguistic and lexicostatistic analysis in cooperation of linguistics, mathematics, biology and psychology (CZ.1.07/2.3.00/20.0161).

References

Altmann, G. (2014). Supra-sentence Levels. Glottotheory, 5(1), 25–39.
Beliankou, A., Köhler, R., & Naumann, S. (2013). Quantitative Properties of Argumentation Motifs. In I. Obradović, E. Kelih & R. Köhler (Eds.), Methods and Applications of Quantitative Linguistics (pp. 35–43). Beograd: Academic Mind.
Benešová, M., & Čech, R. (2015). Menzerath-Altmann Law Versus Random Model. In G. K. Mikros & J. Mačutek (Eds.), Sequences in Language and Text (pp. 57–69). Berlin/Boston: de Gruyter.
Boroda, M. G. (1982). Häufigkeitsstrukturen musikalischer Texte. In J. K. Orlov, M. G. Boroda & I. Š. Nadarejšvili (Eds.), Sprache, Text, Kunst. Quantitative Analysen (pp. 231–262). Bochum: Brockmeyer.
Kelih, E., Buk, S., Grzybek, P., & Rovenchak, A. (2009). Project Description: Designing and Constructing a Typologically Balanced Ukrainian Text Database. In E. Kelih, V. Levickij & G. Altmann (Eds.), Methods of Text Analysis (pp. 125–132). Chernivtsi: ChNU.
Köhler, R. (2005). Synergetic Linguistics. In R. Köhler, G. Altmann & R. G. Piotrowski (Eds.), Handbook of Quantitative Linguistics (pp. 760–775). Berlin/New York: de Gruyter.
Köhler, R. (2006). The Frequency Distribution of the Lengths of Length Sequences. In J. Genzor & M. Bucková (Eds.), Favete Linguis. Studies in Honour of Viktor Krupa (pp. 145–152). Bratislava: Slovak Academic Press.
Köhler, R. (2008a). Sequences of Linguistic Quantities. Report on a New Unit of Investigation. Glottotheory, 1(1), 115–119.
Köhler, R. (2008b). Word Length in Text. A Study in the Syntagmatic Dimension. In S. Mislovičová (Ed.), Jazyk a jazykoveda v pohybe (pp. 416–421). Bratislava: Veda.
Köhler, R. (2015). Linguistic Motifs. In G. K. Mikros & J. Mačutek (Eds.), Sequences in Language and Text (pp. 89–108). Berlin/Boston: de Gruyter.
Köhler, R., & Naumann, S. (2008). Quantitative Text Analysis Using L-, F- and T-segments. In C. Preisach, H. Burkhardt, L. Schmidt-Thieme & R. Decker (Eds.), Data Analysis, Machine Learning and Applications (pp. 635–646). Berlin/Heidelberg: Springer.


Köhler, R., & Naumann, S. (2009). A Contribution to Quantitative Studies on the Sentence Level. In R. Köhler (Ed.), Issues in Quantitative Linguistics (pp. 34–57). Lüdenscheid: RAM-Verlag.
Köhler, R., & Naumann, S. (2010). A Syntagmatic Approach to Automatic Text Classification. Statistical Properties of F- and L-motifs as Text Characteristics. In P. Grzybek, E. Kelih & J. Mačutek (Eds.), Text and Language. Structures, Functions, Interrelations, Quantitative Perspectives (pp. 81–89). Wien: Praesens.
Mačutek, J. (2009). Motif Richness. In R. Köhler (Ed.), Issues in Quantitative Linguistics (pp. 51–60). Lüdenscheid: RAM-Verlag.
Mačutek, J., & Mikros, G. K. (2015). Menzerath-Altmann Law for Word Length Motifs. In G. K. Mikros & J. Mačutek (Eds.), Sequences in Language and Text (pp. 125–131). Berlin/Boston: de Gruyter.
Mačutek, J., & Rovenchak, A. (2011). Canonical Word Forms: Menzerath-Altmann Law, Phonemic Length and Syllabic Length. In E. Kelih, V. Levickij & Y. Matskulyak (Eds.), Issues in Quantitative Linguistics 2 (pp. 136–147). Lüdenscheid: RAM-Verlag.
Milička, J. (2015). Is the Distribution of L-motifs Inherited from the Word Lengths Distribution? In G. K. Mikros & J. Mačutek (Eds.), Sequences in Language and Text (pp. 133–145). Berlin/Boston: de Gruyter.
Sanada, H. (2010). Distribution of Motifs in Japanese Texts. In P. Grzybek, E. Kelih & J. Mačutek (Eds.), Text and Language. Structures, Functions, Interrelations, Quantitative Perspectives (pp. 181–193). Wien: Praesens.
Wimmer, G. (2005). The Type-token Relation. In R. Köhler, G. Altmann & R. G. Piotrowski (Eds.), Handbook of Quantitative Linguistics (pp. 361–368). Berlin/New York: de Gruyter.

George K. Mikros – Kostas Perifanos

Gender Identification in Modern Greek Tweets

Abstract: The aim of this paper is to analyze tweets written in Modern Greek and develop a robust methodology for identifying the gender of their author. For this reason, we compare three different feature groups (most frequent function words, gender keywords, and Author Multilevel N-gram Profiles) using two different machine learning algorithms (Random Forests and Support Vector Machines) in various text sizes. The best result (0.883 accuracy) was obtained using SVMs trained with the AMNP feature group using 100-word tweet chunks. This methodology can lead to reliable and accurate gender identification results using tweet chunk sizes as small as 50 words each.

Keywords: author profiling, Twitter, Modern Greek, gender identification, Multilevel N-gram Profiles, Support Vector Machines, Random Forests

1 Introduction

Twitter has radically transformed the way information is spread over the Internet, and has created a new language genre with specific linguistic conventions and usage rules. Users form messages in 140 characters or less, producing text that is semantically dense, has many abbreviations, and often carries extra-linguistic information using specific character sequences (smileys, interjections, etc.) (Crystal 2008). The massive user-generated content produced in this platform can be used to uncover various latent user characteristics such as gender (Burger, Henderson, Kim, & Zarrella 2011; Fink, Kopecky, & Morawsky 2012) and ethnicity (Rao, Yarowsky, Shreevats, & Gupta 2010), as well as a number of behavioral patterns that predict political beliefs (Tumasjan, Sprenger, Sandner, & Welpe 2010), brand acceptance (Jansen, Zhang, Sobel, & Chowdury 2009), and movie revenues (Asur & Huberman 2010), among others.

Unlike Facebook, Twitter as a platform does not store user metadata in a structured format. The only obligatory user element is the name, which has already been used as a marker for gender identification, either alone (Mislove, Lehmann, Ahn, Onnela, & Rosenquist 2011) or combined with other textual characteristics (Burger et al. 2011; Liu & Ruths 2013; Rao et al. 2011).


Although the reported results are encouraging, the name marker cannot be used in many application areas such as forensics, where the user deliberately hides his or her identity. Furthermore, the associations of first names with the gender of an individual do not exist for all languages, and in many of them a large portion of names is unisex. Names can also lose their gender marking due to transliteration into the Latin alphabet (e.g., many names used in India and in Indian communities throughout the world).

Tweets, like any text, can carry information related to the gender of their authors. Men's and women's language production differs in many aspects due to both biological and socio-psychological factors. Research in neurobiology (Kimura & Hampson 1994), cognitive science (Kimura 2000) and sociolinguistics (Labov 1982) has documented systematic differences in language usage patterns across genders. Tweets represent a challenging research object for studying gender traces due to their small length (max. 140 characters) and their elliptical linguistic structure.

The present study aims to explore appropriate methods for gender identification in Twitter data. We utilized the Greek Twitter Corpus and selected 10 users based on their popularity (number of followers) and their activity (number of tweets in a month). We trained 2 machine learning classifiers using 3 different feature groups and in 10 different text sizes; this produced 60 different statistical models, which were compared in terms of their gender identification accuracy.

2 Gender Identification in Twitter Users: Previous Work

One of the first and most influential studies of gender identification in tweets was conducted by Rao and his co-authors (Rao et al. 2010). They collected 405,151 tweets produced by 1,000 users (500 from each gender) and calculated a number of features related to the user's network structure (follower-to-following ratio, the number of followers and followees of a user), the user's communication behavior (frequency of retweets and responses), and 3,774 features called 'sociolinguistic' by the authors, containing various character sequences frequently encountered in online communication (smileys, abbreviations, ellipses, etc.). Furthermore, the authors developed an n-gram document representation model using 1- and 2-character n-grams, creating a vector of 1,256,558 features for each tweet. The above-mentioned features were used to train Support Vector Machine (SVM) classifiers for each feature group, and in a second phase their models were combined using


stacking and utilizing an SVM as meta-classifier. The evaluation of the models revealed that the most important feature set for gender identification was the set of sociolinguistic markers, exhibiting an accuracy of 71.76%, while the stacked model was slightly better at 72.33%.

In another study, Miller, Dickinson, and Hu (2012) tried to detect Twitter users' gender following a different approach and using only n-gram vectors. They collected 36,238 tweets produced by 3,000 users and represented each tweet as a vector of character n-grams of increasing order, ranging from 1 to 5. Due to the exponential increase of n-gram types in the higher-order n-gram representations, the authors used 6 different feature-selection algorithms and extracted only the features that were selected by at least 4 of them. Furthermore, 6 different datasets were created based on the minimum length of each tweet in order to examine the impact of tweet length on gender identification accuracy. For the classification purposes, two stream mining algorithms were used, the perceptron and Naïve Bayes. The perceptron performed relatively well, with very high precision (97%) and a balanced accuracy of 94%, but it was outperformed by Naïve Bayes, which scored between 90% and 100% on all metrics.

Fink et al. (2012) tried to predict Twitter users' gender, crawling 18.5 million tweets written in English from 11,155 users, all located in Nigeria. All tweets from each user were merged into one text, and this was represented by 3 different feature groups: unigrams, hash tags, and psychometric properties derived from the Linguistic Inquiry and Word Count (LIWC) text analysis program (Pennebaker & Francis 2001). For the classification task, an SVM with a linear kernel was used and was trained separately on each feature group and on all their possible combinations. Unigrams scored very high in terms of gender identification accuracy (80.5%) and were only marginally lower than using all 3 feature groups (80.6%), indicating that the hash tag and LIWC features added relatively little performance boost over simply using unigrams. The hash tag features did very poorly by themselves, giving low accuracy results compared with the other feature sets alone.

Burger et al. (2011) created a corpus of 4,102,434 tweets written by 183,729 users in at least 14 different languages. This is a subsample of a bigger corpus, which was filtered so that only Twitter users who have a linked blog account with information about their gender remain. The features measured were word n-grams from 1 to 2 and character n-grams from 1 to 5, totaling a vector of 15,572,522 features, in both the content of the tweet and the username or screen name appearing in the user's account. The authors used a Balanced Winnow2 classification algorithm due to its speed and efficiency in the training phase of their sparse high-dimensional vectors. Results indicated that using combined n-grams from the tweet, the screen name, and the full name of the user yields the best accuracy in gender identification (91.8%). The full name as a feature was found to be highly


informative, achieving 89.1% accuracy. Using only tweet texts performs better than using only the user description (75.5% vs. 71.2%). It appears that the tweet text conveys more about a Twitter user's gender than his or her own self-description. Even a single (randomly selected) tweet text contains some gender-indicative information (67.2%). The model performance was then compared to the human ability to detect the authors' gender. The human accuracy on average was very low compared to the statistical model previously developed. Moreover, only 5% of the individuals participating in the experiment scored better than the model.

In a recent paper, Bamman, Eisenstein, & Schnoebelen (2014) discuss the gender identification problem from a wider perspective that encompasses both machine learning and a sociolinguistic methodological paradigm. The authors collected 9,212,118 tweets written in English by 14,464 American users who had at least 4 reciprocal connections (@ tag) with someone else from the users sampled. The document representation was based on a bag-of-words model which included the 10,000 most frequent words of the corpus. Using regularized logistic regression as a classification algorithm, the gender identification model achieved 88% accuracy. Furthermore, the study examined which words correlate better with each gender and found that, in general, pronouns, emotion and kinship terms, and abbreviations like lol and OMG appear as female markers, as do ellipses, expressive lengthening (e.g. coooooool), exclamation marks, question marks, and backchannel sounds like ah, hmmm, ugh, and grr. In order to reverse the supervised learning paradigm initially used, which takes gender as a constant property of an author, the researchers used the same 10,000 words as features and applied probabilistic clustering so that word usage alone could classify the authors into 20 clusters of homogeneous linguistic behavior. The clusters so formed group authors based on patterns of lexical co-occurrence, and can be approached in terms of verbal repertoires, which involve a complex intersection of social positions, social identities, styles, and topics. The word frequencies analyzed in each cluster revealed that it is the topic of a discourse that drives the usage rates of specific category words, and all authors adjust their linguistic behavior accordingly, regardless of their gender.


3 Research Methodology

3.1 Research Aims

Our study focuses on gender identification in small fragments of texts (tweets) written in Modern Greek and published on the micro-blogging platform Twitter. More specifically, our research aims are:
– to perform gender identification experiments in tweets written in Modern Greek;
– to explore the effectiveness of various document representation methods in tweets' gender identification, specifically gender-specific keywords, the corpus's most frequent words, and the Author Multilevel N-gram Profile (AMNP), which consists of a combined vector of increasing size and different level n-grams;
– to compare state-of-the-art machine learning methods for text classification (Random Forests and Support Vector Machines); and
– to investigate the effect of text size on the effectiveness of both features and algorithms.

3.2 Corpus and Experimental Datasets

For the requirements of our research, we used the Greek Twitter Corpus (GTC), originally developed for evaluating authorship attribution in Greek tweets (Mikros & Perifanos, 2013). GTC was created using trending.gr, a free service that monitors the activity of Greek users on Twitter and publishes many statistics regarding their activity, including the top users in followers, mentions, etc. We selected 10 users (5 men and 5 women) based on their popularity (number of followers) and their activity (number of tweets in a month). In order to extract tweets from specific users we used the twitteR R package. The Twitter API can only return up to 3,200 statuses per account, including native retweets in this total. To cope with the API's rate limit restrictions, an incremental approach was adopted, keeping track of the most recent tweet ID per author and repeating the whole procedure at frequent time intervals; a sketch of this approach is given below. The descriptive statistics of the Greek Twitter Corpus (GTC) are displayed in Tab. 1.
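A minimal sketch of the incremental crawling idea with the twitteR package follows; the handle, the storage of the newest ID, and the scheduling are placeholders, and the authentication call is only indicated.

    # Minimal sketch of incremental crawling with twitteR (placeholders only).
    library(twitteR)
    # setup_twitter_oauth("key", "secret", "token", "token_secret")

    crawl_user <- function(user, since_id = NULL) {
      # the API returns at most 3,200 statuses per account
      tweets <- userTimeline(user, n = 3200, sinceID = since_id)
      if (length(tweets) == 0) return(NULL)
      df <- twListToDF(tweets)
      attr(df, "newest_id") <- df$id[1]   # remember for the next run
      df
    }
    # batch <- crawl_user("some_user")    # repeated at frequent time intervals,
    #                                     # passing the stored newest_id as since_id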


Tab. 1: GTC descriptive statistics

Gender   Author    No of Tweets   Total size (words)   Average size (words)   St. Dev.
Male     A                500           5,378                 10.75              5.42
         B                918          10,515                 11.45              5.52
         C              2,065          32,098                 15.54              6.73
         D                455           7,451                 16.57              5.48
         E              1,347           9,822                  7.29              5.01
         SubTotal       5,285          65,264
Female   F                535           3,692                  6.90              4.93
         G              1,277           9,412                  7.37              5.63
         H              2,306          26,212                 11.36              5.86
         I              2,986          18,720                  6.26              4.28
         J                584           7,618                 13.06              6.74
         SubTotal       7,688          65,654
Total                  12,973         130,918

In order to test the effect of text size on the gender identification accuracy of our models, we used GTC to create 10 different datasets which contained merged tweets in increasing 10-word text sizes (i.e., 10, 20, 30, …, 100 words). Thus, the 10-word dataset contains text fragments of 10 words each. These fragments are segmented sequentially from the first word of the GTC and do not correspond to single tweets; the chunking step can be sketched as follows.
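A minimal sketch of this chunking (the tokenization and the treatment of the incomplete final chunk are assumptions of the sketch):

    # Segment a token stream into sequential n-word chunks.
    chunk_corpus <- function(tokens, n) {
      full <- floor(length(tokens) / n) * n          # drop the incomplete tail
      split(tokens[seq_len(full)], rep(seq_len(full / n), each = n))
    }
    tokens <- unlist(strsplit("a b c d e f g h i j k l", " "))
    chunk_corpus(tokens, 5)                          # two 5-word chunks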

3.3 Twitter-specific Characteristics Distribution across Gender Categories

Twitter makes available various devices for "pointing" to different extra- or inter-textual entities, such as social relationships (mentions '@'), other tweets (retweet 'RT'), other tweets with a similar topic (hash tag '#'), webpages and other data on the Web (http links), and psychological states and discourse strategies (emoticons). Previous research (see Section 2) has shown that a number of these devices display a highly skewed usage distribution across gender. In order to confirm this finding in our data, we examined the female-to-male odds ratio of these Twitter-specific characteristics. The odds ratios are depicted in Fig. 1:


Fig. 1: Female-to-male odds ratios of Twitter-specific characteristics

Emoticons seem to be a highly indicative marker for female users, since they are 140 times more likely to be used in tweets written by women than by men. All other Twitter-specific characteristics are also used more by women than by men, with the odds ratios ranging from 6.4 (topics – hash tags) down to 2.2 (non-alphanumeric characters). This last category includes many characters that form emoticons. The exclamation mark is the most prominent character in terms of female usage (9.1 times more likely to be used by a female user), followed by the colon (8.3 times) and the dash (6.1 times). The colon, followed by the (optional) dash and the closing parenthesis, forms the happy emoticon ':-)', which belongs to the most feminine markers. It is interesting, however, that the opening parenthesis, which can be used to form the sad emoticon ':-(', does not carry positive female weighting, meaning that in general female tweets tend to be more sentimentally positive, a research finding which has already been noted (Bamman et al. 2014; Fink et al. 2012; Rao et al. 2010).

Although these Twitter-specific textual devices are very powerful gender indicators, we filtered them out of our corpus during the preprocessing stage. The main reason is that we want to develop a gender identification methodology that is based exclusively on linguistic features. We want to focus only on purely linguistic markers and infer the author's gender using only the tweet's linguistic content. Such a model can be very useful in many application areas, such as forensics, where an author may hide most extra-linguistic links that could possibly be used to identify him or her.
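For reference, the ratios in Fig. 1 can be computed as plain odds ratios; the counts below are made up for illustration (the paper's exact counting units are not reproduced), but the arithmetic is the same:

    # Female-to-male odds ratio for one Twitter-specific device.
    f_use <- 700; f_total <- 65654      # hypothetical emoticon count, female words
    m_use <- 5;   m_total <- 65264      # hypothetical emoticon count, male words
    odds  <- function(k, n) (k / n) / (1 - k / n)
    odds(f_use, f_total) / odds(m_use, m_total)   # about 140 with these counts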

3.4 Features and Machine Learning Algorithms

In order to investigate our research aims, we constructed 3 datasets, each based on different features. The feature groups compared are the following:

1. Author Multilevel N-gram Profiles (AMNP) (5,000 features, 1,000 features from each n-gram category)
   a) 2-, 3-, and 4-grams (character level n-grams)
   b) 2- and 3-grams (word level n-grams)
2. Most Frequent Words in the corpus (1,000 features)
3. Gender Related Keywords (60 features)

The first feature group (AMNP) provides a robust document representation which is language independent and can capture various aspects of stylistic textual information. It has been used effectively in authorship attribution problems (Mikros & Perifanos 2011, 2013) and in gender identification focused on bigger texts (blog posts) (Mikros 2013). AMNP consists of increasing-order n-grams on both the character and the word level. Since character and word n-grams capture different linguistic entities and function in a complementary way, we constructed a combined profile of 2-, 3-, and 4-character n-grams and 2- and 3-word n-grams. For each n-gram we calculated its normalized frequency in the corpus and used the 1,000 most frequent, resulting in a combined vector of 5,000 features.

The second and the third feature groups belong to the word level. The second one (most frequent words) can be considered classic in the stylometric tradition and is based on the idea that the most frequent words belong to the functional word class and are beyond the conscious control of the author, thus revealing a stylometric writeprint. In this study we used the 1,000 most frequent words of the corpus. The third feature group (Gender Related Keywords) is based on automatic keyword extraction research. We compare the frequency wordlist of the men against the frequency wordlist of the women, using log-likelihood as a statistical test. We extract the 30 most distinctive words of women and the corresponding 30 most marked words of men. This method has been used previously as a feature selection method in authorship attribution problems (Mikros 2006, 2007) and has proved to be superior to the classic most-frequent-words model.

Each one of the 3 feature groups was used for training 2 different classification machine learning algorithms, Support Vector Machines (SVM) (Vapnik 1995) and Random Forests (RF) (Breiman 2001). Both SVM and RF are considered state-of-the-art algorithms for text classification tasks. SVMs construct hyper-planes in the feature space in order to provide a linear solution to the classification problem, while RF grows multiple decision trees using random sub-samples of both features and cases, deciding the class of an unknown instance by a majority voting scheme. All statistical models developed have been evaluated using 10-fold cross-validation (90% training set – 10% test set), and the accuracies reported represent the mean of the accuracies obtained in each fold.
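A minimal sketch of an AMNP-style profile is given below; the helper functions and the whitespace tokenization are assumptions of the sketch, not the authors' extraction code.

    # Character and word n-grams, normalized frequencies, top 1,000 per level.
    char_ngrams <- function(txt, n) {
      ch <- strsplit(txt, "")[[1]]
      if (length(ch) < n) return(character(0))
      sapply(seq_len(length(ch) - n + 1),
             function(i) paste(ch[i:(i + n - 1)], collapse = ""))
    }
    word_ngrams <- function(txt, n) {
      w <- strsplit(txt, "\\s+")[[1]]
      if (length(w) < n) return(character(0))
      sapply(seq_len(length(w) - n + 1),
             function(i) paste(w[i:(i + n - 1)], collapse = " "))
    }
    amnp <- function(txt, top = 1000) {
      level <- function(grams) {
        tab <- sort(table(grams), decreasing = TRUE)
        head(tab / sum(tab), top)                  # normalized frequencies
      }
      c(lapply(2:4, function(n) level(char_ngrams(txt, n))),   # character level
        lapply(2:3, function(n) level(word_ngrams(txt, n))))   # word level
    }
    # profile <- amnp("a tiny example text a tiny example")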


4 Results

We developed 10 different datasets (1 for each of the 10 text sizes defined), calculating 3 different feature groups in each one, and used them to train 2 different algorithms, producing 60 different statistical models for gender identification. The best performance (0.883) was obtained by an SVM trained on 100-word text fragments using AMNP. In order to answer our research questions we examined the reported accuracies under various experimental conditions. The accuracies achieved by the 2 classification algorithms and the 3 feature groups across different text sizes are displayed in Fig. 2.

Fig. 2: Accuracies in gender identification using the 3 different feature groups (mean based on accuracies reported by both classifiers)

We observe that the 3 feature groups behave differently across different text sizes. However, the AMNP representation seems to achieve better accuracies across all text sizes compared to the other two feature groups. In order to test this hypothesis, we performed a two-way ANOVA with the reported accuracy as the dependent variable and the feature group and text size as independent variables. Both feature group and text size were found to be statistically significant (p = 0.028 and p = 0.000 respectively). This result further supports the usefulness of multilevel n-gram profiles for author profiling tasks. We also compared the performance of the two classification algorithms (SVM and RF) across all text sizes. The results are displayed in Fig. 3.
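The ANOVA step itself is a one-liner in R; the sketch below uses placeholder accuracies arranged in the 3 × 10 × 2 design described above (the actual 60 values are those summarized in Fig. 2 and 3):

    # Two-way ANOVA on the 60 model accuracies (placeholder values).
    set.seed(5)
    acc <- data.frame(
      accuracy = runif(60, 0.6, 0.9),                         # placeholders
      feature  = gl(3, 20, labels = c("AMNP", "MFW", "GKW")),
      size     = factor(rep(seq(10, 100, by = 10), times = 6))
    )
    summary(aov(accuracy ~ feature + size, data = acc))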


Fig. 3: Accuracies in gender identification using the 2 classification algorithms (mean based on accuracies reported by all three feature groups)

The SVM algorithm performs better across all text sizes compared to RF. A two-way ANOVA with the reported accuracy as the dependent variable and text size and classification algorithm as independent variables confirms the visual impression, since both independent variables were found to be statistically significant.

Sven Naumann

Syntactic Complexity in Quantitative Linguistics

The expected ranking English > German > Hungarian is supported by the data reported in Köhler (2012) and Köhler and Naumann (2013).

Tab. 1: Values of parameter k and m based on the corpus data used in Köhler (2012) and Köhler and Naumann (2013)

      English   German   Hungarian
k      0.0054     2.64       4.21
m      0.0016     0.052      5.81

But these results can only be considered preliminary. More empirical data are called for. Only four of the documents from the Szeged corpus conformed to the hyper-Pascal distribution; the values in the last row of Tab. 1 are the averages of the values computed for these four texts. Additionally, a few questions remain. The data lead to the desired ranking of the three languages, but the k- and m-values are highest for Hungarian and lowest for English – not the other way around as one would have expected. Further work has to address both issues.

3 Syntactic Complexity

Though a large number of publications discuss the concept of syntactic complexity and possible ways to measure it, there is no generally accepted and theoretically well-motivated definition of this concept. It seems plausible to assume that this situation is not going to change in the near future, as it is not even clear what phenomena this term is supposed to cover. Does it refer to properties of the human language processing system (cognitive complexity) or to properties of the structures attributed to sentences (structural complexity)? Or does it try to shed light on the functional role certain words and constructions carry in language production and comprehension (grammatical/syntactic weight)?

Even if one concentrates on structural complexity, there are a number of alternatives. Let us name just the most well-known ones. It has been suggested that the (structural) complexity of a construction be identified with its length (the number of words or word-forms, e.g., Hawkins 1990) or with the number of (all or non-terminal) nodes in the structure (Frazier 1985, Hawkins 1995). Preliminary empirical tests seem to indicate that both measures lead to nearly identical results (Szmrecsanyi 2004). Lately, a number of linguists have proposed to measure language


complexity in general, and syntactic complexity in particular, by applying an information-theoretic measure (Kolmogorov complexity, see Juola 1998, 2008, Bane 2008). Given a corpus c, Ehret and Szmrecsanyi (2014) propose to compute a syntactic complexity score for c as comp(dis(c))/comp(c): dis(c) refers to a distorted variant of c which is generated by randomly deleting 10% of the word tokens in c, and comp(x) denotes the size of a data collection x after applying a standard file compression algorithm to x.

Regardless of the fact that it seems impossible to come up with a definition of syntactic complexity (as structural complexity) that finds general consent, I would like to conjecture that there are at least two implicit assumptions shared by most linguists that an adequate definition should meet. First, longer constructs tend to be more complex than shorter ones. Second, deeper constructs tend to be more complex than shallower ones. If we take both assumptions for granted, Köhler's complexity concept can be criticised as insufficient because it is designed just for local trees (i.e., trees of depth 1).² For the re-evaluation of KCM (Section 4), the following four heuristic complexity measures were used (3):

a. Number of immediate sub-constituents (IC)
b. Number of terminals (L)
c. 0.5 * L + 0.5 * DMAX (LDM)
d. 0.5 * L + 0.5 * DAV (LDA)

IC is the complexity measure proposed by Köhler. L is similar to IC in focusing on length. LDM and LDA both take length and depth into consideration: given a syntactic unit c, its depth is determined either as the length of the longest path in c (DMAX, used in LDM) or as the average length of all paths in c (DAV, used in LDA). Both factors are given the same weight.

² However, complexity is only one of the properties he investigates; for depth, length, and position, separate models are built.


Fig. 1: Partial syntactic structure taken from the SUSANNE corpus

Let us turn to a brief example to illustrate the different behavior of the four measures. Take a look at the nominal phrase the jury (NPS), the prepositional phrase in term-end presentments (PPS), and the nominal phrase term-end presentments (NPPP) in Fig. 1. The complexity values for the three phrases are displayed in Tab. 2. IC assigns the same value to all of them, while the other measures rank them the same way (PPS > NPPP > NPS), differing only in granularity.

Tab. 2: Complexity values obtained by the four complexity measures

        NPS   PPS   NPPP
IC       2     2     2
L        2     5     4
LDM      2     4.5   3.5
LDA      2     4.1   3.375
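All four measures are easy to compute on a constituent tree. The sketch below uses a hypothetical nested-list encoding of a PP similar to the example; the encoding and the resulting numbers are illustrative and do not reproduce the SUSANNE annotation or the exact values of Tab. 2.

    # The four heuristic complexity measures on a small constituent tree.
    # A constituent is a list(label, children); a terminal is a string.
    tree <- list(label = "PP", children = list(
      "in",
      list(label = "NP", children = list("term-end", "presentments"))
    ))

    n_terminals <- function(node) {                     # measure L
      if (is.character(node)) return(1)
      sum(sapply(node$children, n_terminals))
    }
    ic <- function(node) length(node$children)          # measure IC

    path_lengths <- function(node, d = 0) {             # root-to-terminal paths
      if (is.character(node)) return(d)
      unlist(lapply(node$children, path_lengths, d = d + 1))
    }
    ldm <- function(node)                               # measure LDM
      0.5 * n_terminals(node) + 0.5 * max(path_lengths(node))
    lda <- function(node)                               # measure LDA
      0.5 * n_terminals(node) + 0.5 * mean(path_lengths(node))

    c(IC = ic(tree), L = n_terminals(tree), LDM = ldm(tree), LDA = lda(tree))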


4 Empirical Investigations

The tests of KCM reported in Köhler and Naumann (2013) revealed some problems with the data taken from the Hungarian Szeged corpus. The number of documents to which the hyper-Pascal distribution could be fitted was rather small. As only a small part of the corpus had been used for the tests at that time, it seemed sensible to perform further tests. The results presented here were obtained by using the Altmann-Fitter to search for the distributions that provided the best fit for the data sets at hand.
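The Altmann-Fitter is a dedicated tool, and the special distributions discussed below (e.g. the hyper-Pascal) are not part of the standard R libraries, but the general fitting procedure can be sketched with an ordinary distribution as a stand-in; the synthetic counts and the choice of the negative binomial are assumptions of the sketch.

    # Maximum-likelihood fit of a candidate distribution to complexity counts.
    library(MASS)
    set.seed(4)
    complexity <- rnbinom(1000, size = 2, mu = 3) + 1   # synthetic complexity values
    fit <- fitdistr(complexity, "negative binomial")
    fit$estimate                                        # fitted parameters (size, mu)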

4.1 The Szeged Treebank

The Szeged Treebank is a collection of 15 sub-corpora containing documents on six different topics: fiction, student essays, newspaper articles, legal documents, computer science articles, and business news. It contains 82,000 sentences (≈1,200,000 word forms) that received both a phrase structure and a dependency structure annotation. All subcorpora except the two containing fiction and student essays³ were used for the tests, resulting in a corpus of 71,300 sentences and 1,331,500 constituents (≈18.67 constituents per sentence). Additionally, 11 newspaper articles were selected, because corpora do not constitute natural linguistic objects and KCM might produce different results for sub-corpora and single documents. In both cases the results remained disappointing, no matter which complexity measure was chosen.

4.1.1 Sub-Corpora

Only 5 of the 13 subcorpora showed good or at least acceptable results (2/3)⁴ with the hyper-Pascal distribution. The Poisson binomial distribution yielded the best results: for 12 subcorpora we got good or acceptable results (11/1). A number of other well-known distributions worked better than the hyper-Pascal distribution: negative binomial Poisson at 10/13 (9/1), mixed geometric at 6/13 (6/0), and negative hypergeometric at 6/13 (6/0).

³ They were excluded because the corpora for the other languages did not include data of this type.
⁴ Here, an expression of the form x/y (x'/x'') is used to express that a given distribution could be fitted to x out of y data sets, with x' (x'') data sets leading to a good (an acceptable) fit (i.e. x = x' + x'').

4.1.2 Individual documents

Applying the four complexity measures to the eleven randomly selected texts led to a slightly different picture (Tab. 3), though the overall tendency did not change.

Tab. 3: Fitting results for the 11 documents selected from the Szeged corpus. The two best fits for each measure are marked with an asterisk. HyPa: hyper-Pascal distribution; Mx-Geo: mixed geometric distribution; PoBi: Poisson binomial distribution; NeHyGe: negative hypergeometric distribution; NeBiPo: negative binomial Poisson distribution.

        HyPa          Mx-Geo        PoBi          NeHyGe        NeBiPo
IC      6/11 (4/2)    7/11 (6/1)    5/11 (4/1)    8/11 (6/2)*   9/11 (8/1)*
L       2/11 (2/0)    9/11 (9/0)*   7/11 (6/1)    8/11 (8/0)*   3/11 (2/1)
LDM     5/11 (3/2)    9/11 (8/1)*   7/11 (4/3)*   5/11 (4/1)    2/11 (2/0)
LDA     2/11 (1/1)    5/11 (4/1)*   6/11 (6/0)*   1/11 (1/0)    1/11 (0/1)

Still, one could argue that these results are due to special characteristics of the Hungarian language, the documents that were used to compile the corpus, or the way the documents were annotated. Therefore, as a next step we tried to replicate the positive results for English and German using larger data sets (cf. Köhler 2012).

4.2 The Penn Treebank 2

For English, the Wall Street Journal section (WSJ section, for short) of the Penn Treebank 2 corpus was used. The WSJ section contains about 47,000 sentences with 2,136,000 constituents (≈45.44 constituents per sentence). The Penn Treebank 2 uses a phrase structure annotation. Again, we contrasted the results obtained for the whole section with the results acquired for 20 articles randomly selected from the section. Because the Szeged Treebank does not include punctuation marks, two versions of the WSJ section and the articles were used: an unmodified version (WSJ+) and one with all punctuation marks removed (WSJ-).


4.2.1 The WSJ section

The four complexity measures were applied to WSJ+ and WSJ-, thus generating 8 data sets. The hyper-Pascal distribution could be fitted to only 2 of the 8 data sets: WSJ-/IC and WSJ-/LDM. All other distributions that led to good results with the Hungarian data (except for the negative binomial Poisson distribution) outperformed the hyper-Pascal distribution: in each case the distribution could be fitted to 5 of the 8 data sets. The impact of the punctuation marks on the results remains to be clarified. While the hyper-Pascal distribution could only be fitted to data sets based on WSJ-, there is no clear pattern for the other distributions. 3 of the 5 successful data sets for the negative hypergeometric and the Poisson binomial distribution are based on WSJ- and 2 on WSJ+. For the mixed geometric distribution, it is just the other way around.

4.2.2 Individual documents

Looking at the individual documents, we see that only by using IC do we receive a good result for the hyper-Pascal distribution. But in general, other distributions, especially the negative hypergeometric distribution, led to better results.

Tab. 4: Fitting results for the 20 documents selected from the Penn Treebank 2. The two numbers denote the number of documents whose data the distribution could be fitted to. The first number takes the data sets computed for a document and its punctuation-free variant as manifestations of the same event (x ≤ 20); the second considers them different events (x ≤ 40), i.e. 20 – 40 represents a perfect result.

        HyPa       Mx-Geo     PoBi       NeHyGe     NeBiPo
IC      20 – 32    19 – 27    20 – 28    20 – 30    20 – 36
L        8 – 10     5 – 8      3 – 3     20 – 40     1 – 1
LDM      5 – 7     17 – 34     3 – 3     20 – 40     3 – 3
LDA      0 – 0     20 – 38     5 – 5     17 – 32     6 – 6


4.3 The TüBa-D/Z Treebank

The TüBa-D/Z is currently the largest and best annotated treebank for German. It is a newspaper corpus that contains a selection of articles from the newspaper "Die Tageszeitung" (taz): around 85,000 sentences or 1,600,000 word forms. There are annotations for morphological features, grammatical functions, named entities, anaphora, etc. The test corpus used here comprises 65,000 sentences or 3,500,000 constituents (≈ 53.84 constituents per sentence). 10 articles were used for testing KCM. As before, a punctuation-free copy (TüBa-D/Z-) of the test corpus (TüBa-D/Z+) and the articles was generated.

4.3.1 Corpus

Only one setting produced good results with the hyper-Pascal distribution: TüBa-D/Z- + LDA. The best results were obtained for the Poisson binomial distribution; this distribution could be fitted to all data sets. All the other distributions considered worked better than the hyper-Pascal distribution: negative binomial Poisson at 2/8 (0/2), mixed geometric at 4/8 (4/0), and negative hypergeometric at 4/8 (3/1).

4.3.2 Individual documents

No matter which complexity measure we use, the hyper-Pascal distribution can be fitted to only a few data sets. Except for the Poisson binomial distribution, which worked so well for the whole corpus, all other distributions lead to much better results, and again the negative hypergeometric distribution leads the field.

Tab. 5: Fitting results for the 10 documents selected from the TüBa-D/Z corpus. As in Tab. 4, the two numbers denote the number of documents to whose data the distribution could be fitted. Since we selected only 10 documents, 10–20 represents a perfect result.

      | HyPa | Mx-Geo | PoBi  | NeHyGe | NeBiPo
IC    | 7–14 | 10–16  | 10–14 | 10–20  | 10–20
L     | 1–1  | 6–8    | 0–0   | 8–17   | 6–6
LDM   | 2–2  | 10–19  | 1–1   | 4–6    | 3–3
LDA   | 8–10 | 9–18   | 1–1   | 10–18  | 5–5


5 Conclusion

The central aim of the study reported on in this paper was to investigate whether the language-specific variations of the parameter obtained for KCM on the basis of the Hungarian, German, and English data reflect the specific way grammatical information is coded in the language, and to seek further empirical confirmation of the model by using larger data sets for the three languages considered so far. However, it was not possible to substantiate these claims. On the contrary, the results presented above seem to indicate that the problems with the Hungarian data reported in Köhler and Naumann (2013) are not due to properties of the language or annotation, and that it seems reasonable to question the validity of the model as such.

By combining a suitable complexity measure with the right representation of the linguistic data (with or without punctuation marks), it is possible to generate a data set for almost every document or (sub-)corpus to which the hyper-Pascal distribution can be fitted. But it is obvious that this 'multiple-choice approach' cannot be considered an adequate solution. A closer look at the results reveals that in most cases the hyper-Pascal distribution cannot be fitted to the data. In general, other distributions, especially the negative hypergeometric distribution, offer a far better performance regardless of which complexity measure is chosen.

These findings are far from conclusive, and more empirical work is necessary. But if further investigations substantiate the results presented here, a revision of KCM (e.g. by re-adjusting the interaction of the four requirements or by introducing a further requirement) seems inevitable.


References



Altmann, G., & Köhler, R. (2000). Probability Distributions of Syntactic Units and Properties. Journal of Quantitative Linguistics, 7, 189–200.
Bane, M. (2008). Quantifying and Measuring Morphological Complexity. In Proceedings of the 26th West Coast Conference on Formal Linguistics (pp. 67–76).
Ehret, K., & Szmrecsanyi, B. (2014). An Information-Theoretic Approach to Assess Linguistic Complexity. In R. Baechler & G. Seiler (Eds.), Complexity and Isolation. Berlin: de Gruyter.
Frazier, L. (1985). Syntactic Complexity. In D. R. Dowty, L. Karttunen & A. M. Zwicky (Eds.), Natural Language Parsing: Psychological, Computational, and Theoretical Perspectives (pp. 129–189). Cambridge: Cambridge University Press.
Gao, S., Zhang, H., & Liu, H. (2014). Synergetic Properties of Chinese Verb Valency. Journal of Quantitative Linguistics, 21, 1–21.
Hawkins, J. A. (1990). A Parsing Theory of Word Order Universals. Linguistic Inquiry, 21(2), 223–261.
Hawkins, J. A. (1995). A Performance Theory of Order and Constituency. Cambridge: Cambridge University Press.
Juola, P. (1998). Measuring Linguistic Complexity: the Morphological Tier. Journal of Quantitative Linguistics, 5(3), 206–213.
Juola, P. (2008). Assessing Linguistic Complexity. In M. Miestamo, K. Sinnemäki & F. Karlsson (Eds.), Language Complexity: Typology, Contact, Change. Amsterdam/Philadelphia: Benjamins.
Köhler, R. (1999). Syntactic Structures: Properties and Interrelations. Journal of Quantitative Linguistics, 6, 46–57.
Köhler, R. (2012). Quantitative Syntax Analysis. Berlin/New York: de Gruyter.
Köhler, R., & Naumann, S. (2013). Syntactic Complexity and Position in Hungarian. Glottometrics, 26, 27–37.
Liu, H. (2011). Quantitative Properties of English Verb Valency. Journal of Quantitative Linguistics, 18(3), 207–233.
Szmrecsanyi, B. (2004). On Operationalizing Syntactic Complexity. In G. Purnelle, C. Fairon & A. Dister (Eds.), Le poids des mots. Proceedings of the 7th International Conference on Textual Data Statistical Analysis, Louvain-la-Neuve, March 10–12, 2004 (Vol. 2, pp. 1032–1039). Louvain-la-Neuve: Presses universitaires de Louvain.
Vincze, V. (2014). Valency Frames in a Hungarian Corpus. Journal of Quantitative Linguistics, 21, 153–176.

Adriana S. Pagano1 – Giacomo P. Figueredo2 – Annabelle Lukin3

1 Federal University of Minas Gerais; [email protected]
2 Federal University of Ouro Preto; [email protected]
3 Macquarie University; [email protected]

Measuring Proximity Between Source and Target Texts: an Exploratory Study

Abstract: This paper reports the results of an ongoing exploratory study aimed at investigating source-target text relations as computed through statistical methods for a manually annotated representative text sample. The purpose was to compare results obtained from clustering texts based on paragraph, sentence, and word counts with those obtained from clustering based on variables built on theory-informed categories of functions realized in the grammar of each language system. The corpus used is made up of ten different translations of an English source text into Spanish and Portuguese. The results point to different patterns, with one common group obtained on the basis of the two methods.

Keywords: translational corpus; cluster analysis; source-target text comparison; systemic functional theory

1 Introduction

Quantitative approaches to text analysis in translational corpora have traditionally relied on descriptive statistics of data on paragraph, sentence, and word-based counts—including mean length and type/token ratio—as measures to compare source and target texts (Oakes 2012). More recently, and drawing on stylometry, researchers have pursued other methods to explore translated texts, as in the case of automatic clustering of texts on the basis of multivariate statistics. Cluster analysis, for instance, allows processing raw and unannotated text in order to discover patterns that would otherwise remain unseen if carried out by analysts relying on their perceptive skills or their own eyeballing of numerical data (Gries & Wulff 2012). Clustering methods used in stylometry (Rybicki & Eder 2011; Rybicki 2012; Rybicki & Heydel 2013) and empirical approaches to translation (Oakes 2012) resort to word frequency to profile texts in terms of the degree of similarity between them. While these no doubt prove useful for same-language comparisons, they are less productive for cross-linguistic comparisons, as in the case of translations of a single source text in different languages, a felt need in translation studies.

Since words have a limited potential for automatic cross-linguistic comparisons, and counts of automatically defined units such as paragraphs and sentences are features of text organization that ultimately demand human analysis for an interpretation of observed differences, alternative methodologies need to be pursued through human-machine shared labor. One such methodology, as posited here, is to explore patterns of text organization not only in terms of language expression and lexical items, but also in terms of grammar functions realized by expression (lexical items being exponents of grammatical choices) and annotated for that purpose.

Systemic functional linguistics (SFL) as developed by M.A.K. Halliday (Halliday & Matthiessen 2014) is a comprehensive theory of language that offers a descriptive and interpretive framework for making informed claims about how texts are organized and what kinds of meanings are made through particular language choices. SFL analytical methodology permits the annotation of texts and the comparison of patterns therein built regardless of the language the texts have been produced in. In this sense, SFL categories used for annotation of text samples (as qualitative variables) can be quantified and computed to obtain a metafunctional profile of each text. Clustering techniques can then be used to examine degrees of similarity between the texts, and these can be interpreted in terms of the aim envisaged by the analyst. This analytical framework was developed in Pagano, Figueredo & Lukin (2014), where details can be found regarding the theoretical and methodological rationale.

This paper reports the results of a case study aimed at investigating source-target text relations as computed through statistical methods for a manually annotated representative text sample. The main purpose was to compare results obtained from clustering texts based on traditional measures of paragraph, sentence, and word counts with those obtained from clustering based on frequency of values attributed to text samples for variables built on theory-informed categories of functions realized by choices in the grammar of each language system. The corpus used is made up of ten translations of a source text in English: ten different translations of an English original into Spanish and Portuguese by different translators over a period of six decades. As the annotated categories refer to a function's description under a common general theory, they apply to variation in functional organization across language systems, each language having its particular lexicogrammatical realization. This ensures comparability between texts written in different languages to which equivalence is assigned because they stand in a relation of translation between one another.


2 Meta-functional Profiles for Text Comparability across Languages

SFL conceives of language as structured to make three main kinds of meanings simultaneously, namely ideational, interpersonal, and textual meanings, which are fused together in linguistic units, the clause being "the central processing unit in the lexicogrammar" (Halliday & Matthiessen 2004: 10).

The ideational metafunction organizes the experience of the natural world in terms of events and things impacting or impacted by those events (the experiential component). It also chains the events in sequences of logical relations between them (the logical component). The experiential component of the ideational metafunction is realized by the system of transitivity (nuclear and circumstantial). The logical component is realized by taxis (degrees of dependency between clauses) and logical relations. Transitivity is the system assigned to represent things and events as grammatical functions of Participant and Process respectively. Transitive representations can be typologized in a general form as Material (a representation of events external to Participants), Mental (events internal to Participants, or consciousness), Relational (relations between Participants), and Verbal (symbolic events). Logical relations chain up experiential meanings in sequences of adding (extension), restating (elaboration), and focusing on specific aspects (enhancing). Projection sets up a semiotic reality (realis/irrealis) in terms of ideas, desires, wishes, sayings, and hypotheses.

The interpersonal metafunction is sensitive to the relationship between speaker and listener, as well as the interaction types among interlocutors. Power, politeness, humility, familiarity, and expertise relations between interlocutors are organized within language by features of this metafunction. Grammatically, the systems of mood, modality, and polarity enact social interaction through clause types: indicative (declarative/interrogative), imperative, evaluation, and assessment. These enable the speaker to give/demand the commodities of information/service from the listener (Halliday 1978). The Mood Element comprises the functions of Subject (degrees of responsibility, from responsible to impersonal) and Finite (arguability anchored in reference to the speech event as past, present, or future). Modality enables assessing propositions according to the degrees of modalization (probability and frequency) or modulation (obligation and inclination). Polarity is realized grammatically by absolute degrees of commitment between speaker and proposition.

The textual metafunction enables interpersonal and ideational meanings by contextualizing them in a specific situation, according to a specific text type.


Textual grammatical systems construct texture, or the distribution of information (i.e., enabled/contextualized ideational and interpersonal meanings) along the flow of discourse. The main grammatical system that does this job is Theme. Discursively, Theme rearranges each clause to fit context within text. Its main function is either to keep the arrangement of the discourse flow or to shift the arrangement to best fit contextual/text types' new phases.

A metafunctional profile of a clause is a descriptive statement of each choice made in every system within each metafunction. Tab. 1 below illustrates this metafunctional profiling carried out for the first clause in our sample for the original text in English.

Tab. 1: Metafunctional profile of a corpus clause

                         | Although Bertha Young was thirty | she still had moments like this
Interpersonal            | bound clause; realis; responsible; non-interactant | free clause; declarative; realis; responsible; non-interactant
Textual                  | Theme: perspective: initial | Rheme
Ideational: Experiential | Relational clause | Relational clause
Ideational: Logical      | hypotactical dependent clause; enhancing | dominant clause

The systemic functional approach to language is particularly useful for comparability across languages, in that features of systems can be compared independently of the way each one is realized in the lexicogrammar of each language system. Thus it is not the particular realization that is being compared, as when lexical items are compared on the basis, for instance, of word frequency lists, but selections of features in grammatical systems. SFL-based approaches within the field of translation studies have an over three-decade tradition of supporting cross-linguistic text analysis (see Steiner & Yallop 2001). Due to the substantial amount of manual annotation of functions, analysis is carried out on a small corpus scale, yielding results that are interpreted with a view to posing models to be tried on further samples of a corpus or other corpora altogether. The lack of big datasets is compensated for by the depth of analytical power allowed for by a functional analysis.

Our analysis, as shown in this paper, draws on a small corpus and demands substantial annotation of functional features. However, it is sensitive to features that contribute to similarities and dissimilarities between the texts and quantifies them in order to compute distances and generate dendrograms that show us how texts enter into clades of clusters. This clustering behavior is explored in our study as a potential method in evaluating translated text, in the sense that proximity of source and target text as computed on the basis of grammatical functions relevant to text construction can be a criterion, as suggested by Halliday (2001), for judging whether a translated text is a good translation of a given source text from a linguistic perspective. For the corpus analyzed herein, as will be described in the methodology section of this chapter, the criterion used to compare source and target texts was functions manually annotated for a selected text sample and frequencies of occurrences of features in the systems responsible for those functions, which allowed for quantification of data and the adoption of methods of multivariate analysis, such as cluster analysis.

3 Corpus and Methodology

As can be seen in Tab. 2 below, the corpus is made up of a source text, a short story by Katherine Mansfield, written in English and published for the first time in 1918, along with ten of its translations, five of them into Portuguese and five into Spanish, their publication dates ranging from 1940 to 2000.

Tab. 2: The corpus compiled

File | Status | Lang. | Title               | Date | Place     | Author                             | Words
KM   | Source | Eng.  | Bliss               | 1918 | England   | Katherine Mansfield                | 4,774
EV   | Target | Port. | Felicidade          | 1940 | Brazil    | Érico Veríssimo                    | 4,854
ACC  | Target | Port. | Êxtase              | 1980 | Brazil    | Ana Cristina Cesar                 | 4,652
EVS  | Target | Port. | Infinita Felicidade | 1984 | Brazil    | Edla van Steen and Eduardo Brandão | 4,574
JC   | Target | Port. | Felicidade          | 1991 | Brazil    | Julieta Cupertino                  | 4,752
MS   | Target | Port. | Felicidade          | 1993 | Brazil    | Maura Sardinha                     | 4,608
JMS  | Target | Span. | Felicidad           | 1945 | Chile     | Jose Maria Souviron                | 4,520
EA   | Target | Span. | Felicidad           | 1959 | Spain     | Esther de Andreis                  | 4,962
JH   | Target | Span. | Dicha               | 1976 | Argentina | Juana Heredia                      | 4,793
LGEL | Target | Span. | Felicidad Perfecta  | 1998 | Spain     | Lucía Graves and Elena Lambea      | 4,768
JG   | Target | Span. | Éxtasis             | 2000 | Spain     | Juani Guerra                       | 4,931


The source text (Mansfield 1918) was retrieved from the online digital library Internet Archive and saved as a txt file. The ten translated texts were manually scanned from printed sources, proofread, and saved as txt files. An excerpt of the short story was chosen as a sample to be manually annotated. This is the first move in the story as defined by Pagano and Lukin (2010), extending from the opening clause, where the protagonist Bertha is introduced, up to the characters' first exchange of direct speech. It spans the first four paragraphs in the English short story and is made up of seven sentences, comprising twenty-two ranking clauses, four of them being simplexes, i.e. clauses made up of one clause, while the remaining eighteen make up clause complexes in paratactical and hypotactical relations of expansion and projection of locutions and ideas. There are embedded clauses as well. This intricate clause organization recurs in subsequent moves throughout the short story and is as such representative of the text, a microarray of the whole story and one suitable for comparison between source and target texts and between the different target texts themselves.

The excerpts were segmented into ranking clauses and pasted onto a spreadsheet for a metafunctional analysis as proposed by Halliday & Matthiessen (2014). Following the SFL analytical framework, each clause was analysed in terms of the three metafunctional strands operating at clause rank: ideational, interpersonal, and textual. Counts such as the number of paragraphs, sentences, and words were also considered. Author initials were used to label each entry of the spreadsheet. The spreadsheet file was imported in R (R Core Team, 2014) as a data.frame and coerced into a numerical matrix, the rows being each of the corpus texts, so that the dist() function in R could compute the distances between the rows of our data matrix. The values for each analytical subcategory (features) were read as variables, and the counts for each of them were used for computing the distances, using the Euclidean distance measure and the Ward method for cluster linkage. An R script was developed to first group the texts according to word, paragraph, and sentence counts. A second script was used to group texts according to counts for variables related to the grammatical functions annotated. Finally, a third script was run considering all counts for all variables. In the following section, the results obtained in the form of dendrograms through each script are presented and discussed.
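A minimal R sketch of this pipeline is given below; the file name bliss_counts.csv and its layout (one row per text, one column per counted variable) are illustrative assumptions.

# Sketch of the clustering pipeline described above (assumed input layout:
# one row per text, labelled KM, EV, ACC, ..., one column per variable).
counts <- read.csv("bliss_counts.csv", row.names = 1)
m  <- as.matrix(counts)                # coerce the data.frame to a numeric matrix
d  <- dist(m, method = "euclidean")    # Euclidean distances between the texts
hc <- hclust(d, method = "ward.D")     # Ward linkage ("ward" in R versions < 3.1.0)
plot(hc, main = "Hierarchical clustering of source and target texts")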


4 Results

The output of the first script run in R, for the databank containing the data pertaining to word, paragraph, and sentence counts, yielded the dendrogram in Fig. 1.

Fig. 1: Dendrogram showing hierarchical clustering of source and target texts based on paragraph, sentence, and word count (Ward linkage) of Euclidean distances

In a bottom-up reading, right to left, texts group in subsequently formed clades until they ultimately join in a two-clade structure. The first large clade is formed by two separate ones, made up by EA, LGEL, JMS, JG, and EVS on the one hand and EV and JH on the other. These are translations into Spanish and Portuguese. The second big clade is also formed by groups, even though these are more homogeneous than those in the first big clade: JC, KM, and ACC on the one hand, with MS joining them higher up. These are all translations into Portuguese and they are close to KM (the original text in English). The second script run was meant to group texts according to counts for variables related to the grammatical functions annotated. The output can be seen in Fig. 2.


Fig. 2: Dendrogram showing hierarchical clustering of source and target texts based on annotated functional categories (Ward linkage) of Euclidean distances

As can be seen, in a bottom-up reading, right to left, the EV and the JC texts (both translations into Portuguese) form one clade, which is the closest link to the bottom of the diagram, thus showing that these two texts are the most similar and join together first. This clade is linked higher up to KM (the source text in English) and still higher up to ACC (a translation into Portuguese), the four of them clustering into a clear set. A second group is formed by joining JH and LGEL, joined higher up by JMS, then EA, and finally JG. Curiously enough, all these texts are translations into Spanish. The third clear group is formed by the EVS and MS texts, two translations into Portuguese, clearly separate from the two big groups due to their mutual similarity and difference from the rest.

Like the results for the first script run, three of the Portuguese translations (EV, JC, and ACC) are closer to the original text in English (KM). Some of the remaining texts seem to show similar groupings, with the notable exception of MS, now clustering with EVS in a separate clade.

The third script run considered all counts, both for the graphological units (paragraphs, sentences, and words) and for the functions annotated. Its output can be visualized in the dendrogram in Fig. 3.

Fig. 3: Dendrogram showing hierarchical clustering of source and target texts based on both paragraph, sentence, and word count and annotated functional categories (Ward linkage) of Euclidean distances

Fig. 3 shows the same clustering pattern as that in Fig. 2, which may be interpreted as an indication that data on word, paragraph, and sentence count have no impact on the groups that may be obtained from clustering the texts on the basis of the functional categories annotated. When we consider graphological units only (Fig. 1), however, the clusters are different, except for one grouping common to Fig. 1, Fig. 2, and Fig. 3, which is the clade formed by the source text in English (KM) and the two translated texts into Portuguese (ACC and JC). This may be interpreted as an indication that similarity between these texts encompasses both features pertaining to graphological units (word, sentence, and paragraph) and grammatical functions, and that there can be a relation between the two. Nonetheless, the different patterns obtained in Fig. 1, Fig. 2, and Fig. 3 for the remaining texts seem to point to differences between these two types of features and can be taken to mean that functional categories can be more promising in text analysis at more depth. The following section presents the concluding remarks of our study and its implications for further research.

5 Conclusions

Our approach to measuring proximity between source and target texts in translation draws on categories ascribed to the choices made by each author in the lexicogrammar of each language to realize analogous functions. The results obtained from clustering texts on the basis of two different measurements—paragraph, sentence, and word count vs. annotated functional category count—show different grouping patterns, while pointing to one group in common made up by the source text in English and two of its translations. The fact that these two translations are more similar to the source text in terms of graphological units but also, and more importantly, in terms of grammatical functions, suggests that grammatical functions are a useful criterion to attest similarity between source and target texts, and possibly to judge the quality of translations from a linguistic perspective.

Drawing on Ke (2012), who investigates whether the similarity shown by dendrograms can be correlated to scores attributed to texts by a board of expert assessors, a next step in our study is to collect data on rankings of the translated texts performed by human assessors in order to see which translated texts are assessed as being closer to the source text. These assessment data will be analyzed in order to verify whether the results yielded by automatic clustering match human assessment, in other words, whether translated texts more closely grouped with the source text through the clustering methods are ranked higher by human assessors. This step is expected to provide conclusive evidence as to the potential of performing cluster analysis of annotated samples of source and target texts as a novel way to tackle translation assessment.


References

Gries, S. Th., & Wulff, S. (2012). Regression Analysis in Translation Studies. In M. Oakes & M. Ji (Eds.), Quantitative Methods in Corpus-Based Translation Studies: A Practical Guide to Descriptive Translation Research (pp. 35–52). Amsterdam: John Benjamins.
Halliday, M. A. K. (1978). Language as Social Semiotic: the Social Interpretation of Language and Meaning. London/Baltimore: Edward Arnold / University Park Press.
Halliday, M. A. K. (2001). Towards a Theory of Good Translation. In E. Steiner & C. Yallop (Eds.), Exploring Translation and Multilingual Text Production: beyond Content (pp. 13–18). Berlin: Walter de Gruyter.
Halliday, M. A. K., & Matthiessen, C. M. I. M. (2014). Halliday's Introduction to Functional Grammar. London: Arnold.
Ji, M., & Oakes, M. P. (2012). A Corpus Study of Early English Translations of Cao Xueqin's Hongloumeng. In M. Oakes & M. Ji (Eds.), Quantitative Methods in Corpus-Based Translation Studies: A Practical Guide to Descriptive Translation Research (pp. 177–208). Amsterdam: John Benjamins.
Ke, S.-W. (2012). Clustering a Translational Corpus. In M. Oakes & M. Ji (Eds.), Quantitative Methods in Corpus-Based Translation Studies: A Practical Guide to Descriptive Translation Research (pp. 149–174). Amsterdam: John Benjamins.
Oakes, M. P. (2012). Describing a Translational Corpus. In M. Oakes & M. Ji (Eds.), Quantitative Methods in Corpus-Based Translation Studies: A Practical Guide to Descriptive Translation Research (pp. 115–147). Amsterdam: John Benjamins.
Pagano, A., & Lukin, A. (2010). Exploring Language in Verbal Art: A Case Study of Katherine Mansfield's Bliss. Paper presented at the 22nd European Systemic Functional Linguistics Conference and Workshop, Univerza na Primorskem, Koper, 9–12 July.
Pagano, A., Figueredo, G., & Lukin, A. (2014). Modelling Proximity in a Corpus of Literary Retranslations: a Methodological Proposal for Clustering Texts Based on Systemic-Functional Annotation of Lexicogrammatical Features. In M. Ji (Ed.), Empirical Translation Studies: Interdisciplinary Methodologies Explored. London: Equinox. (to appear)
R Core Team. (2014). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/
Rybicki, J. (2012). The Great Mystery of the (Almost) Invisible Translator: Stylometry in Translation. In M. Oakes & M. Ji (Eds.), Quantitative Methods in Corpus-Based Translation Studies: A Practical Guide to Descriptive Translation Research (pp. 231–248). Amsterdam: John Benjamins.
Rybicki, J., & Eder, M. (2011). Deeper Delta across Genres and Languages: Do We Really Need the Most Frequent Words? Literary and Linguistic Computing, 26(3), 315–321.
Rybicki, J., & Heydel, M. (2013). The Stylistics and Stylometry of Collaborative Translation: Woolf's "Night and Day" in Polish. Literary and Linguistic Computing, 28(4), 708–717.
Steiner, E., & Yallop, C. (Eds.). (2001). Exploring Translation and Multilingual Text Production: beyond Content. Berlin: Walter de Gruyter.

Vasiliy Poddubnyy1 – Anatoly Polikarpov2

1 National Research Tomsk State University; [email protected]
2 Lomonosov Moscow State University; [email protected]

Evolutionary Derivation of Laws for Polysemic and Age-Polysemic Distributions of Language Sign Ensembles*

* This publication is prepared within the framework of the scientific project No. 14-14-70010 supported by the Russian Humanitarian Scientific Fund.

Abstract: A continuous stochastic dynamic mathematical model for the evolution of the polysemy of natural language signs is offered. The model is based on the assumption of the dissipative nature of the development of polysemic linguistic signs. Based on this model, theoretical laws for synchronous (simultaneous) probability distributions for signs' ensembles are derived (age-polysemic and polysemic distributions). Theoretically derived conclusions are compared with the corresponding empirical polysemic distributions for lexical signs obtained from representative explanatory dictionaries of Russian and English.

Keywords: linguistic signs, polysemy, evolution, dissipative stochastic mathematical models, derivation of laws, polysemic and age-polysemic distributions, model identification

1 Introduction and Basic Assumptions of the Model

The proposed mathematical model is based on the assumption of the dissipative nature of the development of linguistic signs' polysemy (Poddubnyy & Polikarpov 2011: 103−124; 2013: 69−83). This assumption means that each linguistic sign at the moment of its birth has an individual limit on its ability to generate (to acquire) a certain quantity of meanings through its lifetime. This ability is called associative semantic potential (ASP). The ASP of a sign is gradually wasted in the course of the sign's use. These microchanges accumulate with any act of use of signs, leading in time to some macroscopic results—first of all, to the development of a sign's polysemy. The rate of the birth of new meanings in the history of a sign at any step in its semantic development should be proportional to the still unspent portion of its ASP. Therefore, the rate of new meaning births gradually slows in time. Meanings appearing later in the history of a sign are usually relatively more abstract than initial meanings.

At the same time, but with a delay of time τ0, a similar process of accumulation of lost meanings begins, starting from the initial meanings of signs. Why does the process of losing meanings begin from the initial meanings of signs? Because there is a tendency for initial meanings to be relatively more specific, and therefore relatively less stable, than subsequent, more and more abstract meanings. That is also why the process of loss gradually slows down.

Our previous work was based on a discrete version of this model using simulation methods. In this paper, we present a continuous version of this model using analytical methods. On this basis, we attempt to derive theoretical laws for polysemic and age-polysemic distributions of signs' ensembles.

We denote ASP with the variable G. The current polysemy of a sign at any time t of its lifecycle is expressed by the difference x(t) = x1(t) − x2(t) between the processes of gaining new meanings x1(t) and losing previously acquired meanings x2(t). A continuous model suggests that these processes are continuous and subject to linear differential equations of the form

\frac{dx_1(t)}{dt} = \frac{1}{\tau_1}\,(G - x_1(t)), \qquad x_1(t_0) = 1, \quad t \ge t_0, \qquad (1)

\frac{dx_2(t)}{dt} = \frac{1}{\tau_2}\,(G - x_2(t)), \qquad x_2(t_0 + \tau_0) = 0, \quad t \ge t_0 + \tau_0, \qquad (2)
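Since (1) and (2) are linear first-order differential equations, they admit closed-form solutions; the following worked step is added here for clarity and is not part of the original derivation:

x_1(t) = G - (G - 1)\,e^{-(t - t_0)/\tau_1}, \qquad t \ge t_0,

x_2(t) = G\left(1 - e^{-(t - t_0 - \tau_0)/\tau_2}\right), \qquad t \ge t_0 + \tau_0,

so that the current polysemy is x(t) = x_1(t) − x_2(t).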

where τ1 = a1/G and τ2 = a2/G are the time constants of the growth and decline of polysemy, inversely proportional to the ASP, with coefficients of proportionality a1 and a2 respectively, and τ1 […]

Tab. 7: Order of complements (deep case)

Verbs | Order of complements
au (meet) | ([4] Theme, [1] Subject) > [9] By or at > ([10] With, [3] IO or direction)
hataraku (work) | ([2] DO, [3] IO or direction, [1] Subject) > ([9] By or at, [10] With)
yaburu (tear, break) | ([4] Theme, [1] Subject) > ([9] By or at, [2] DO)
umareru (be born, arise) | ([9] By or at, [4] Theme, [3] IO or direction, [1] Subject, [6] From)
ugoku (move) | ([4] Theme, [1] Subject, [9] By or at, [8] Until or to, [6] From, [2] DO, [3] IO or direction, [10] With, [5] Direction) while ([1] Subject > [3] IO or direction)
ataeru (give) | ([4] Theme, [1] Subject) > ([9] By or at, [3] IO or direction, [6] From) > [2] DO
6 verbs | ([4] Theme, [1] Subject) > ([8] Until or to, [9] By or at, [6] From, [3] IO or direction) > ([2] DO, [10] With, [5] Direction)

hataraku (work)

([2] DO, [3] IO or direction, [1] Subject) > ([9] By or at, [10] With)

yaburu (tear, break)

([4] Theme, [1] Subject) > ([9] By or at, [2] DO)

umareru (be born, arise)

([9] By or at, [4] Theme, [3] IO or direction, [1] Subject, [6] From)

ugoku (move)

([4] Theme, [1] Subject, [9] By or at, [8] Until or to, [6] From, [2] DO, [3] IO or direction, [10] With, [5] Direction) while ([1] Subject > [3] IO or direction) ([4] Theme, [1] Subject) > ([9] By or at, [3] IO or direction, [6] From) > [2] DO ([4] Theme, [1] Subject) > ([8] Until or to, [9] By or at, [6] From, [3] IO or direction) > ([2] DO, [10] With, [5] Direction)

ataeru (give) 6 verbs

Tab. 8: Order of postpositions (surface case)

Verbs | Order of postpositions
au (meet) | ([4] wa, [1] ga) > [9] de > ([10] to, [3] ni)
hataraku (work) | ([2] wo, [4] wa, [3] ni) > ([1] ga, [9] de, [10] to)
yaburu (tear, break) | ([3] ni, [4] wa, [1] ga) > ([9] de, [2] wo)
umareru (be born, arise) | ([9] de, [4] wa, [3] ni, [6] kara, [1] ga)
ugoku (move) | [4] wa > ([9] de, [8] made, [6] kara, [1] ga) > ([3] ni, [10] to, [5] e, [2] wo)
ataeru (give) | [4] wa > ([1] ga, [9] de, [3] ni, [6] kara) > [2] wo
6 verbs | [4] wa > ([1] ga, [8] made, [9] de, [6] kara, [3] ni) > ([2] wo, [10] to, [5] e)

7 Discussions and Conclusions

It can be considered that in the case of the verbs umareru (be born, arise) and ugoku (move), most of the scores of complements or postpositions for the order of the pair are not significant. In the case of other verbs, there are groups of complements or postpositions, e.g. ([3] ni, [4] wa, [1] ga) and ([9] de, [2] wo) for yaburu (tear, break), and the order of these groups is significant.

Among complements or postpositions, [4] wa is placed in the relatively "higher" position in the sentence, and [1] ga follows it. Derivative forms with [4] wa like niwa or towa, which are categorized into the respective groups, seem to be placed in the "higher" position in the sentence. [4] wa has the role of the topic marker, and this can be interpreted as the writer's aim to show a keyword to readers early in the sentence. This problem will be one of our future tasks.

The order of a DO and an IO for a verb like ataeru (give) has been discussed before (Saeki 1960; Tokunaga & Tanaka 1991; Koizumi & Tamaoka 2004; Sawa 2004). In the present study the order is significant, and [3] IO or direction has a "higher" position in the sentence than [2] DO. [2] DO seems to have a relatively "lower" position in the sentence. This can be interpreted as meaning that a DO needs less distance from the predicate of the sentence, since Japanese is an SOV language.

It can be seen that the order of postpositions is individual to each verb. For example, the postposition [2] wo takes a "lower" position in the sentence containing ugoku (move) or ataeru (give), while it takes a "higher" position in the sentence with hataraku (work). We often discuss postpositions regardless of verbs. However, we should consider postpositions and verbs as a set if it is clear that postpositions behave differently under individual verbs.

The order of complements or postpositions in the SOV sentence can be interpreted as a decision between a requirement to show information early and a requirement to connect to the predicate. It might be an example of self-regulation (Köhler 2012). For example, in the case of the order of DO and IO for a verb like ataeru (give), a DO appears early in the sentence when the "object which is given" is important information for the speaker and the hearer, and so the DO-IO order is chosen. However, from the point of view of a "clearly" understood sentence, a DO should be close to the predicate (a "lower" position in the SOV language), and the IO-DO order must be chosen. This means that the order is decided by a balance between the requirement of less memory (a preference for the left position) and the requirement of less complexity (a preference for the right position in an SOV language). We have to consider the case of an SOV language, where the "lower" (or right) position in the sentence means a position close to the predicate, and the sentence has less complexity if the complement and the predicate have less distance. Therefore we may need to modify Köhler's self-regulation model. That is also one of our future tasks.


Acknowledgements

The study is partly supported by the Grant-in-Aid for Scientific Research (No. 23520567) of the Japan Society for the Promotion of Science (JSPS).

References

Kobayashi, S. (2000). On Factors that Determine the Order between NI Noun Phrase and O Noun Phrase in Japanese: an Analysis Based on Examination of a Corpus of Newspaper Articles [Nikaku meishiku to wokaku meishiku no gojun no yoin ni tsuite: shinbun kiji zenbun corpus ni motoduku bunseki]. Tsuru Bunka Daigaku Kiyo (= Bulletin of Tsuru Bunka University), 52, 105–124.
Köhler, R. (2012). Quantitative Syntax Analysis. Berlin: Mouton de Gruyter.
Koizumi, M., & Tamaoka, K. (2004). Cognitive Processing of Japanese Sentences with Ditransitive Verbs. Gengo Kenkyu (= Journal of the Linguistic Society of Japan), 125, 173–190.
Koso, A., Hagiwara, H., & Soshi, T. (2004). What a Multi-Channel EEG System Reveals about the Processing of Japanese Double Object Constructions [Sanko doshibun shori no tachannel noha kenkyu]. IEICE Technical Report. Thought and Language, 104(170), 31–36.
Muraoka, S., Tamaoka, K., & Miyaoka, Y. (2004). The Effects of Case Markers on Processing Scrambled Sentences. IEICE Technical Report. Thought and Language, 104(170), 37–42.
Noda, H. (1996). Post Positions wa and ga [Wa to ga]. Tokyo: Kuroshio Shuppan.
Ogino, T., Kobayashi, M., & Isahara, H. (2003). Verb Valency in Japanese [Nihongo Doshi no Ketsugoka]. Tokyo: Sanseido.
Saeki, T. (1960). Typical Pattern of the Word Order in the Present Japanese [Gendaibun ni okeru gojun no keiko]. In Language and Life [Gengo Seikatsu] (Vol. 111, pp. 56–63). Tokyo: Chikuma Shobo.
Sanada, H. (2012). Quantitative Approach to Frequency Data of Japanese Postpositions and Valency [Joshi no Shiyo Dosu to Ketsugoka ni Kansuru Keiyoteki Bunseki Hoho no Kento]. Rissho Daigaku Keizaigaku Kiho (= The Quarterly Report of Economics of Rissho University), 62(2), 1–35. (in Japanese)
Sanada, H. (2013). Distributions of Frequencies of Postpositions for the Subject and the Ellipsis of the Subject: a Quantitative Analysis Employing the Valency Theory [Shukaku wo Shimesu Joshi no Hindo Bunpu to Shukaku Shoryaku no Mondai: Ketsugoka Riron wo Sanko ni Shita Keiryoteki Kenkyu]. Gakugei Kokugo Kokubungaku (= Journal of Japanese Linguistics and Literature of Tokyo Gakugei University), 45, 1–14.
Sanada, H. (2014). The Choice of Postpositions of the Subject and the Ellipsis of the Subject in Japanese. In G. Altmann, R. Čech, J. Mačutek & L. Uhlířová (Eds.), Empirical Approaches to Text and Language Analysis (pp. 190–206). Lüdenscheid: RAM-Verlag.
Sawa, T. (2004). The Effects of Word-order Information on the Syntactic and Semantic Processing of Japanese Sentences [Nihongobun no togoteki imiteki shori ni oyobosu gojun no eikyo ni tsuite: online ho ni yoru kento]. Bulletin of Tokyo Gakugei University (Series I: Science of Education) [Tokyo Gakugei Daigaku Kiyo. Dai 1 Bumon. Kyoiku Kagaku], 55, 285–291.


Sommerfeldt, K.-E., & Schreiber, H. (1977a). Wörterbuch zur Valenz und Distribution der Substantive. Leipzig: Bibliographisches Institut.
Sommerfeldt, K.-E., & Schreiber, H. (1977b). Wörterbuch zur Valenz und Distribution deutscher Adjektive. Leipzig: Bibliographisches Institut.
Tesnière, L. (1959, 1988). Éléments de Syntaxe Structurale (2nd ed.). Paris: Klincksieck.
Tokunaga, T., & Tanaka, H. (1991). On Estimating Japanese Word Order Based on Valency Information [Ketsugoka joho ni motoduku nihongo gojun no suitei]. Mathematical Linguistics [Keiryo Kokugogaku], 18(2), 53–65.

Jacques Savoy

University of Neuchatel; [email protected]

Authorship Attribution Using Political Speeches

Abstract: This paper describes a set of authorship attribution experiments using 224 State of the Union addresses delivered by 41 US presidents from 1790 to 2014. We view each speech as a compound signal including both the style and the topical terms selected by the author. We can also detect authorship differences based on the POS tags and their morphological information. Even if the president was not always the real author, we assume that one author corresponds to one presidency. Based on the top most frequent words (MFW), including a large number of functional words (FW), we can achieve an accuracy rate of up to 89% (200 correctly assigned speeches out of 224). The POS tags alone do not offer a high performance, and their combination with the MFW does not enhance the performance over a representation based on MFW only. Using only content-bearing terms, Labbé's measure obtains a high accuracy of around 91% (204 of 224). The combination of MFW and content words tends to slightly improve the overall performance (206 of 224) compared to the MFW-based representation. Finally, the analysis of some incorrect assignments reveals interesting political relationships.

Keywords: authorship attribution, discourse analysis, political speeches

1 Introduction

Authorship attribution (Juola 2006; Craig & Kinney 2009) aims to determine, as accurately as possible, the author of a disputed text (e.g., part of a play) based on text samples written by known authors. Under this general definition, we can find the closed-class attribution problem, where the real author is one of the given candidates. In the open-set problem, the real author could be one of the specified authors or another one. Authorship attribution can be limited to demographic or psychological information on an author (profiling) (Argamon et al. 2009) or simply to determining whether or not a given author did in fact write a given text (chat, e-mail, testimony) (verification) (Koppel et al. 2009).

To solve this question, various authorship attribution models attempt to derive the style of the disputed text and those corresponding to the possible candidates (author profiles). These stylistic representations are usually based on either the frequency analysis of functional words (FW) (determiners, pronouns, prepositions, conjunctions, and certain auxiliary and modal verbs) or on the top m most frequent words (MFW), with m between 50 and 1,000. As variants, other studies have considered letter frequencies (Merriam 1998) or short sequences of letters (n-grams), but always by considering the top m most frequent ones.

As a second source of evidence to describe an author's style, we can consider the relative frequencies of various parts-of-speech (POS), with or without morphological information (e.g., simply the tag verb vs. verb, 3rd person singular, present tense). Moreover, instead of being limited to isolated tags, we can consider short sequences of POS tags (e.g., adj-noun-noun) and their relative frequencies. These POS tags can be determined by applying a freely available POS tagger (Toutanova et al. 2003).

As a third source of evidence, we can account for topical or content-bearing words: words belonging to the noun, adjective, verb, or adverb categories that contribute more to the meaning. Some studies suggest applying a feature selection to extract the more discriminative terms (Savoy 2012), while others propose to use all of them (Labbé 2007). Of course, instead of using these three sources of evidence separately, we can combine them in order to improve the overall performance of the system. Such approaches have already been proposed, but mainly by focusing on combining FW (or MFW) with POS information. In this paper, we consider taking into account topical words as well.

This paper is structured as follows. The next section provides an overview of the authorship attribution schemes used in our experiments. Section 3 describes the main characteristics of our evaluation corpus. An evaluation and analysis of the results are presented in Section 4, and the last section recaps the main findings of this study.

2 Authorship Attribution Methods

Different automatic authorship attribution systems have been proposed. In this paper, we will focus on distance-based models using words to represent the style of each author. From this perspective, Section 2.1 describes the Delta rule suggested by Burrows (2002) based on the top MFW. As an alternative, Section 2.2 describes the Kullback-Leibler Divergence (KLD) approach (Zhao & Zobel 2007) based on a pre-defined set of words, mainly FW. In Section 2.3, we present the general idea of computing an intertextual distance, as proposed by Labbé (2007). Finally, in Section 2.4, we illustrate the machine learning paradigm with the naïve Bayes model (Mitchell 1997).

2.1 Burrows' Delta

To determine the probable author of a disputed text, Burrows (2002) proposes taking into account the most frequent words (MFW), including many functional words (FW). Burrows suggests considering from 40 to 150 MFW. Of course, this limit is arbitrary and we can explore the performance up to 1,000 MFW (Hoover 2004). When comparing two texts, Burrows (2002) suggests that the second most important aspect is not the use of absolute frequencies, but rather their standardized scores (Z scores). To calculate this Z score for each word ti in a sample of texts, we apply Equation (1), in which rtfij indicates the relative term frequency of term ti in a document dj, meani its mean, and sdi its standard deviation when considering the underlying text samples.

Z\,score(t_{ij}) = \frac{rtf_{ij} - mean_i}{sd_i} \qquad (1)

From the Z score value attached to each word, we can compute a distance between two texts. Given a query text Q, an author profile Aj (concatenation of all his/her writings), and a set of terms ti, for i = 1, 2, …, m, we compute the Delta value by applying Equation (2).

Delta(Q, A_j) = \frac{1}{m} \sum_{i=1}^{m} \left| Z\,score(t_{iq}) - Z\,score(t_{ij}) \right| \qquad (2)

In this formulation we attach the same importance to each term ti, independently of their absolute occurrence frequencies. The most probable author is the one depicting the smallest Delta distance with the disputed text.
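A minimal R sketch of Equations (1) and (2) follows; the small matrix of relative term frequencies and the profile labels are illustrative assumptions, and the Z scores are computed over all samples (profiles plus query) for simplicity.

# Sketch of Burrows' Delta. rtf: relative term frequencies, one row per text
# (author profiles plus the query), one column per most frequent word.
delta <- function(rtf, query, profiles) {
  z <- scale(rtf)                            # per-word Z scores (Equation 1)
  sapply(profiles, function(a)
    mean(abs(z[query, ] - z[a, ])))          # Equation (2)
}

rtf <- rbind(A1 = c(0.031, 0.020, 0.012),    # toy relative frequencies
             A2 = c(0.025, 0.028, 0.010),
             A3 = c(0.040, 0.015, 0.016),
             Q  = c(0.033, 0.019, 0.013))
d <- delta(rtf, "Q", c("A1", "A2", "A3"))
names(which.min(d))                          # smallest Delta = most probable author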

2.2 Kullback-Leibler Divergence

To define a functional word (FW) list, we can include all closed POS categories (namely determiners, pronouns, prepositions, and conjunctions). The decision is less clear when we analyze the auxiliary and modal verbs. From this perspective, Zhao & Zobel (2007) suggest considering a priori a limited number of predefined words to discriminate between different author profiles. Their proposed English list contains 363 terms, mainly function words. For each word in this list, a probability of occurrence associated with each author of the disputed text must be estimated. As a direct estimate for word ti (denoted ProbQ[ti] or Probj[ti]), we can apply the maximum likelihood principle and estimate it as shown in the left part of Equation (3).

Prob[t_i] = \frac{tf_i}{n} \quad \text{or} \quad Prob[t_i] = \frac{tf_i + \lambda}{n + \lambda\,|V|} \qquad (3)

where tfi indicates the absolute term frequency (or the number of occurrences) of word ti in the text or sample, and n the sample size (number of tokens). This first solution tends to overestimate the occurrence probability of words appearing in the sample, at the expense of the missing terms. To correct this problem, we can apply a Laplace smoothing that adds 1 to the numerator in Equation (3) and likewise adds the vocabulary size (denoted by |V|) to the denominator (Manning & Schütze 1999). This approach can then be generalized by using a λ parameter (Lidstone's law), resulting in the probability estimate depicted in the right part of Equation (3). In our experiments we fixed the λ value to 0.1.

Based on these estimations, we can measure the degree of disagreement between two probabilistic distributions. To achieve this objective, Zhao & Zobel (2007) suggest using the Kullback-Leibler Divergence (KLD) formula shown in Equation (4).

KLD(Q \,\|\, A_j) = \sum_{i=1}^{m} Prob_Q[t_i] \cdot \log_2 \frac{Prob_Q[t_i]}{Prob_j[t_i]} \qquad (4)

where ProbQ[ti] and Probj[ti] indicate the occurrence probability of the word ti, for i = 1, 2, …, m, in the query text Q or in the Aj author profile, respectively. The smallest KLD value indicates the most probable author of the disputed text.
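A minimal R sketch of Equations (3) and (4), with Lidstone smoothing and λ = 0.1 as in the experiments; the toy word counts are illustrative assumptions, and the word-list size stands in for |V|.

# Sketch of the KLD scheme. tf_q and tf_a: absolute frequencies of the
# pre-defined word list in the query text and in one author profile.
lidstone <- function(tf, lambda = 0.1)
  (tf + lambda) / (sum(tf) + lambda * length(tf))   # right part of Equation (3)

kld <- function(tf_q, tf_a) {
  p <- lidstone(tf_q)
  q <- lidstone(tf_a)
  sum(p * log2(p / q))                              # Equation (4)
}

tf_q <- c(the = 50, of = 30, and = 22, to = 18)     # toy counts
tf_a <- c(the = 480, of = 350, and = 190, to = 240)
kld(tf_q, tf_a)   # smallest divergence across profiles = most probable author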

2.3 Labbé's Intertextual Measure

Instead of being limited to the top m MFW (Section 2.1) or to a pre-defined set of FW (Section 2.2), Labbé (2007) suggests computing an intertextual distance based on the whole vocabulary. This measure can take a value between 0 and 1 depending on the degree of overlap between the two texts. A value of 0 indicates that the two texts are identical. A distance of 1 specifies that the two speeches have nothing in common.


More formally, this distance, denoted D(Q,A), between the query text Q and an author profile A is given by Equation (5), where nq indicates the length (number of tokens) of text Q, and tfiq denotes the (absolute) term frequency of word i. The length of the vocabulary used to discriminate between the different texts is designated by m. Usually the two texts do not have the same length (in our case, we assume that the length of text A is larger than that of text Q). We therefore need to reduce the longest text by multiplying each of its term frequencies (tfiA) by the ratio of the two lengths, as indicated in the second part of Equation (5).

D(Q,A) = \frac{\sum_{i=1}^{m} \left| tf_{iq} - \widehat{tf}_{iA} \right|}{2 \cdot n_q}, \quad \text{with} \quad \widehat{tf}_{iA} = tf_{iA} \cdot \frac{n_q}{n_A} \quad \text{and} \quad n_q = \sum_{i=1}^{m} tf_{iq} \qquad (5)

Finally, to return valid measurements, the length ratio between the two texts must be smaller than eight, and each text must contain at least 5,000 words. The smallest value for D(Q,A) indicates the most probable author.
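A minimal R sketch of Equation (5) follows; the toy counts are illustrative, and the validity conditions just mentioned (length ratio, minimal text size) are not enforced here.

# Sketch of Labbé's intertextual distance. tf_q and tf_a: absolute term
# frequencies over a shared vocabulary; text A assumed longer than text Q.
labbe <- function(tf_q, tf_a) {
  n_q <- sum(tf_q)
  n_a <- sum(tf_a)
  tf_a_hat <- tf_a * n_q / n_a             # rescale the longer text (Equation 5)
  sum(abs(tf_q - tf_a_hat)) / (2 * n_q)    # 0 = identical, 1 = nothing in common
}

labbe(c(10, 5, 0, 3), c(80, 20, 15, 0))    # toy example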

2.4 Naïve Bayes

Until now, we have presented authorship attribution methods following the distance-based paradigm. As another view, we can apply a machine learning approach (Sebastiani 2002). As a typical and simple text classifier, we choose the naïve Bayes model (Mitchell 1997) to determine the possible author among the set of possible candidates (or hypotheses), denoted by Aj. To define the probable author of a query text Q, the naïve Bayes model selects the one maximizing Equation (6), in which tiq represents the ith word included in the query text Q, and nq indicates the size of the query text.

\arg\max_{A_j} Prob[A_j \mid Q] = \arg\max_{A_j} \; Prob[A_j] \prod_{i=1}^{n_q} Prob[t_{iq} \mid A_j] \qquad (6)

To estimate the prior probabilities (Prob[Aj]), we can take into account the proportion of texts written by each author or, as in this experiment, fix all prior probabilities to the same value (uniform distribution). To determine the term probabilities, we regroup all texts belonging to the same author to form the author profile. For each word ti, we then compute the ratio between its occurrence frequency in the corresponding author profile Aj (tfij) and the size of this sample (nj).

Prob[t_i \mid A_j] = \frac{tf_{ij}}{n_j} \quad \text{or} \quad Prob[t_i \mid A_j] = \frac{tf_{ij} + \lambda}{n_j + \lambda\,|V|} \qquad (7)


This definition (see Equation  (7)) tends to overestimate the probabilities of terms occurring in the text with respect to missing terms. For the latter, the occurrence frequency (and probability) was 0, so a smoothing approach had to be applied to correct this. As for the other methods, we will apply Lidstone’s law for smoothing as shown in the right part of Equation (7).
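A minimal R sketch of Equations (6) and (7), computed in log-space to avoid numerical underflow; the toy profiles and the shared-vocabulary layout are illustrative assumptions.

# Sketch of the naive Bayes attribution. profiles: list of term-frequency
# vectors over a shared vocabulary; query: token counts of the disputed text.
nb_classify <- function(query, profiles, lambda = 0.1) {
  scores <- sapply(profiles, function(tf) {
    p <- (tf + lambda) / (sum(tf) + lambda * length(tf))  # Equation (7)
    sum(query * log(p))   # log of the product in Equation (6);
  })                      # uniform priors drop out of the argmax
  names(which.max(scores))
}

profiles <- list(A1 = c(the = 300, war = 12, peace = 40),  # toy profiles
                 A2 = c(the = 280, war = 55, peace = 9))
nb_classify(c(the = 30, war = 6, peace = 1), profiles)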

3 The State of the Union Addresses

The choice of the State of the Union addresses (Hoffman & Howard 2006) as an evaluation corpus can be explained by the following reasons. First, political speeches are without copyright and freely available. Second, they are correctly written and are usually easy to understand, unlike SMS or some blog posts. Third, as the State of the Union addresses indicate the main concerns and priorities of the tenant of the White House, they have a worldwide interest and impact.

To create this corpus, we downloaded all the addresses from the web site www.presidency.ucsb.edu. This corpus contains 224 speeches delivered by 41 US presidents. The first address was uttered by G. Washington (January 8, 1790) and the last by B. Obama (January 28, 2014). For two presidents (W. H. Harrison (1841) and J. A. Garfield (1881)), we do not have any State of the Union addresses because their terms were too short (a few months). We have also removed the single address given by Taylor (1849) because we cannot train and test an attribution scheme using a single text for one possible author. Cleveland appears twice as president (1885–1888 and 1893–1896), corresponding to his two terms interrupted by B. Harrison's presidency (1889–1892).

To represent each speech, we can use word-tokens (or simply tokens) (e.g., is, were, been or armies, army) or lemmas (entries in the dictionary). To define the corresponding lemma for each token, we used the part-of-speech (POS) tagger developed by Toutanova et al. (2003). Analysis of the speeches gives the mean length as 8,731 tokens (standard deviation 5,860). The longest speech was delivered by Taft in 1910 (30,773 tokens) and the shortest by Washington in January 1790 (1,180 tokens). When considering the mean length per president, Adams (1797–1800) wrote the shortest speeches (an average of 1,931 word-tokens per speech) while Taft (1909–1912) is, on average, the author of the longest addresses (24,655 word-tokens).


4 Evaluation Knowing all speeches except one, can we attribute this last address to its presidency? This evaluation methodology is called leaving-one-out and guarantees that the training set does not contain the text to be classified. We then iterate over 224 State of the Union addresses to obtain the overall performance of a given authorship attribution scheme. In our experiments, we also assume that the same author is behind all speeches covering a given presidency. Of course, this is not strictly exact because we know that behind each president there is usually a speechwriter. For example, behind Kennedy we can find the name Sorensen (Carpenter & Seltzer 1970), Favreau behind Obama, and even Madison & Hamilton behind some speeches delivered by Washington. But, as Sorensen said: If a man in a high office speaks words which convey his principles and policies and ideas and he’s willing to stand behind them and take whatever blame or therefore credit go with them, [the speech is] his.

In the first set of experiments, we ground the attribution decision on the top m MFW, with m varying from 50 to 600. The results, depicted in Tab. 1, indicate that using fewer than 200 MFW is not a good practice. Usually, working with around the top 300 MFW, we achieve a good overall performance, and even the best for the KLD scheme. The Delta scheme achieves the best performance (200 correct assignments, corresponding to a success rate of 89.3%) when considering 500 MFW. Labbé's intertextual distance tends to work better with more words (the best result is attained with 500 MFW).

Tab. 1: Performance over 224 speeches using only the top m MFW

# MFW  Delta        KLD          Labbé        Naïve Bayes
50     172 (76.8%)  178 (79.5%)  170 (75.9%)  181 (80.8%)
100    188 (83.9%)  189 (84.4%)  175 (78.1%)  194 (86.6%)
150    190 (84.8%)  192 (85.7%)  181 (80.8%)  192 (85.7%)
200    190 (84.8%)  194 (86.6%)  185 (82.6%)  196 (87.5%)
250    192 (85.7%)  194 (86.6%)  187 (83.5%)  196 (87.5%)
300    199 (88.8%)  197 (87.9%)  188 (83.9%)  192 (85.7%)
350    199 (88.8%)  195 (87.1%)  187 (83.5%)  193 (86.2%)
400    198 (88.4%)  197 (87.9%)  189 (84.4%)  189 (84.4%)
500    200 (89.3%)  196 (87.5%)  192 (85.7%)  189 (84.4%)
600    198 (88.4%)  182 (81.3%)  192 (85.7%)  188 (83.9%)
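Since the Delta scheme performs well here, a brief sketch of its core computation may help. It follows the usual formulation of Burrows' (2002) Delta over the top-m MFW; the vectorized form and variable names are my illustration, not code from the chapter:

    import numpy as np

    def delta(test_profile, author_profile, mu, sigma):
        # all arguments are length-m vectors over the top-m MFW:
        # relative frequencies (profiles), corpus-wide mean and sd per word
        z_test = (test_profile - mu) / sigma      # standardize the disputed text
        z_author = (author_profile - mu) / sigma  # standardize the author profile
        return float(np.mean(np.abs(z_test - z_author)))  # smaller = more similar

The disputed speech is then assigned to the president whose profile yields the smallest Delta value.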


As a second source of evidence, we can consider POS tags resulting from eight possible categories (noun, verb, adjective, adverb, preposition, conjunction, pronoun, and determiner) as well as the punctuation class. We think it is more appropriate to also take account of the morphological information. Thus we can discriminate between nouns in singular or plural form, or with verbs we can add the person, number and tense information. With this morphological information, we have 35 distinct POS tags. Using only this evidence to describe the different styles, the first row of Tab. 2 indicates that this stylistic representation does not produce high performance levels. We can, however, combine the POS information with the top 300 or top 500 MFW. As shown in the last two rows of Tab. 2, this combination offers better performance. However, when considering only the m MFW (see Tab. 1), we can obtain, on average, a better performance.

Tab. 2: Performance using the POS and morphological information (224 speeches)

               Delta        KLD          Labbé        Naïve Bayes
POS only       88 (39.3%)   96 (42.9%)   154 (68.8%)  99 (44.2%)
POS & 300 MFW  194 (86.6%)  184 (82.1%)  182 (81.3%)  182 (81.3%)
POS & 500 MFW  200 (89.3%)  192 (85.7%)  184 (82.1%)  184 (82.1%)

As a third source of evidence, we can use only topical or content-bearing words. This choice can be explained by the fact that the same idea or concept can be expressed using various formulations (Furnas et al. 1987). The choice of the words is not arbitrary, and when discussing a problem one person may prefer the abstract notion (e.g., immigration) while another may emphasize the human aspect (e.g., immigrants). To select terms used in this representation, we consider words occurring more than r times in the corpus. Terms appearing once or twice are not really useful to discriminate between 41 presidencies. Thus we impose a term frequency (denoted tf) larger than 30 or 50. Similarly, we ignore words used only by a few authors. In our experiments, we impose that each word must be used by at least 10 distinct authors (author frequency or af > 9). After applying these two constraints, we finally remove the words belonging to the set of the top 300 or top 500 MFW. The size of the used vocabulary is given in parentheses in the first column of Tab. 3.
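A minimal sketch of this filtering step, assuming the corpus is available as a mapping from author to token list (the data structure and helper name are mine, not the chapter's):

    from collections import Counter

    def content_bearing_vocab(corpus, min_tf=30, min_af=9, drop_mfw=300):
        # corpus: {author: [tokens]}; tf = term frequency, af = author frequency
        tf = Counter(tok for toks in corpus.values() for tok in toks)
        af = Counter()
        for toks in corpus.values():
            af.update(set(toks))                  # each author counted once per word
        mfw = {w for w, _ in tf.most_common(drop_mfw)}
        return {w for w, f in tf.items()
                if f > min_tf and af[w] > min_af and w not in mfw}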


Tab. 3: Performance using only content-bearing words (224 speeches)

                                   Delta        KLD          Labbé        Naïve Bayes
tf > 30, af > 9, -300 MFW (3,081)  193 (86.2%)  172 (76.8%)  205 (91.5%)  176 (78.6%)
tf > 30, af > 9, -500 MFW (2,881)  195 (87.1%)  162 (72.3%)  204 (91.1%)  174 (77.7%)
tf > 50, af > 9, -300 MFW (2,278)  194 (86.6%)  178 (79.5%)  204 (91.1%)  171 (76.3%)
tf > 50, af > 9, -500 MFW (2,078)  199 (88.8%)  166 (74.1%)  200 (89.3%)  171 (76.3%)

Based only on content-bearing terms, we achieve similar performance levels to when we considered only the top 300 or 500 MFW for the Delta and KLD schemes (see Tab. 1). The accuracy rate is clearly higher with Labbé's measure, while for the naïve Bayes the performance is a little bit lower. Finally, we can combine both the topical and stylistic representations and consider all possible words respecting a minimal term (tf) and author frequency (af). The result of such a combination is given in Tab. 4. Varying the values of the underlying parameters does not significantly modify the performance achieved by Labbé's model. The Delta and the naïve Bayes are more sensitive to this parameter setting, rendering these approaches less robust.

Tab. 4: Performance using both content-bearing terms and MFW (224 speeches)

                         Delta        KLD          Labbé        Naïve Bayes
tf > 5, af > 1 (8,524)   143 (63.8%)  199 (88.8%)  206 (92.0%)  209 (93.3%)
tf > 30, af > 9 (3,381)  197 (87.9%)  189 (84.4%)  203 (90.6%)  189 (84.4%)
tf > 40, af > 9 (2,928)  201 (89.7%)  188 (83.9%)  203 (90.6%)  187 (83.5%)
tf > 50, af > 9 (2,578)  200 (89.3%)  190 (84.8%)  203 (90.6%)  182 (81.3%)

When inspecting some incorrect attributions, we discover interesting facts. As a first example, we found that the first Johnson speech, uttered on January 8th, 1964, was always assigned to J. F. Kennedy. This incorrect attribution is indicated by all attribution schemes. Looking at the MFW, we see that they correspond more to J. F. Kennedy than to Johnson's style from 1965. External evidence tends to confirm that the attribution schemes provide the correct decision. In fact, J. F. Kennedy was assassinated on November 22nd, 1963, a little more than one month before the first State of the Union address delivered by L. Johnson. So we can formulate the hypothesis that the same ghostwriter was behind Kennedy's speeches and the first address uttered by L. Johnson. Moreover, we can also mention that this address is rather short (3,647 tokens) compared to the other Johnson speeches (mean 5,844 tokens).


As another example, we can inspect the first State of the Union address uttered by G. W. Bush, February 27th, 2001. All attribution schemes (except Delta based on MFW) propose B. Clinton as the most probable author. This first speech was delivered before the attacks of September 11th. After this date, Bush's speeches include topics and expressions related to terrorist, axis of evil, Iraq, Al Qaida, homeland security, etc. This set of topics was absent from the first speech, whose content reflects more closely the problems facing Clinton's administration. Thus the automatic system detects a difference between the first Bush speech and the rest, with the first more closely related to Clinton's style and topics. As a last example, we can analyze Ford's speech uttered in 1976. All attribution methods based on the MFW or on topical terms tend to assign this address to Reagan. Among the topical words, we observe, for example, more occurrences of United States, military, or executive than usually found in other Ford speeches. On the other hand, those relative frequencies are close to Reagan's profile. Moreover, from a stylistic point of view, Ford in 1976 uses, with a relative frequency close to Reagan's, the words shall or win, indicating a speech more oriented towards the future than other Ford speeches and corresponding more to Reagan's style.

5 Conclusion

The 224 State of the Union addresses form a pertinent corpus to study US history, to analyze the relationships between presidencies, or to test different authorship attribution models. In this latter case, we assume that the same author (or ghostwriter) was behind all speeches corresponding to a given presidency. It is a common belief that having more sources of evidence about the real author of a disputed text may enhance the performance of the attribution scheme. In this study, we view each speech as a compound signal corresponding to the author's style and his/her intents as reflected by the choice of topical words. To analyze this composite signal, we can first consider the top most frequent words (MFW), covering a large number of functional words (determiners, pronouns, prepositions, conjunctions, and some auxiliary or modal verbal forms). Based on this representation, the Delta model correctly classifies 200 speeches out of 224 (using 500 MFW), the KLD 197 (with 300 MFW), Labbé's measure 192 (500 MFW), and the naïve Bayes 196 (200 MFW). With other numbers of MFW, the resulting performance is slightly lower. As a second source of information, we can represent the authors' styles by their POS tags (with their morphological information). This approach does not provide as good an accuracy rate, ranging from 88 correct attributions with the


Delta and up to 154 with Labbé’s measure. Adding this signal to the MFW, the resulting performance is slightly lower than using only the MFW. As a third source of information, we can take account of topical words. In this case, we can ignore words appearing infrequently (e.g., less than 30 or 50 times in the corpus) or terms used only by a few authors. Using this source of evidence, the Delta scheme provides similar levels of performance as using MFW only, while Labbé’s measure can achieve higher levels (around 204 correct attributions compared to 192 with MFW only). For both the KLD and naïve Bayes, the resulting accuracy is lower than when using only the MFW. Combining topical terms and the top MFW, we can obtain a high accuracy rate with Labbé’s model, achieving 206 correct assignments over 224 (or 92%). This is the highest performance in all our experiments. Using the Delta or the KLD model, we achieve similar performance levels as when considering only topical terms, and slightly better than when using only MFW. With the naïve Bayes, the combination of MFW and topical terms does not enhance the performance compared to MFW only. Analyzing the attribution errors shows us interesting information about different presidents and their speeches. For example, the first State of the Union address uttered by L. Johnson was systematically assigned to J. F. Kennedy. External evidence tends to confirm that this assignment is correct.

Acknowledgments

This research was supported, in part, by the Swiss NSF under Grant #200021_149665/1.

References

Argamon, S., Koppel, M., Pennebaker, J. W., & Schler, J. (2009). Automatically Profiling the Author of an Anonymous Text. Communications of the ACM, 52(2), 119–123.
Burrows, J. F. (2002). Delta: A Measure of Stylistic Difference and a Guide to Likely Authorship. Literary and Linguistic Computing, 17(3), 267–287.
Carpenter, R. H., & Seltzer, R. V. (1970). On Nixon's Kennedy Style. Speaker & Gavel, 7(41).
Craig, H., & Kinney, A. F. (Eds.). (2009). Shakespeare, Computers, and the Mystery of Authorship. Cambridge: Cambridge University Press.
Furnas, G., Landauer, T. K., Gomez, L. M., & Dumais, S. T. (1987). The Vocabulary Problem in Human-System Communication. Communications of the ACM, 30(11), 964–971.


Hoffman, D. R., & Howard, A. D. (2006). Addressing the State of the Union: The Evolution and Impact of the President's Big Speech. Boulder, CO: Lynne Rienner Publ.
Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4), 453–475.
Juola, P. (2006). Authorship Attribution. Foundations and Trends in Information Retrieval, 1(3).
Koppel, M., Schler, J., & Argamon, S. (2009). Computational Methods in Authorship Attribution. Journal of the American Society for Information Science & Technology, 60(1), 9–26.
Labbé, D. (2007). Experiments on Authorship Attribution by Intertextual Distance in English. Journal of Quantitative Linguistics, 14(1), 33–80.
Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press.
Merriam, T. (1998). Heterogeneous Authorship in Early Shakespeare and the Problem of Henry V. Literary and Linguistic Computing, 13, 15–28.
Mitchell, T. M. (1997). Machine Learning. New York: McGraw-Hill.
Savoy, J. (2012). Authorship Attribution: A Comparative Study of Three Text Corpora and Three Languages. Journal of Quantitative Linguistics, 19(2), 132–161.
Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1), 1–47.
Toutanova, K., Klein, D., Manning, C., & Singer, Y. (2003). Feature-rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003 (pp. 252–259).
Zhao, Y., & Zobel, J. (2007). Entropy-Based Authorship Search in Large Document Collection. In Proceedings ECIR (pp. 381–392). Berlin: Springer (LNCS #4425).

Betsy Sneller
University of Pennsylvania; [email protected]

Using Rates of Change as a Diagnostic of Vowel Phonologization

Abstract: This paper examines the use of rate of change as a diagnostic for phonological category. Using large-scale speech corpora, researchers are able to track phonemic and allophonic splits and mergers in real time on the community level. The present paper demonstrates using a rate-of-change analysis to assess the phonological category of two possibly different vowels. I first demonstrate using a rate-of-change analysis on well-known vowel changes in the Philadelphia Neighborhood Corpus, and then use this method to diagnose vowel changes in the Origins of New Zealand English corpus.

Keywords: Language Change, Vowel Merger, Rate of Change

1 Introduction

The study of language change has benefitted greatly from the increasing availability of large-scale speech corpora. With large data sets of recorded speech, researchers are able to track the phonetic outputs of different sounds across time. Moreover, large-scale speech corpora allow more than simply the phonetic outputs to be measured; phonological inventories can also be seen to change over time. One type of inventory change can be found in vowel mergers and splits, which both result in a structural change to the inventory of vowels in a given language. On a more fine-grained level, we can also find evidence of subphonemic vowel splits giving rise to different allophones of a given phoneme. An important question for researchers of language change is when these category splits and mergers occur. The present paper addresses one method of analyzing vowel category in a diachronic speech corpus: comparing the rates of change, following Fruehwald (2013). Analyzing the rate of change of two potentially different phonological categories may serve as an important tool for determining the category of those two potentially different sounds. In this paper, I discuss using rate of change as one out of many useful methods for diagnosing phonological category, as well as the theory behind a rate of change analysis. I then show what a rate of change analysis looks like on several well-studied changes in Philadelphian English, before turning to some attested changes in New Zealand English that have not previously been analyzed using large-scale normalized vowel measurements. Before discussing the theory behind a rate of change analysis, I will first turn to some of the existing methods that researchers regularly use to determine phonological category.

1.1 Acoustic Space

One method of determining which category a vowel sound belongs to is to examine its distribution in acoustic space. For any given speaker, the expectation is that a single phoneme will be normally distributed around a mean value in F1–F2 space (see Fig. 1a). If the acoustic distribution of a vowel is bimodally distributed (see Fig. 1b), this suggests that this one vowel actually belongs to two categories.

Fig. 1: Normal distribution around a mean for phoneme /æ/ (a) and bimodal distribution around two distinct means for /æ/ (b)

Fig. 2: Split MAD and TRAP in Philadelphian English (a) and merged LOT and CAUGHT in Californian English (b)


TRAP in Philadelphian English provides an example of a vowel that has become two distinct categories, lax TRAP and tense MAD (Fig. 2a); examining the acoustic distribution of TRAP for a Philadelphian English speaker suggests that TRAP in Philadelphian English consists of two categories rather than one. Likewise, examining the acoustic distribution of two potentially different vowels may suggest that two vowel categories have merged into a single category. For many dialects of American English, the LOT and CAUGHT vowels have merged into a single vowel (see Fig. 2b) (Labov, Ash, & Boberg 2006). An analysis of the F1–F2 space of these vowels would show that they occupy the same region, with means that are not significantly different from each other.
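One way to operationalize this check, offered here only as a hedged illustration (the chapter does not prescribe a particular statistical test): fit one- and two-component Gaussian mixtures to a vowel's F1–F2 tokens and compare them by BIC, using scikit-learn:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def looks_bimodal(f1f2):
        # f1f2: (n_tokens, 2) array of F1/F2 measurements for one vowel
        one = GaussianMixture(n_components=1).fit(f1f2)
        two = GaussianMixture(n_components=2).fit(f1f2)
        return two.bic(f1f2) < one.bic(f1f2)  # lower BIC = better model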

1.2 Minimal Pairs

While examining acoustic space is a good first step to determining vowel category, it does not provide researchers with information about which categories exist in speakers' minds. For a vowel that looks like it may contain two categories, it's important to determine whether speakers think the two categories are the same or not. It's also important to determine whether the two different categories can change the meaning of a word (are phonemically different) or whether they are just two different realizations of the same phoneme (are allophonically different). One important supplement to the acoustic space method is examining the existence of minimal pairs amongst the vowels in question. If the acoustic distribution of a vowel suggests that there are two categories rather than one, the next step is to determine whether the two categories are phonemically distinct. If a set of minimal pairs can be found, then the categories are phonemic. If not, then it is likely that the two categories are allophonic.

1.3 “Same or Different” Task

A second important method of supplementing the acoustic space analysis is to ask speakers whether two vowels are the same or different. Occasionally, speakers will report two sounds as being the same (e.g. in the minimal pair COT–CAUGHT), even though their productions of the two vowels in acoustic space are reliably different (Labov et al. 1972). Alternatively, speakers may report thinking of two sounds differently even if they are produced without a reliable difference in acoustic space. These scenarios allow researchers to gain a deeper insight into the relationship between mental targets and production.


1.4 Effects of Duration

In trying to determine whether two vowel variants are phonologically different or phonetically different, Strycharczuk (2012) points out that differences created for purely phonetic reasons respond differently to changing speech rates or durations than do differences created by phonological categories. If a difference that may be a coarticulatory effect gets larger for shorter durations of a vowel, this is suggestive of a purely phonetic effect. If the difference does not get larger for shorter durations, it suggests that the difference was not caused by coarticulation but rather by the speaker aiming for two distinct mental targets.

1.5 Application of a Phonological Rule

The goal of this paper is to examine one additional tool for determining vowel category: examination of the application of a diachronic phonological rule (Fruehwald 2013). If a vowel is undergoing a diachronic change such as fronting or lowering, then it is possible to compare the application of that rule across two potentially different vowel categories. If the two potential categories respond in the same way to a diachronic phonological change, then this suggests that they are actually the same mental category for the speech community.

2 Using Rate of Change to Determine Phonological Category

In this section, I will outline the theory behind using rates of change to determine phonological category. The question at hand is whether two potentially different vowels are actually represented as different mental categories in speakers' minds. For the sake of exposition, I will use the vowel class GOOSE to demonstrate this theory, although it can be used to investigate any potentially related vowels. In Philadelphian English, GOOSE preceded by coronals, such as in the words dew, knew, Tuesday, surfaces as more front in the vowel space than GOOSE preceded by other consonants, such as in the words cool, movie, fool. Conveniently, both sounds have also been undergoing changes throughout the past century. I will refer to the post-coronal GOOSE as TOO and the elsewhere condition as GOOSE. The question at hand is whether the difference between TOO and GOOSE is caused by purely mechanical effects or whether these two categories exist in speakers' minds as distinct targets for production.


If the acoustic differences are simply due to mechanical effects such as coarticulation, then we can say that the difference between TOO and GOOSE is purely phonetic, and therefore that speakers do not have two different mental targets. If, on the other hand, speakers have two different mental targets, then we can say that the difference between those targets is phonological: the acoustic difference is caused by speakers actually intending to produce different sounds. It should be noted that a  rate of change investigation can only help determine whether a difference is phonetic or phonological. It does not, however, determine whether a phonological difference is phonemic or allophonic; additional tests are needed to tease this distinction apart.

2.1 Rate of Change Predictions for a Phonetic Difference

If the difference between TOO and GOOSE is due to phonetic coarticulation, then there is a single diachronic change that affects a single mental category in the minds of the speech community. Synchronically, there is an additional layer of coarticulation which causes each instance of TOO to surface as fronter than its GOOSE counterparts. Because there is only one underlying mental target for the diachronic change to target, measuring the rates of change for both TOO and GOOSE will result in the same graph. This is schematized in Fig. 3a and 3b.

Fig. 3: Schema of a phonological change affecting one category diachronically, with phonetic coarticulation causing two distinct acoustic outputs synchronically (a) and schema of the rate of change graph this scenario would produce (b)


2.2 Rate of Change Predictions for a Phonological Difference

Conversely, if the difference between TOO and GOOSE is due to phonological factors, this means that speakers have two different mental targets for TOO and GOOSE. Because a phonological rule can only target one phonological category at a time, this means that any change that TOO is undergoing is a different change than the one affecting GOOSE. This is represented schematically in Fig. 4a. I note that coarticulation is also a possible additional influence on the acoustic realization of TOO, though this added coarticulation does not have an effect on the rate of change investigation. Crucially, when two different changes take place, this opens up the possibility of two different rates of change surfacing as well (Fig. 4b).

Fig. 4: Schema of two phonological changes affecting two categories diachronically, causing two distinct acoustic outputs synchronically (a) and schema of the rate of change graph this scenario would produce (b)

It should be noted that if two variants have similar rates of change, this does not demonstrate that they are the same category: it is theoretically possible for two different changes to occur at the same rate. However, it is not possible for one change to occur at two different rates. If a rate of change investigation produces two different rates of change for two variants, this is a clear indication that two phonological categories are at play. One final caveat is that phonological changes can target phonological categories of different sizes. It has been shown that categories as large as “all back vowels preceding /r/” can be targeted by a phonological rule (Labov et al. 2013). This is a  phonological raising rule that targets a  phonological category larger


than a phoneme. Additionally, the category targeted by a phonological change can be as small as an allophone. In the case of TOO and GOOSE, if we measure two different rates of change, we can conclude that the phonological targets are two different allophones of a single phoneme. For these reasons, I will reiterate that a rate of change investigation is useful as one of many tools used to determine phonological category. Finally, this theory is necessarily based on diachronic data. In order to obtain enough tokens to use a rate of change investigation, researchers need access to large-scale diachronic corpora.

2.3 How to Measure ROC

In this section, I will explain how to measure the rate of change of a given phoneme, following Fruehwald (2013). Beginning with a diachronic database of vowel tokens that have been speaker-normalized using z-scores, I then obtained the locally weighted regression of these normalized tokens. It is the first derivative of the loess-smoothed data that gives us the rate of change. Because the loess predictor is calculated across the small span of one year, the first derivative of the predictor is very sensitive to small changes. As a result, the derivative is also smoothed in order to give a clearer picture of the community-level rates of change for each variant. A sketch of this pipeline is given below.

In analyzing a rate of change graph, it should be noted that the informative points on the graph are at zero. When a variant is positive, this means that diachronically it is moving in one direction. When it is negative, it is moving in the opposite direction. When a variant is at zero, it is considered stable diachronically. In using rates of change to determine phonological relatedness, it is the variants' status with respect to zero that is important to determine. This will be fleshed out in further detail in Sections 3 and 4.
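The following minimal sketch (Python; the smoothing parameters and variable names are my assumptions, not those of Fruehwald 2013) implements the pipeline just described: loess-smooth the normalized tokens over date of birth, differentiate, then smooth the derivative:

    import numpy as np
    import statsmodels.api as sm

    def rate_of_change(dob, f_z, frac=0.5):
        # dob: speaker dates of birth; f_z: z-score-normalized formant values
        fit = sm.nonparametric.lowess(f_z, dob, frac=frac)
        years, smoothed = fit[:, 0], fit[:, 1]
        grid = np.arange(years.min(), years.max() + 1)  # one-year steps
        on_grid = np.interp(grid, years, smoothed)
        deriv = np.gradient(on_grid, grid)              # first derivative
        # the raw derivative is sensitive to small changes, so smooth it too
        deriv_smooth = sm.nonparametric.lowess(deriv, grid, frac=frac)[:, 1]
        return grid, deriv_smooth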

3 Rate of Change Analysis of Philadelphian English

In this section, I will use a rate of change investigation to examine changes in Philadelphian English with respect to their phonological or phonetic status. This will allow readers to see what a rate of change investigation looks like for differences in variants that have been independently confirmed as being either phonetic or phonological differences.


The data for these sections come from the Philadelphia Neighborhood Corpus (PNC). The PNC is a corpus of sociolinguistic interviews carried out between the years 1972 and 2013, with investigations ongoing. In accordance with the Apparent Time hypothesis, the corpus contains over a century's worth of sound change, as it contains speakers with dates of birth ranging from 1890 to 1999. Interviews were transcribed using ELAN (Brugman & Russel 2004), and vowel measurements were taken using the FAVE suite (Rosenfelder et al. 2014). For a full analysis of the variables examined in this section, see Fruehwald (2013), which is the first use of a rate of change analysis on these variables from Philadelphian English. See also Labov, Rosenfelder and Fruehwald (2013).

3.1 Allophonic Split in PRICE

The first variable under consideration is the allophonic split in the PRICE vowel. In Philadelphian English, the nucleus of the diphthong in PRICE is raised before voiceless obstruents, resulting in [pɹʌɪs] for price, [fʌɪt] for fight, and [ɹʌɪt] for write. Before voiced obstruents, the nucleus of the diphthong in PRICE remains low, resulting in [haɪd] for hide, [ɹaɪd] for ride, and [pɹaɪz] for prize. This is a difference that has been analyzed as allophonic rather than coarticulatory by Fruehwald (2013), because the vowel raising responds to underlying segments rather than surface forms. This is demonstrated by the pair rider and writer: in these words, the phonetic environment surrounding the diphthong is identical [ɹ_ɾɚ]. However, even though the environment in both variants has been neutralized, speakers still produce a raised [ʌɪ] in writer and a low [aɪ] in rider. This shows that speakers are choosing /aɪ/ variants in response to the phonological environment rather than the phonetic environment. Therefore, RIDE and WRITE are analyzed as allophonically different. According to the rate of change theory laid out above, this means that there are two different mental targets and therefore there are possibly two different rates of change available.


Fig. 5: Normalized acoustic outputs of RIDE and WRITE vowel classes over time (a) and the rates of change for both variants over time (b)

Fig. 5a shows the normalized acoustic tokens, and Fig. 5b shows the first derivative of the smoothed acoustic tokens. Fig. 5b clearly shows that WRITE is always negative throughout the time course of the PNC, while RIDE is not different from zero. This shows that WRITE is always raising throughout the course of the PNC, while RIDE remains stably low. The rate of change analysis aligns with the other aspects of Fruehwald’s  analysis which show RIDE and WRITE to be allophonically distinct tokens.

3.2 Allophonic Difference in GOOSE – TOO

I will now turn to the example given above: GOOSE fronting as promoted by a preceding coronal. Again, Labov et al. (2013) analyze post-coronal words such as TOO as being allophonically distinct from the elsewhere GOOSE words, because of the large phonetic distance between TOO and GOOSE.

Fig. 6: Normalized acoustic outputs of TOO and GOOSE vowel classes over time (a) and the rates of change for both variants over time (b)


Fig. 6a and 6b show the normalized acoustic tokens, with a large gap of ~500 Hz between TOO and GOOSE, on the left, and the first derivative of these curves on the right. As the rate of change graph shows, TOO and GOOSE pattern together between the years 1890 and 1950. After this time, TOO splits from GOOSE and the two variants move in opposite directions. These results do not conflict with Labov et al.'s analysis of GOOSE and TOO as allophonically distinct: it is clear from the rates of change after 1950 that the two variants are two different mental targets after this time. Before 1950, it is less clear. It is possible that the two variants were a single mental target before 1950 which then split into two allophones, or alternatively it's also possible that these two allophones had coincidentally been undergoing two changes that happened to occur at the same rate. The only way of teasing these two options apart would be to turn to additional tests of phonological status such as perception experiments or a duration effect analysis.

3.3 Fronting of MOUTH, GOAT, GOOSE

The final change from the PNC that I will examine in this paper is the fronting of MOUTH, GOAT, and GOOSE. These are the three back rising diphthongs in American English. Fruehwald (2013) showed that these three diphthongs underwent a fronting process in Philadelphian English in the first half of the 20th century, which was then reversed in the second half of the 20th century. This is analyzed in Fruehwald (2013) as a single phonological process of fronting which affects [+back] diphthongs.

Fig. 7: Normalized acoustic outputs of MOUTH, GOAT, and GOOSE vowel classes over time (a) and the rates of change for all three variants over time (b)

Fig. 7a and 7b show the acoustic realization of the back rising diphthongs over time as well as their rate of change graphs, respectively. The rate of change graph


shows a clear indication that MOUTH, GOAT, and GOOSE are being affected by a  single phonological process for most of the time span investigated. Between 1890 and 1950, all three variants are positive, showing that they are fronting together. All three cross zero at the same time, showing that the single phonological rule governing back diphthong fronting is being reversed around the 1950s. All three variants are then negative for the rest of the timespan, showing that they are backing together. As Fruehwald (2013) points out, this similarity across variants in response to a single phonological rule such as fronting suggests that the phonological rule is targeting the larger category of back diphthongs, rather than targeting single phonemes.

4 Rate of Change Analysis of New Zealand English Changes

Having shown what a rate of change investigation looks like for known changes from the Philadelphia Neighborhood Corpus, I will now turn to some changes in New Zealand English for which large-scale normalized acoustic data have only recently been obtained. This section uses a rate of change investigation to examine some changes which have been previously analyzed for New Zealand English, but have not been subject to a rate of change investigation. The data under consideration come from the Origins of New Zealand English (ONZE) Corpus. The ONZE corpus comprises data very similar to the PNC: naturalistic conversational speech from 521 speakers, with recordings ranging from 30 minutes to an hour in duration. The ONZE corpus covers a larger time depth than the PNC, with dates of birth ranging from 1861 to 1987. Vowel measurements were also taken with the FAVE suite (Rosenfelder et al. 2014), which resulted in over a million normalized measurements of vowel tokens spanning a time course of just over a century. In this section, I will use the large-scale FAVE-output vowel measurements to conduct a rate of change investigation on several changes in New Zealand English.

4.1 TRAP, DRESS Raising

The first change that I will look at is the raising of TRAP and DRESS along the front periphery of the vowel space. This is a change that has been well documented (e.g. Gordon et al. 2004, Maclagan et al. 2008), but has not been subject


to a rate of change analysis. The traditional analysis is that TRAP and DRESS are undergoing a chain-shift of raising.

Fig. 8: Normalized acoustic outputs of TRAP and DRESS vowel classes over time (a) and the rates of change for both variants over time (b)

Fig. 8a and 8b show the acoustic data and the rate of change for both TRAP and DRESS over the course of the ONZE corpus. The rate of change analysis in 8b supports the claims by Gordon et al. (2004) and Maclagan et al. (2008) that TRAP and DRESS are undergoing a chain-shifting process. Furthermore, the similarity of the rate of change data suggests that the two phonemes have been the target of a single phonological process of raising that targets the low front vowels.

4.2 NEAR – SQUARE Merger

The second change from New Zealand English to be examined in this paper is the NEAR–SQUARE merger. This is a well-studied merger in progress on the community-wide level (Warren & Hay 2006), in which the nucleus of the SQUARE diphthong rises to the height of NEAR, followed by a period of lowering together.

Fig. 9: Normalized acoustic outputs of NEAR and SQUARE vowel classes over time (a) and the rates of change for both variants over time (b)


Fig. 9a and 9b show the acoustic realization of NEAR and SQUARE as well as the rate of change analysis. The rate of change graph shows that SQUARE begins negative, meaning that it is raising in the vowel space, while NEAR remains stable at zero. However, for speakers born after 1960, SQUARE and NEAR both become positive, suggesting that they are lowering together. In this case, the rate of change analysis supports the finding that NEAR and SQUARE were originally two distinct categories in New Zealand English which merged into a single category for speakers born after 1960. It is this new single category that is targeted by the lowering change in the second half of the 20th century.

4.3 FEW – GOOSE Allophonic Split

The final change that I will turn to is an allophonic split in the GOOSE vowel (Seyfarth & Sneller 2014). Seyfarth and Sneller found that /u/ following /j/, as in the words FEW and MUSIC, underwent an allophonic split from the rest of the GOOSE class beginning with speakers born around the 1930s.

Fig. 10: Normalized acoustic outputs of GOOSE and FEW vowel classes over time (a) and the rates of change for both variants over time (b)

Fig. 10a and 10b show the acoustic realization of GOOSE and FEW, as well as the rate of change analysis of both variants. The acoustic data in 10a suggest that the two variants were considered a single category until around the 1930s, when we see them diverging in acoustic space. In the rate of change graph in 10b, the story is very similar. Before 1930, GOOSE and FEW are both positive, moving in the same direction. However, after 1930 GOOSE becomes negative while FEW remains relatively stable. Returning to the theory laid out in Section 2, this could mean one of two things. It could be that GOOSE and FEW were two different categories all along, which were undergoing two different changes that happened to look very similar until 1930. Alternatively, GOOSE and FEW could


have belonged to a single mental category before 1930 and then split into two distinct allophones after this time. Seyfarth and Sneller found that external tests of the effect of duration supported the second analysis: the single allophone GOOSE split into two distinct allophones (FEW and GOOSE) after the 1930s. In this case, the rate of change investigation serves as an important tool to bolster the analysis of allophonic split.

5 Conclusions

As has been shown in the examples above, a rate of change investigation can be a helpful tool in determining whether a possible vowel difference is purely phonetic or whether it is phonological. If two potentially different vowels have two distinct rates of change, then there must be two categories involved, because phonological changes can only target a single category at a time. Finally, this method of phonological analysis has been shown to be most useful when it is paired with additional category-defining tests. As researchers begin to access larger and more complete databases of naturalistic speech, a rate of change analysis will be possible for more dialects and languages. This, in turn, should result in a richer base of empirical knowledge about category changes over the time course of a language.

References

Brugman, H., & Russel, A. (2004). Annotating Multimedia / Multi-modal Resources with ELAN. In Proceedings of LREC 2004. Nijmegen, The Netherlands: Max Planck Institute for Psycholinguistics. Retrieved from http://tla.mpi.nl/tools/tla-tools/elan/
Fruehwald, J. (2013). Phonological Involvement in Phonetic Change (Ph.D. thesis). Philadelphia: University of Pennsylvania.
Gordon, E., Campbell, L., Hay, J., Maclagan, M., Sudbury, A., & Trudgill, P. (2004). New Zealand English: Its Origins and Evolution. Cambridge: Cambridge University Press.
Kenney, M. (2004). Prince Charles has two ears/heirs: Semantic Ambiguity and the Merger of NEAR and SQUARE in New Zealand English. New Zealand English Journal, 18, 13–23.
Labov, W., Yaeger, M., & Steiner, R. (1972). A Quantitative Study of Sound Change in Progress. Philadelphia: U.S. Regional Survey.
Labov, W., Ash, S., & Boberg, C. (2006). The Atlas of North American English. New York: Mouton de Gruyter.
Labov, W., Rosenfelder, I., & Fruehwald, J. (2013). One Hundred Years of Sound Change in Philadelphia: Linear Incrementation, Reversal, and Reanalysis. Language, 89(1), 30–65.
Maclagan, M., & Gordon, E. (2008). New Zealand English. Edinburgh: Edinburgh University Press.


Origins of New Zealand English Corpus. (1944–2002). Data compiled by Principal Investigator Jen Hay.
Philadelphia Neighborhood Corpus of LING560 Studies. (1972–2013). With support of NSF contract 921643 to W. Labov.
Rosenfelder, I., Fruehwald, J., Evanini, K., Seyfarth, S., Gorman, K., Prichard, H., & Yuan, J. (2014). FAVE (Forced Alignment and Vowel Extraction) Program Suite. Retrieved from http://fave.ling.upenn.edu
Seyfarth, S., & Sneller, B. (2014). Diachronic Evidence for Phonological Reanalysis in New Zealand English /u/-fronting. Paper presented at the 38th annual Penn Linguistics Conference, Philadelphia, March 2014.
Strycharczuk, P. (2012). Phonetics-Phonology Interactions in Pre-Sonorant Voicing (Ph.D. thesis). University of Manchester.
Warren, P., & Hay, J. (2006). Using Sound Change to Explore the Mental Lexicon. In C. Fletcher-Flinn & G. Haberman (Eds.), Cognition and Language: Perspectives from New Zealand (pp. 105–125). Bowen Hills, Queensland: Australian Academic Press.

Petra C. Steiner
Universität Hildesheim; [email protected]

Diversification in the Noun Inflection of Old English

Abstract: This study investigates the frequency distributions of the inflectional affixes of the Old English noun paradigms. Counting and sorting these affixes and word forms leads to unordered integer partitions, which show typical diversification effects. Moreover, the distribution of the numbers of different affixes across the inflectional paradigms can be deduced from an urn model. For these frequency classes, the 3-displaced Hyperpoisson distribution yields excellent fits. The investigation shows results which are similar to former investigations on German and Icelandic noun paradigms.

Keywords: Diversification, Old English, integer partitions, inflectional paradigms

1 Introduction

Within quantitative linguistics, diversification has become a widely investigated and well-explained phenomenon on different linguistic levels (see Altmann 1985, Rothe 1991a). Concerning inflectional morphology, however, research was rather restricted for a longer period (see Rothe 1991b), although this field has two major advantages.

First, inflection can be regarded from the two sides of parole and langue, or use and system. Currently, the term text corpus is certainly the first that comes into a linguist's mind if the buzzword empirical investigation is mentioned. For investigations on text corpora, inflectional affixes can be defined, classified and quantified, and hypotheses on distributions derived and corroborated. However, it is also a rewarding idea to have a look at the system. In this respect, inflectional morphemes are usually arranged in syntagmatic and paradigmatic relations. Their functional valeur (de Saussure 1974: 111f) is determined by their opposition to other inflectional affixes of the same inflectional paradigms. The investigation of inflectional patterns can therefore give insights not just into language use but also into the system of language.

Second, the inventory of the means for inflection in synthetic languages is relatively small compared to the means of derivation. For most Indo-European languages, the set of inflectional morphs can be listed on one page, while derivational affixes usually fill more space. If the number of inflectional morphs is restricted, the number of paradigms is as well. Additionally, the distributions of the inflectional morphs within inflectional paradigms are even more limited due to the principles of diversification. Evidence for this was found in German and Icelandic noun inflection (Steiner & Prün 2007; Steiner 2009). It could be shown that typical distributions occur, with a small number of forms occurring often and many forms occurring rarely, leading to distributions with steep beginnings and long tails. Hence, investigations of inflectional morphology in quantitative linguistics can provide interesting insights into the language system; and they are feasible, because the data can be easily obtained. Hypotheses and methods can be transferred from other usage-oriented investigations.

This article investigates the inflectional paradigms of Old English with regard to the frequency distributions of their inflectional affixes. The following section provides the information on Old English noun inflection which is relevant for the further investigation. Then linguistic hypotheses are derived for combinatorial patterns of inflectional suffixes and distributions of complexity classes of inflectional paradigms. After the description of the data, two statistical hypotheses are tested. The short discussion sketches an outline of prospective projects in the field of inflection.

2 Diversification Effects in the Inflection of Old English Nouns

Nouns in Modern English are generally not morphologically case-marked, although there are some residuals in personal pronouns. The free word order of Old English is opposed to the relatively fixed word order of Modern English. From a functional viewpoint, this fixed word order is strongly connected to the case system of the first language period, while the means of linking semantic case information to its formal expression changed in the development towards Modern English. There are five different cases in Old English: nominative, accusative, genitive, dative and instrumental.1

1 See Lass (1994: 126f) on the development of Old English case, number, and gender from Indo-European.


In general, the relation of inflectional suffixes and their attributed case is far from being a one-to-one relation. Table 1 presents an example. Tab. 1: Inflectional paradigm of the Old English noun cyning (see Quirk & Wrenn 1993:20) se cyninġ (the king) singular

plural

nom

se cyninġ

ϸā cyninġas

acc

ϸone cyninġ

ϸā cyninġas

gen

ϸæs cyninġes

ϸāra cyninġa

dat

ϸǣm cyninġe

ϸǣm cyninġum

instr

ϸӯ cyninġe

ϸǣm cyninġum

The instrumental form for nouns is included in this 10-slot paradigm of 10 word-form cells, although it disappeared during the period of Old English and does not differ from the dative in its form. However, its existence is evident in combination with adjectives, which have different endings. The paradigm above shows six different endings, including a zero morph. It is a typical pattern of the 52 different inflectional paradigms which have been compiled for this investigation (see Table 10, Appendix).2 There is no Old English noun paradigm that comprises more than six different affixes (Baker 2007: 35) and there is no case in which the wealth of inflectional endings is completely exploited. Lass infers:

The moral is that there is a great difference between what a language has and what it does with it; this should make one suspicious of any kind of facile argument suggesting that changes are 'caused' by the growth of morphological ambiguity (Lass 1994: 139).

For inflectional affixes, the capacity of what a language "does" is to indicate various (different) combinations of grammatical properties. This is usually referred to as case syncretism (see Jakobson 1971: 67; Welte 1987: 16) and is defined as "homonymy of inflection markers in some grammatical domain" (Müller, Gunkel & Zifonun 2004: 6). Beyond this, morphological cases have more than one function; e.g., the genitive in Old English can be possessive, partitive or descriptive (Baker 2007: 37). Similar observations can be made for Icelandic and German (see Steiner 2009; Steiner & Prün 2007).

2 I would like to thank my colleague Theresa Wannisch (University of Rostock) for her assistance in classification decisions. All remaining inconsistencies are mine.


These kinds of phenomena are referred to as diversification effects. As human capacities of memory and time for retrieval are restricted, it is necessary to reduce the inventory of linguistic items. This leads to ambiguity in lexical and grammatical forms. At the hypothetical extreme point of such a process, this would lead to one form with many (grammatical) meanings, the manifestation of semantic diversification. However, such a language would be hard to decipher and will not develop in natural contexts. The reason for this is that there exists an antagonistic process, which is referred to as unification. On the level of meaning, semantic unification is the tendency to minimize the effort of disambiguation. At the hypothetical extreme point of such a process, this would lead to forms with just one (grammatical) meaning. As described above, formal unification, the reduction of forms, comes along with semantic diversification. As both tendencies are operating on the level of both form and meaning, a state of equilibrium is maintained and characteristic distributions of frequencies will occur. Only a few forms are used often, while the rest of the inventory is used with a low frequency. This equilibrium manifests itself not only in grammatical affixes, but also in derivational morphemes, lexemes, and other linguistic forms (Rothe 1991c). If we have a look at Table 2, we can find the following pattern of inflectional markers.

Tab. 2: Inflectional suffixes of the Old English noun cyninġ

       singular  plural
nom    -         -as
acc    -         -as
gen    -es       -a
dat    -e        -um
instr  -e        -um

Sorting and counting these morphs, we arrive at the following frequency distribution: 2 -, 2 -e, 2 -as, 2 -um, 1 -es, 1 -a. This can be considered a so-called unordered integer partition of 10 with 10 = 2 + 2 + 2 + 2 + 1 + 1. Such a distribution is in agreement with the assumption that both unification and diversification operate in inflection. On the basis of these general considerations, some linguistic hypotheses will be specified in the next section.
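For concreteness, a minimal sketch of this counting step (Python; the affix list is read off Tab. 2, with the empty string standing for the zero morph):

    from collections import Counter

    # the ten affix slots of the cyning paradigm (Tab. 2); "" = zero morph
    affixes = ["", "", "es", "e", "e", "as", "as", "a", "um", "um"]
    partition = sorted(Counter(affixes).values(), reverse=True)
    print(partition)       # [2, 2, 2, 2, 1, 1], i.e. 10 = 2+2+2+2+1+1
    print(len(partition))  # 6 different affixes (the measure c1 of Section 4)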

3 Hypotheses

Effects of diversification processes in inflectional morphology could be attested for Modern Standard German and Icelandic (Steiner 2009; Steiner & Prün 2007).


Therefore, similar results can be expected for Old English, which is an inflectional language too. Two hypotheses, (a) and (b), can be made regarding diversification effects on inflectional paradigms.

a) Within a range of ten inflected word forms, 42 different combinations of unordered integer partitions are possible. However, it can be expected that the Zipfian forces of unification and diversification have their effects at both ends of this distribution: unification in its extreme form would lead to a 1:1 relation of form and grammatical meaning, resulting in a different affix for each cell within the inflectional paradigm. On the other hand, diversification would lead to one single affix for all cases, resulting in no marking at all, because such an affix would be redundant. As both tendencies work within inflectional paradigms, there are a few (and often short) inflectional affixes used with high frequency, while some (and often longer) affixes are used with relatively low frequency. These tendencies have to show clearly within the sets of unordered integer partitions: complete syncretism and complete diversity will have very low frequencies.

b) The distribution of the numbers of different affixes in the inflectional paradigms of English with n forms within a specific inflectional paradigm can be considered as an urn model (see Figure 1). Each urn represents one frequency class of inflectional paradigms.

Fig. 1: Number of different inflectional suffixes as urn model

The frequency of urns with four balls is proportional to the frequency of those with three elements. However, this proportion might not be stable if urns with five or six elements occurred too. The same holds in general for the frequencies of the following sizes (see Wimmer et al. 1994: 101). This can be formalized as

Px ∝ Px−1   (1)

i.e. the probability (frequency) of the xth class is proportional to that of the (x−1)st. As this relation changes with the number of elements (inflectional paradigms), the proportionality is not constant but must be expressed by a simple function

Px = g(x)Px−1   (2)

where x is the number of affix tokens and g(x) the variable proportion between adjacent classes (see Best 2005a: 256ff; Best 2005b: 262ff; Wimmer & Altmann 1996: 112). Inserting simple functions in (2) which are linguistically well interpretable, one can obtain for instance the following distributions:

a. the Poisson distribution for g(x) = a / x,
b. the negative binomial distribution for g(x) = (a + bx) / cx,
c. the Hyperpoisson distribution for g(x) = a / (c + x).   (3)

Wimmer & Altmann (1996) present an overview of these and other functions of the form Px = g(x)Px−1. The respective distributions can be fitted to ample language data (e.g. Altmann & Best 1996; Altmann et al. 1996; Best 1996; Best & Brynjólfson 1997). Similar investigations, although not with the same derivations, were carried out by Čebanov (1947) and Fucks (1955, 1956). While the data of the above-mentioned articles consist mainly of word length distributions, similar models can be postulated for frequency distributions of inflectional affixes within paradigms. This is not only because of the derivation from urn models, but also because the frequency of many linguistic units (e.g. morph, word form) is strongly connected with their length (see Köhler 1986: 69f; Krott 2002: 92ff). The number of different affixes is a measure of complexity for inflectional paradigms. The same assumptions hold for other measures such as the number of different word forms, which comprise stem alternations and phonological variants.

4 Data

Table 10 (in the Appendix) shows an overview of the full paradigms of Old English noun inflection. The table was compiled on the basis of the information found in the historical grammars of Baker (2007), Brunner (1965), Campbell (1987), and Quirk & Wrenn (1987). Gender and stem alternations are considered distinctive features. For investigations of inflectional variants, not only are the types of inflectional suffixes with n forms relevant, but so are stem alternations such as umlaut or vocalization. Therefore, three different measures of the complexity of inflection are defined, which are listed in Table 11:

– f1: the number of different word forms of a 10-slot paradigm;
– c1: the number of different affixes of a 10-slot paradigm;
– c2: the sum of c1 and the number of stem alternations (such as umlaut, elision or epenthesis) of a 10-slot paradigm.

For example, the inflectional paradigm and the complexity values for sceōh (shoe) are summarized in Tables 3 and 4.


Tab. 3: Inflectional paradigm of the Old English noun sceōh (see Campbell 1987: 225ff, Quirk & Wrenn 1993: 21)

se sceōh (the shoe)

        singular    plural
nom.    se sceōh    ϸā sceōs
acc.    ϸone sceōh  ϸā sceōs
gen.    ϸæs sceōs   ϸāra sceōna
dat.    ϸǣm sceō    ϸǣm sceōm
instr.  ϸӯ sceō     ϸǣm sceōm

Tab. 4: Complexity measures for the inflectional paradigm of sceōh (shoe) (paradigm 24 in Table 11)

    Ns  As  Gs  Ds  Is  Np  Ap  Gp   Dp  Ip  uml etc    f1  c1  c2  p1
24  -   -   -s  -   -   -s  -s  -na  -m  -m  h-elision  5   4   5   (4 3 2 1)

The inflectional paradigm has five different word forms (f1). The number of inflectional affixes, however, is just 4 (c1). As the inflectional paradigm has one feature of alternation (h-elision), this leads to the value of 5 for c2. For the investigation of hypothesis (a), the unordered integer partitions of inflectional affixes of the 10-slot paradigm are listed in p1. For instance, cyninġ (see Table 1) has the inflectional paradigm 15 (see Table 10) with the integer partitions (2 2 2 2 1 1) (see Table 11). The inflectional paradigm of sceōh leads to p1 = (4 3 2 1).

5 Tests of the Hypotheses

5.1 Distributions of Partitions

The counts of all possible combinations of unordered partitions of inflectional affixes lead to the picture presented in Table 5. As observed for the inflectional paradigms of German and Icelandic (see Steiner & Prün 2007, Steiner 2009), larger integers (10, 9, 8) do not exist. Also, no paradigm has equal-sized integer partitions such as (2 2 2 2 2) or (5 5). The tendency is clearly one of frequency distributions which are characteristic for diversification: only 10 of 42 possible combinations exist for the 52 paradigms.


Tab. 5: Combinations of affix types and their frequencies

Partitions of affixes                      Frequency
10                                         0
9 + 1                                      0
8 + 2                                      0
8 + 1 + 1                                  0
7 + 3                                      0
7 + 2 + 1                                  3
7 + 1 + 1 + 1                              0
6 + 4                                      0
6 + 3 + 1                                  0
6 + 2 + 2                                  4
6 + 2 + 1 + 1                              8
6 + 1 + 1 + 1 + 1                          0
5 + 5                                      0
5 + 4 + 1                                  0
5 + 3 + 2                                  0
5 + 3 + 1 + 1                              0
5 + 2 + 2 + 1                              5
5 + 2 + 1 + 1 + 1                          0
5 + 1 + 1 + 1 + 1 + 1                      0
4 + 4 + 2                                  0
4 + 4 + 1 + 1                              0
4 + 3 + 3                                  0
4 + 3 + 2 + 1                              5
4 + 3 + 1 + 1 + 1                          0
4 + 2 + 2 + 2                              1
4 + 2 + 2 + 1 + 1                          13
4 + 2 + 1 + 1 + 1 + 1                      0
4 + 1 + 1 + 1 + 1 + 1 + 1                  0
3 + 3 + 3 + 1                              0
3 + 3 + 2 + 2                              2
3 + 3 + 2 + 1 + 1                          0
3 + 3 + 1 + 1 + 1 + 1                      0
3 + 2 + 2 + 2 + 1                          1
3 + 2 + 2 + 1 + 1 + 1                      0
3 + 2 + 1 + 1 + 1 + 1 + 1                  0
3 + 1 + 1 + 1 + 1 + 1 + 1 + 1              0
2 + 2 + 2 + 2 + 2                          0
2 + 2 + 2 + 2 + 1 + 1                      10
2 + 2 + 2 + 1 + 1 + 1 + 1                  0
2 + 2 + 1 + 1 + 1 + 1 + 1 + 1              0
2 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1          0
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1      0
∑                                          52

The picture is one which is typical for diversification, with distributions of a strong gradient for the first part and a long tail for the second. Note that the four paradigms with the partition (6 + 2 + 2) are “minor” paradigms, which during the period of Old English partially shifted to other inflections (see Adamczyk 2011), leading to the more characteristic distribution of (2 2 2 2 1 1).

5.2 Distributions of the Complexity of Inflectional Paradigms

The counts for c1 and c2 are presented in Tables 6 and 7. The frequency classes for f1 are presented in Table 8. The graphs are depicted in Figures 2, 3, and 4.

Tab. 6: Frequency classes of complexity c1 in Old English noun inflection paradigms

c1    f(c1)    NP(c1)
3     7        6.78
4     21       20.33
5     14       15.48
6     10       9.41

a = 1.02, b = 0.34, X² = 0.21, DF = 1, P = 0.65


Tab. 7: Frequency classes of complexity c2 in Old English noun inflection paradigms

c2    f(c2)    NP(c2)
3     7        5.48
4     12       14.91
5     18       15.31
6     8        9.68
7     7        6.62

a = 1.65, b = 0.61, X² = 1.78, DF = 2, P = 0.41

Fig. 2: Frequency classes of complexity c1 in Old English noun inflection paradigms

Fig. 3: Frequency classes of complexity c2 in Old English noun inflection paradigms

For testing hypothesis (b), the Altmann-Fitter (2000) was used. For c1 and c2, the 3-displaced Hyperpoisson distribution

P_x = a^(x−3) / (b^((x−3)) · 1F1(1; b; a)),   x = 3, 4, 5, …    (4)

where b^((x−3)) = b(b + 1) ⋯ (b + x − 4) and 1F1(1; b; a) is the confluent hypergeometric function of the first kind, i.e.

1F1(1; b; a) = 1 + a/b + a²/(b(b + 1)) + …    (5)

yields an excellent fit. For f1, the Hyperpoisson distribution yields good fits too.

Tab. 8: Frequency classes of complexity f1 in Old English noun inflection paradigms

f1    f(f1)    NP(f1)
3     6        5.40
4     20       18.01
5     12       16.13
6     14       12.46

a = 1.22, b = 0.37, X² = 1.53, DF = 1, P = 0.22

Fig. 4: Frequency classes of complexity f1 in Old English noun inflection paradigms

According to the results of the chi-square test, the hypothesis that the frequencies of complexity show the effect of diversification can be accepted for all measures of complexity.
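The fits can be reproduced without the Altmann-Fitter. The following sketch evaluates formula (4) directly (using scipy; the tail of the distribution is pooled into the last frequency class, which is how the expected value 9.41 in Table 6 arises):

    import numpy as np
    from scipy.stats import chi2

    def hyperpoisson_pmf(x, a, b, shift=3, terms=60):
        """P_x of Eq. (4): a^(x-shift) / (b^((x-shift)) * 1F1(1; b; a))."""
        def rising(b, n):                    # b^((n)) = b(b+1)...(b+n-1)
            return float(np.prod(b + np.arange(n))) if n > 0 else 1.0
        norm = sum(a ** n / rising(b, n) for n in range(terms))  # 1F1(1; b; a)
        n = x - shift
        return a ** n / (rising(b, n) * norm)

    # c1 data from Tab. 6: observed frequencies and fitted parameters.
    obs = np.array([7, 21, 14, 10])
    a, b = 1.02, 0.34
    exp = 52 * np.array([hyperpoisson_pmf(x, a, b) for x in (3, 4, 5)])
    exp = np.append(exp, 52 - exp.sum())   # pool the tail: 6.78, 20.33, 15.48, 9.41
    x2 = ((obs - exp) ** 2 / exp).sum()    # 0.21, as in Tab. 6
    print(exp.round(2), round(x2, 2), round(1 - chi2.cdf(x2, df=1), 2))  # P = 0.65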

6 Discussion

Both the distributions of frequencies within noun paradigms in Old English and the frequency distributions of complexity measures of the set of these inflectional paradigms show typical diversification effects. Therefore all hypotheses can be


corroborated. These results blend in well with previous investigations for German and Icelandic. To give an impression of the respective diversification effects within these languages, Table 9 shows the partitions of the Icelandic inflectional affixes for the 91 8-slot noun paradigms.

Tab. 9: Combinations of Icelandic affix types and their frequencies (Steiner 2009, 133)

Partition of affixes            Frequency
8                               0
7 + 1                           0
6 + 2                           0
6 + 1 + 1                       1
5 + 3                           0
5 + 2 + 1                       0
5 + 1 + 1 + 1                   10
4 + 4                           0
4 + 3 + 1                       0
4 + 2 + 2                       0
4 + 2 + 1 + 1                   7
4 + 1 + 1 + 1 + 1               7
3 + 3 + 2                       0
3 + 3 + 1 + 1                   10
3 + 2 + 2 + 1                   4
3 + 2 + 1 + 1 + 1               15
3 + 1 + 1 + 1 + 1 + 1           4
2 + 2 + 2 + 2                   0
2 + 2 + 2 + 1 + 1               0
2 + 2 + 1 + 1 + 1 + 1           13
2 + 1 + 1 + 1 + 1 + 1 + 1       20
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1   0
∑                               91

Diversification should also be noticeable in non-Indo-European language types, insofar as these use morphological means for the expression of grammatical meaning. It would be worthwhile to conduct systematic and comparable investigations in this field. In general, it would be of interest to analyze the effects of diversification depending on the number of lexemes to which a certain paradigm applies.


Also the frequency of use is an important factor for either preserving or losing forms (see Mańczak 2005: 616f). The shortening of forms is highly dependent on their frequency (see Mańczak 2005: 614ff; Zipf 1965: 172−176) and it will be easy to demonstrate this effect for the frequency of inflectional suffixes within the noun paradigms of Old English. In general, as morph frequency is connected to morph length, the frequency distributions of both properties can be postulated to be similar. This well-corroborated hypothesis was used as an additional foundation for hypothesis (b) in Section 3. However, concrete examinations of inflectional paradigms are still to be carried out. In general, being both products of parole and units of systems, inflectional paradigms with their forms yield a wealth of reliable data for many investigations.

References

Adamczyk, E. (2011). Morphological Reanalysis and the Old English u-Declension. Anglia, 128(3), 365–390.
Altmann, G. (1985). Semantische Diversifikation. Folia Linguistica, 19, 177–200.
Altmann-Fitter. (2000). Altmann-Fitter 2.1 for Windows 95 / NT. Lüdenscheid: RAM-Verlag.
Altmann, G., & Best, K.-H. (1996). Zur Länge der Wörter in deutschen Texten. In P. Schmidt (Ed.), Glottometrika 15: Issues in General Linguistic Theory and the Theory of Word Length (pp. 166–180). Trier: WVT Wissenschaftlicher Verlag Trier.
Altmann, G., Erat, E., & Hřebíček, L. (1996). Word Length Distribution in Turkish Texts. In P. Schmidt (Ed.), Glottometrika 15: Issues in General Linguistic Theory and the Theory of Word Length (pp. 195–204). Trier: WVT Wissenschaftlicher Verlag Trier.
Baker, P. S. (2007). Introduction to Old English (2nd ed.). Malden, MA / Oxford / Carlton, Victoria: Blackwell Publishing.
Best, K.-H. (1996). Zur Wortlängenhäufigkeit in schwedischen Pressetexten. In P. Schmidt (Ed.), Glottometrika 15: Issues in General Linguistic Theory and the Theory of Word Length (pp. 147–157). Trier: WVT Wissenschaftlicher Verlag Trier.
Best, K.-H. (2005a). Morphlänge. In R. Köhler, G. Altmann & R. G. Piotrowski (Eds.), Quantitative Linguistik / Quantitative Linguistics: Ein internationales Handbuch / An International Handbook (pp. 255–260). Berlin / New York: de Gruyter. (Handbücher zur Sprach- und Kommunikationswissenschaft / Handbooks of Linguistics and Communicative Science, 27)
Best, K.-H. (2005b). Wortlänge. In R. Köhler, G. Altmann & R. G. Piotrowski (Eds.), Quantitative Linguistik / Quantitative Linguistics: Ein internationales Handbuch / An International Handbook (pp. 260–273). Berlin / New York: de Gruyter. (Handbücher zur Sprach- und Kommunikationswissenschaft / Handbooks of Linguistics and Communicative Science, 27)
Best, K.-H., & Brynjólfsson, E. (1997). Wortlängen in isländischen Briefen und Pressetexten. Skandinavistik: Zeitschrift für Sprache, Literatur und Kultur der nordischen Länder, 72(1), 24–40.


Brunner, K. (1965). Altenglische Grammatik. Nach der Angelsächsischen Grammatik von Eduard Sievers (3rd ed.). Tübingen: Niemeyer. (Sammlung kurzer Grammatiken germanischer Dialekte, A.3)
Campbell, A. (1987). Old English Grammar. Oxford: Clarendon Press.
Čebanov, S. G. (1947). O podčinenii rečevych ukladov ‘indoevropejskoj’ gruppy zakonu Puassona. Doklady Akademii Nauk SSSR, 55(2), 103–106.
Fucks, W. (1955). Mathematische Analyse von Sprachelementen, Sprachstil und Sprachen. Köln: Westdeutscher Verlag. (Arbeitsgemeinschaft für Forschung des Landes Nordrhein-Westfalen, 34a)
Fucks, W. (1956). Die mathematischen Gesetze der Bildung von Sprachelementen aus ihren Bestandteilen. Nachrichtentechnische Forschungsberichte, 3, 7–21.
Jakobson, R. (1971). Beitrag zur allgemeinen Kasuslehre – Gesamtbedeutungen der russischen Kasus. In R. Jakobson, Selected Writings 2: Word and Language (pp. 23–71). The Hague: Mouton.
Köhler, R. (1986). Zur linguistischen Synergetik: Struktur und Dynamik der Lexik. Bochum: Brockmeyer. (Quantitative Linguistics, 31)
Krott, A. (2002). Ein funktionalanalytisches Modell der Wortbildung. In R. Köhler (Ed.), Korpuslinguistische Untersuchungen zur quantitativen und systemtheoretischen Linguistik (pp. 75–126). Retrieved from http://ubt.opus.hbz-nrw.de/volltexte/2004/279/pdf/04_krott.pdf
Lass, R. (1994). Old English: A Historical Linguistic Companion. Cambridge: Cambridge University Press.
Mańczak, W. (2005). Diachronie: Grammatik. In R. Köhler, G. Altmann & R. G. Piotrowski (Eds.), Quantitative Linguistik / Quantitative Linguistics: Ein internationales Handbuch / An International Handbook (pp. 605–627). Berlin / New York: de Gruyter. (Handbücher zur Sprach- und Kommunikationswissenschaft / Handbooks of Linguistics and Communicative Science, 27)
Müller, G., Gunkel, L., & Zifonun, G. (2004). Introduction. In G. Müller, L. Gunkel & G. Zifonun (Eds.), Explorations in Nominal Inflection (pp. 1–20). Berlin / New York: de Gruyter. (Interface Explorations, 10)
Rothe, U. (1991a). Diversification of the Case in German: Genitive. In U. Rothe (Ed.), Diversification Processes in Language: Grammar (pp. 140–156). Hagen: Rottmann.
Rothe, U. (1991b). Diversification Processes in Grammar. An Introduction. In U. Rothe (Ed.), Diversification Processes in Language: Grammar (pp. 3–32). Hagen: Rottmann.
Rothe, U. (Ed.). (1991c). Diversification Processes in Language: Grammar. Hagen: Rottmann.
Quirk, R., & Wrenn, C. L. (1993). An Old English Grammar (2nd ed.). London: Routledge.
Saussure, F. de. (1974). Course in General Linguistics. C. Bally & A. Sechehaye (Eds.). London: Owen.
Steiner, P. (2009). Diversification in Icelandic Inflectional Paradigms. In R. Köhler (Ed.), Issues in Quantitative Linguistics (pp. 126–154). Lüdenscheid: RAM-Verlag. (Studies in Quantitative Linguistics, 5)
Steiner, P., & Prün, C. (2007). The Effects of Diversification and Unification on the Inflectional Paradigms of German Nouns. In P. Grzybek & R. Köhler (Eds.), Exact Methods in the Study of Language and Text: dedicated to Professor Gabriel Altmann on the Occasion of his 75th Birthday (pp. 623–631). Berlin: de Gruyter.
Welte, W. (1987). On the Concept of Case in Traditional Grammars. In R. Dirven & R. Günter (Eds.), Concepts of Case. Tübingen: Narr. (Studien zur englischen Grammatik, 4)


Wimmer, G., & Altmann, G. (1996). The Theory of Word Length: Some Results and Generalizations. In P. Schmidt (Ed.), Glottometrika 15: Issues in General Linguistic Theory and the Theory of Word Length (pp. 112–133). Trier: WVT Wissenschaftlicher Verlag Trier.
Wimmer, G., Köhler, R., Grotjahn, R., & Altmann, G. (1994). Towards a Theory of Word Length Distribution. Journal of Quantitative Linguistics, 1, 98–106.
Zipf, G. K. (1965). The Psycho-Biology of Language. An Introduction to Dynamic Philology. Cambridge, MA: M.I.T. Press. (1st ed.: Boston: Houghton-Mifflin, 1935)

Appendix

Tab. 10: The inflectional paradigms of Old English nouns. Ns … Ip: nominative singular … instrumental plural; uml etc.: umlaut, syncope, elision, epenthesis, vocalization; g: gender; Q/W, B, Br, Ca: references (page numbers) to Quirk / Wrenn (1993), Baker (2007), Brunner (1965), Campbell (1987).

[The body of Tab. 10, the full paradigms 1–52 spanning several pages in the original, is not recoverable from this extraction.]

Tab. 11: Patterns and counts of the inflectional paradigms of Old English nouns. p1: pattern of affix partitions; f1: number of different word forms (≤ 10) per paradigm; c1: number of affixes per paradigm, with instrumental; c2: sum of the number of affixes and alternations per paradigm.

[The body of Tab. 11 is likewise not recoverable from this extraction.]

Arjuna Tuzzi1 – Reinhard Köhler2

Tracing the History of Words

Abstract: The temporal development of the frequencies of Italian words was observed in data taken from the corpus of the 64 end-of-year messages of the Italian presidents (1949−2012). For this study, the data were organised in 10 data sets, one for each president, each including all messages delivered by the same representative during his term of office. The Piotrowski-Altmann law was considered an appropriate model of the frequency dynamics over time. The results show that words can be ascribed to several categories of dynamics, and the parameters of the Piotrowski-Altmann law proved to be a way to cluster words portraying a similar temporal evolution.

Keywords: history of words, Piotrowski-Altmann law, categories of dynamics, presidential speeches, chronological corpora

1 Introduction

Many corpora include texts which have a temporal order (chronological corpora). Some of them collect texts of a specific domain or by individual authors. In this study we are interested in chronological corpora which reflect the use of language within a specific thematic domain, and in particular in those which are documents of an individual discourse. It can be expected that the frequency behaviour of “words” over time, i.e. in such chronological corpora, mirrors the changing relevance of concepts which are in the focus of the given discourse. The present contribution was inspired by previous studies (Trevisani & Tuzzi 2013; 2014) that explored the temporal pattern in words’ occurrences by means of functional (textual) data analysis and model-based curve clustering. In this new study, we attempt to highlight the distinctive features of texts, time spans, and clusters of words (e.g. words portraying the same temporal pattern) by means of a different, simpler, established and well-known linguistic model. The idea of “shaping” the history of words is quite uncommon in language studies: since the pioneering studies conducted by Migliorini (1960) on the Italian language, the main concern in the study of the history of language has always been dating the birth (or, sometimes, semantic changes) of individual words,

1 University of Padua; [email protected]
2 Universität Trier; [email protected]


but little attention has been paid to the fortunes (or death) of words, i.e. to the concept of the “quality of life” of words (Trevisani & Tuzzi 2013; 2014). In a chronological corpus the temporal evolution of a word can be expressed by the sequence of its frequencies over time (Tab. 1).

Tab. 1: Data taken from the corpus of end-of-year speeches of the presidents of the Italian Republic (1949−2012), excerpt of the lemma vocabulary. Columns: the years 1949–1953 and 2001–2012; rows: lemmas such as di_PREP, il_DET, e_CONG, essere_V, a_PREP, anno_N, Italia_NM, popolo_N, pace_N, governo_N, giustizia_N, fiducia_N, Parlamento_N, pubblico_A, civile_A, istituzione_N, società_N, famiglia_N, speranza_N; each row gives the lemma's temporal evolution.

[The cell values of Tab. 1 are not recoverable from this extraction.]

2 The Quest for Patterns

In any time-related linguistic analysis, timing is crucial to retrieve relevant information about the development of a language in terms of contents and topics. Although this may be obvious in a linguistic environment, it should be underlined


that chronological variation observable over a limited time span, e.g. as in the case of this 64-year study, concerns mainly the lexical level, i.e. the most superficial part of a language. As a consequence, only the occurrences of words were considered and other linguistic features (at the morphological or syntactic level) were disregarded in this study.1 In the results of any time-related analysis we also hope to find easy-to-recognise patterns, for example, words that show a constant presence over time, words whose presence in the corpus has regularly grown (an increasing trend), words that have regularly lost ground (a decreasing trend) or have disappeared, words that were very popular only for a short time span, etc. However, the frequencies of words taken from the corpus of Italian presidential speeches yielded the situation shown in Fig. 1, i.e. most words show irregular peak-and-valley trajectories that are difficult to interpret in terms of their temporal development. Nonetheless, despite the presence of multiple peaks, they may hide patterns of regularity and these patterns might be the basis upon which to find clusters of words that share a similar temporal evolution.

Fig. 1: The frequency dynamics of a set of words in the analysed texts

In our approach the shape is a theoretical continuous signal, i.e. a function f(t), and chronological data are its discrete observations. We aimed at increasing our understanding of the whole signal by picking out its main features.2 In many

1 Such chronological analyses do not belong to diachronic linguistics since language needs longer time-spans to experience significant changes.
2 Methods based on Fourier or similar analyses are not appropriate for our purposes, as the fundamental preconditions of such methods (e.g. periodicity) cannot be assumed. Moreover, time-series analysis also fails to be appropriate for our case. Time series analysis aims at studying the temporal correlation (e.g. finding a model able to explain the long- or short-term correlation in order to forecast) but its output is never a shape.


cases a simple function is not able to capture the irregular shape of the temporal development of a word. Nevertheless, we easily observed interesting “individual” stories, for example the clear logistic shape of the temporal evolution of the word Europa (Fig. 2).

Fig. 2: The word Europa ‘Europe’ shows a clear logistic shape

3 Data

The corpus of end-of-year messages includes 64 speeches delivered by (all) the Italian presidents Luigi Einaudi, Giovanni Gronchi, Antonio Segni, Giuseppe Saragat, Giovanni Leone, Sandro Pertini, Francesco Cossiga, Oscar Luigi Scalfaro, Carlo Azeglio Ciampi, and Giorgio Napolitano in the time span 1949–2012. Every year they are read by the president and broadcast on television and radio (radio only before 1956). The texts are available on the institutional web pages of the presidential office (www.quirinale.it). Besides the traditional good wishes of New Year’s Eve, the speeches are a rich source of information about the last sixty years of the history of Italy. The size of the whole corpus is 106,772 in terms of word-tokens. The original word types (10,720 vocabulary entries) were replaced by lemma types (6,449 lemma-vocabulary entries). The lemmatization process merely associated each word type with a pair which included a lemma and a grammatical category (Tab. 2),



and involved both a preliminary automatic screening (Taltac2 software, Bolasco 2010) of texts and a manual “human-expert” correction. The dependence of linguistic measures on text length is well known (Altmann 1978; Köhler & Galle 1993; Wimmer & Altmann 1999; Popescu 2009). Nevertheless, in order to attenuate the effect of the dimension of texts, absolute frequencies were replaced with relative frequencies (number of occurrences divided by text size).

Tab. 2: Some examples of lemmatization

type                        lemma and grammatical category
vuole [wants]               volere_V [(to) want_Verb]
istituzioni [institutions]  istituzione_N [institution_Noun]
intelligenti [clever pl.]   intelligente_A [clever_Adjective]
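The normalisation step just described is simple enough to state as code (a minimal sketch; the counts and text sizes below are invented placeholders, not values from the corpus):

    # Absolute lemma counts per year are divided by the year's text size
    # (number of word-tokens in that speech) to give relative frequencies.
    counts = {"pace_N": {1949: 2, 1950: 1, 2012: 4}}       # hypothetical counts
    text_size = {1949: 700, 1950: 850, 2012: 3100}         # hypothetical sizes

    rel_freq = {lemma: {year: n / text_size[year] for year, n in per_year.items()}
                for lemma, per_year in counts.items()}
    print(rel_freq["pace_N"][2012])   # 4 / 3100 ≈ 0.0013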

4 Previous Studies

In order to identify temporal patterns (i.e. curves) and cluster words portraying similar temporal patterns, Trevisani & Tuzzi (2013; 2014) proposed a model-based curve clustering in the frame of functional data analysis approaches, applied to a linguistic environment. Although curve clustering has long been studied with splines, Trevisani & Tuzzi selected wavelets as the basis for signal decomposition and applied the class of wavelet-based functional clustering mixed models proposed by Giacofci et al. (2013). The temporal evolution of each word is a function (functional observation) and is represented by a curve. The existence of L unknown clusters is considered, and in the linear functional mixed model each curve y_i is generated by a functional fixed effect µ_l (characterizing clusters), a random functional effect U_i (accounting for word-specific deviation from the cluster mean curve), and a random measurement error term E_i (a zero-mean Gaussian process).

y_i(t) = µ_l(t) + U_i(t) + E_i(t),   U_i(t) ∼ N(0, K_l[s, t])

(1)

Once defined in the functional domain, the function is transformed from an infinite-dimensional problem to a finite-dimensional one by means of a functionally based representation of the model (wavelet-based representation, discrete wavelet transform). The new representation of the model is based on a father


wavelet (or scaling) φ and a mother wavelet (or simply wavelet) ψ, and the curve y_i(t) has the following decomposition:

y_i(t) = Σ_{k=0}^{2^j0 − 1} c_{i,j0 k} φ_{j0 k}(t) + Σ_{j ≥ j0} Σ_{k=0}^{2^j − 1} d_{i,jk} ψ_{jk}(t)

(2)

The coefficients of the formula are

c_i = α_l + υ_i + ε_i ,   d_i = β_l + θ_i + ε_i

(3)

where α_l, β_l are the scaling and wavelet coefficients of the fixed effects µ_l(t), and υ_i, θ_i are the scaling and wavelet random coefficients of the random effects U_i(t). Previous studies were able to represent the profiles of individual words using the above-described mathematical model and to find clusters of words portraying similar temporal patterns; however, clear “regular” and “prototypical” patterns could not be observed. An example of the results obtained in Trevisani & Tuzzi (2013) is given in Fig. 3. Both studies by Trevisani & Tuzzi (2013; 2014) should be considered purely explorative. In their conclusions, the authors considered the model promising, but left open issues about its validity, reliability, interpretability, and simplicity (cf. Altmann 1978; 1988). Any method may sometimes work for “practical purposes”, but methods are often inappropriate or not interpretable in linguistic terms. For a linguistically meaningful and valid analysis of linguistic objects, linguistic models are required. The most reliable linguistic models are, of course, laws of language and text.

Fig. 3: An example of words (8 dashed lines) included in the same cluster (bold line)
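The published model is a wavelet-based mixed-effects mixture and is considerably richer than anything that fits in a few lines. The sketch below only illustrates the representation step that formula (2) describes, replacing each curve by its discrete wavelet coefficients, with a plain k-means stand-in for the model-based clustering (PyWavelets and scikit-learn assumed; the input series are random placeholders, not the corpus data):

    import numpy as np
    import pywt
    from sklearn.cluster import KMeans

    # One row per word: 64 yearly relative frequencies (placeholder data).
    series = np.random.default_rng(1).random((100, 64))

    # Discrete wavelet transform of each curve; concatenating the scaling and
    # wavelet coefficients gives the finite-dimensional representation of (2).
    coeffs = np.array([np.concatenate(pywt.wavedec(row, "haar")) for row in series])
    labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(coeffs)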


5 The History of Words

The material under study is not just a corpus or a set of texts but a sequence of texts representing an Italian political-institutional discourse. The temporal trajectory of the frequency of a word can be considered an indicator of the communicative relevance of a concept at a given time point. We can imagine, therefore, a limited number of different shapes that these trajectories can display, depending on the kind of temporal lapse of the discussed topics. There are, e.g., concepts which enjoy a steady but slowly growing interest in the public discussion, others which become gradually obsolete, topics which come and go again with variable speed, and even those which gain enormous interest within a few hours and vanish at the same velocity. There is an established law of a related phenomenon, the Piotrowski (or Piotrowski-Altmann) Law, which has been developed and used as a model of the diffusion of new elements in (and conversely, their gradual disappearance from) a linguistic community. The law was introduced in linguistics by Altmann (1983) as a model of language change, a long-term process which can take hundreds of years to show a clear effect. Hence, the model related the dimensions of time and degree of diffusion, whereas our phenomenon is a matter of time and frequency. We assume that this model is nevertheless a good representation of what goes on in our case. We relate time with the degree of diffusion of a word (representing a concept) in a given discourse. The basic form of the Piotrowski-Altmann (Altmann 1983) function is

p_t = C / (1 + a e^(−bt))    (4)

We have adopted the modified version proposed by Altmann (1983),

p_t = C / (1 + a e^(−bt + ct²))    (5)

which is also able to cover reversible courses of development. The dependent variable pt represents the proportion of a word at time t. For fitting the function to the data, we utilized Phillip H. Sherrod’s software NLREG version 6.3. The estimated parameters are the variables used for clustering words according to their type of process.3 To cluster words portraying a similar temporal pattern, i.e., a similar

3 Many definitions of a distance between two curves are available for curve clustering (often based on integrals) but they are not appropriate when signals are irregular.


shape for the fitted function, we applied an agglomerative hierarchical cluster algorithm with complete linkage on the basis of an Euclidean distance measure. We segmented the series of 64 texts into ten groups (Tab. 3), one for each of the presidents, and tried to obtain interpretable trajectories over the whole time span, which can be sorted into classes of identical or very similar shapes. We selected the first 300 most frequent lemmas for fitting the function and clustered 76 lemmas that performed better in terms of goodness of fit (R2>50%). The results of the clustering procedure are synthesized in Tab. 4.
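As a sketch of this pipeline (the study itself used NLREG; here scipy's curve_fit and hierarchical clustering stand in, the frequency matrix is a random placeholder rather than the real relative-frequency series, and R² is computed as 1 − SS_res/SS_tot with words below the 50% threshold discarded):

    import numpy as np
    from scipy.optimize import curve_fit
    from scipy.cluster.hierarchy import linkage, fcluster

    def piotrowski(t, C, a, b, c):
        """Modified Piotrowski-Altmann law, Eq. (5)."""
        return C / (1 + a * np.exp(-b * t + c * t ** 2))

    def fit_word(t, p):
        try:
            prm, _ = curve_fit(piotrowski, t, p,
                               p0=[max(p.max(), 1e-6), 1.0, 0.1, 0.0], maxfev=20000)
        except RuntimeError:                 # no convergence for this word
            return None, -np.inf
        resid = p - piotrowski(t, *prm)
        r2 = 1.0 - (resid ** 2).sum() / ((p - p.mean()) ** 2).sum()
        return prm, r2

    t = np.arange(10, dtype=float)                     # one point per president
    freq = np.random.default_rng(0).random((300, 10))  # placeholder frequencies
    fits = [fit_word(t, row) for row in freq]
    good = np.array([prm for prm, r2 in fits if prm is not None and r2 > 0.5])
    if len(good) > 1:                # complete linkage on Euclidean distances
        labels = fcluster(linkage(good, method="complete"), t=8, criterion="maxclust")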

Tab. 3: Excerpt of the lemma vocabulary, by occurrences for each president. Columns: Einaudi, Gronchi, Segni, Saragat, Leone, Pertini, Cossiga, Scalfaro, Ciampi, Napolitano; rows: the same lemmas as in Tab. 1.

[The cell values of Tab. 3 are not recoverable from this extraction.]


Tab. 4: Ten presidents, with the eight resulting clusters and their sizes in terms of words

Cluster       Members
1             17
2             5
3             3
4             8
5             13
6             8
7             5
8             6
singletons    11
Total         76

Fig. 4 shows some examples of words grouped by the algorithm into Cluster 1. This cluster includes words mainly addressed in the oldest speeches and that experienced a decreasing trend over time. It is interesting to observe that words like solidarietà ‘solidarity’, volontà ‘will’, pensiero ‘thought’, opera ‘work’, and patria ‘home’ seem to have lost their strength over time. Fig. 5, in contrast, shows some examples of words in Cluster 5 that share a logistic shape behavior, i.e., these words were less common in the past, then their importance increased and in recent years their presence became established. It is interesting to observe words in this cluster like Europa ‘Europe’, governo ‘government’, legge ‘law’, storia ‘history’, unità ‘unity’, and forza ‘strength’. Fig. 6 and 7 show, respectively, a cluster of words whose presence in the corpus has grown over time (Cluster 2: scelta ‘choice’, futuro ‘future’, and Italia ‘Italy’) and a cluster of words that were popular only for a short period. In this second cluster we observe, for example, that good news about economic growth (ripresa ‘recover’ and sviluppo ‘development’) gained interest within a specific optimistic time span and then quickly vanished. The word nazione ‘nation’ is interesting as it experienced a sort of semantic shift and today the political discourse prefers the word paese ‘country’.


Fig. 4. Cluster 1, six examples (out of 17) with decreasing trajectories

Fig. 5. Cluster 5, six examples (out of 13) with a logistic shape


Fig. 6. Cluster 2, three examples (out of 5) with an increasing trend

Fig. 7. Cluster 3, three examples (out of 3) with an increasing-decreasing shape

6 Conclusions

In their conclusions, Trevisani & Tuzzi (2013; 2014) highlighted some critical points in representing the temporal pattern of words as functional objects, the weaknesses of an explorative approach, and the lack of a linguistic theory to justify and interpret such a complex and extremely sophisticated model. As a consequence, the present study is based on a linguistic hypothesis which suggests subsuming the process we are studying here under a specific linguistic law. We assume that the individual kinds of dynamics reflect the relevance of the corresponding concepts in the political discourse, but we do not propose a political interpretation of the findings or a general theory. Many words do not display a specific trajectory over time, which does not, of course, come as a surprise. Word usage depends a great deal upon grammatical and stylistic circumstances; this holds in particular for the function words, which should not change along with the political or social development in a country. Furthermore, content words often show irregular behavior with respect to their frequency because their usage depends on unpredictable thematic circumstances, individual preferences and presidents’ personal traits. The Piotrowski-Altmann function is able to fit the sequence of relative frequencies and the cluster analysis is able to find some interesting groups of words


showing a similar temporal pattern. However, the cluster analysis approach raises further critical questions. The model is simple and justified from a theoretical viewpoint, but at the moment we are not able to interpret the parameters. Last but not least, we are not sure that the corpus of end-of-year presidential speeches is a good example of a chronological corpus: it is not chronological in the sense we expected, since time effects turn out to be less important than the presidents’ individual choices. It goes without saying that more general conclusions can be drawn only after follow-up studies on data from more corpora and possibly with other methods (e.g., smoothing or clustering).

References

Altmann, G. (1978). Zur Verwendung der Quotiente in der Textanalyse. In G. Altmann (Ed.), Glottometrika 1 (pp. 91–106). Bochum: Brockmeyer.
Altmann, G. (1983). Das Piotrowski-Gesetz und seine Verallgemeinerungen. In K.-H. Best & J. Kohlhase (Eds.), Exakte Sprachwandelforschung (pp. 59–90). Göttingen: Edition Herodot.
Altmann, G. (1988). Linguistische Meßverfahren. In U. Ammon, N. Dittmar & K. Mattheier (Eds.), Sociolinguistics. Soziolinguistik (pp. 1026–1039). Berlin / New York: de Gruyter.
Bolasco, S. (2010). TaLTaC 2.10. Sviluppi, esperienze ed elementi essenziali di analisi automatica dei testi. Milano: LED. (Series: Quaderni Taltac)
Cortelazzo, M. A., & Tuzzi, A. (Eds). (2007). Messaggi dal Colle. I discorsi di fine anno dei presidenti della Repubblica. Venezia: Marsilio Editori.
Giacofci, M., Lambert-Lacroix, S., Marot, G., & Picard, F. (2013). Wavelet-based Clustering for Mixed-Effects Functional Models in High Dimension. Biometrics, 69(1), 31–40.
Köhler, R., & Galle, M. (1993). Dynamic Aspects of Text Characteristics. In L. Hřebíček & G. Altmann (Eds.), Quantitative Text Analysis (pp. 46–53). Trier: WVT Wissenschaftlicher Verlag Trier.
Migliorini, B. (1960). Storia della lingua italiana. Firenze: Sansoni.
Popescu, I.-I. (2009). Word Frequency Studies. Berlin: Mouton de Gruyter.
Popescu, I.-I., Mačutek, J., & Altmann, G. (2009). Aspects of Word Frequencies. Lüdenscheid: RAM-Verlag. (Studies in Quantitative Linguistics)
Trevisani, M., & Tuzzi, A. (2013). Shaping the History of Words. In I. Obradović, E. Kelih & R. Köhler (Eds.), Methods and Applications of Quantitative Linguistics: Selected Papers of the VIIIth International Conference on Quantitative Linguistics (QUALICO), Belgrade, Serbia, April 16–19 (pp. 84–95). Beograd: Akademska Misao.
Trevisani, M., & Tuzzi, A. (2014). A Portrait of JASA: the History of Statistics through Analysis of Keyword Counts in an Early Scientific Journal. Quality and Quantity. (online first version)
Wimmer, G., & Altmann, G. (1999). Review Article: On Vocabulary Richness. Journal of Quantitative Linguistics, 6(1), 1–9.

Relja Vulanović1 – Tatjana Hrubik-Vulanović2

Grammar Efficiency and the Idealization of Parts-of-speech Systems

Abstract: Descriptions of parts-of-speech systems in different languages are often idealized for typological purposes. A sample of 20 languages, in which each language is described in two different ways, is used in order to determine whether there is a statistically significant difference between the descriptions. The data for the statistical analysis are obtained by converting the grammatical structure of simple intransitive sentences to the numerical value of grammar efficiency. It is shown that the grammar-efficiency values of a more detailed and realistic description are significantly different from those of an idealized description.

Keywords: grammar efficiency, Hengeveld's parts-of-speech systems, sign test

1 Introduction

Any attempt to establish a classification of languages based on some criteria has to be accompanied by a certain level of idealization. Hengeveld's (1992) classification of parts-of-speech (PoS) systems is no exception. Regarding the seven PoS system types he proposes, Hengeveld (1992: 47) states that “languages at best show a strong tendency towards one of the types.” This is echoed by Rijkhoff (2007: 718), who, speaking of the same classification, says: “It must be emphasized, however, that the various types of PoS systems […] should be regarded as reference points on a scale rather than distinct categories. Because languages are dynamic entities, they can only approximate the ideal types in this classification.” The original classification containing seven PoS system types is extended to 13 types in Hengeveld, Rijkhoff, & Siewierska (2004). Fifty languages, representing a genetically, geographically, and typologically diverse sample, are classified in that paper and their characteristics relevant to the classification are discussed. The same sample is used in Vulanović & Köhler (2009) and Vulanović (2010) to analyze the correlation between the PoS system type and some other linguistic properties. However, Hengeveld & van Lier (2010a) show that there exist natural languages with PoS systems that possess parts of speech which are not considered in Hengeveld (1992) or Hengeveld et al. (2004). These PoS systems do not belong to any

1 Kent State University at Stark; [email protected]
2 Kent State University at Stark; [email protected]


of the originally introduced types. Another paper by Hengeveld & van Lier (2010b) presents information about PoS systems found in 22 languages, 17 of which are the same as in the Hengeveld et al. (2004) sample. The PoS systems of the 17 languages are in some cases described in the exact same way in both Hengeveld et al. (2004) and Hengeveld & van Lier (2010b), and in other cases there are differences which do not always seem to be negligible. Most of the differences are such that some parts of speech are omitted in Hengeveld et al.’s (2004) description of PoS systems. Since the description in Hengeveld & van Lier (2010b) is more detailed, a natural question is whether the idealization accompanying the classification in Hengeveld et al. (2004) significantly changes the overall picture that the sample presents. We answer this question here by statistically analyzing the numerical data obtained through the calculation of grammar efficiency for each PoS system of interest. To the 17 languages belonging to both samples, we add Hungarian, Tagalog, and West Greenlandic, the only languages that are considered in both Hengeveld & van Lier (2010a) and Hengeveld et al. (2004), but that are not in the Hengeveld & van Lier (2010b) sample. Therefore, we discuss 20 languages such that the PoS system of each is described in both Hengeveld et al. (2004) on the one hand, and Hengeveld & van Lier (2010a,b)1 on the other, the two descriptions not always being different. We calculate the grammar efficiency of each PoS system described and get two sets of data, for the Hengeveld et al. (2004) systems and for the corresponding Hengeveld & van Lier (2010a,b) systems. Then we analyze statistically whether the two sets of data are significantly different or not. We choose grammar efficiency to quantitatively represent the structure of each PoS system because grammar efficiency correlates with PoS system types (Vulanović & Miller 2010). On the other hand, PoS system types correlate with other linguistic features, as we have already mentioned referring to Hengeveld et al. (2004), Vulanović & Köhler (2009), and Vulanović (2010) (see also Rijkhoff 2007; Hengeveld & van Lier 2010b; and Hengeveld 2013). Other papers where the grammar efficiency of PoS systems is measured include Vulanović (2008, 2012, 2013). Most papers on the grammar efficiency of PoS systems are based on the approach to grammar efficiency which is developed in Vulanović (2003, 2007).2 Two types of grammar efficiency are distinguished in Vulanović (2007), absolute

1 There are languages which are considered in all three papers. Between Hengeveld  &  van Lier’s  (2010a) and Hengeveld  &  van Lier’s  (2010b) descriptions of these languages, the latter one is chosen since Hengeveld  &  van Lier (2010b) is a  newer and more detailed source than Hengeveld & van Lier (2010a). 2 Vulanović (2003) is a more formal, mathematical predecessor of Vulanović (2007). Both papers, but particularly Vulanović (2007), also deal with grammar complexity, which is defined as the reciprocal of grammar efficiency.


and relative, the latter being a scaled version of the former. A simplified version of this approach is introduced in Vulanović (2013). It is developed for calculating grammar efficiency in the situation when word order is fixed. Fixed word orders are going to be considered primarily in the present paper and this is why we are going to use the grammar-efficiency formula from Vulanović (2013). Hengeveld’s  word classes and the PoS systems of interest are described in Section 2. Then, in Section 3, the grammar-efficiency formula is presented and applied to the PoS systems. The results are analyzed statistically in Section 4. Finally, Section 5 offers some concluding remarks.

2 Word Classes and PoS Systems

In Hengeveld's (1992) approach, word classes are defined according to what propositional functions they can fulfill. Four propositional functions (or syntactic slots) are considered: P, the head of the predicate phrase; R, the head of the referential (noun) phrase; r, the modifier of the referential-phrase head; and p, the modifier of the predicate-phrase head. Whereas P and R are obligatory, r or p, or both, may be missing in some languages. For instance, Mandarin Chinese has no slot for the modifier of the predicate phrase (Hengeveld et al. 2004, Hengeveld & van Lier 2010b). Verbs, nouns, adjectives, and manner adverbs3 are the typical word classes fulfilling the P, R, r, and p functions, respectively. These four word classes are the only rigid ones, a rigid word class being specialized for one and only one propositional function. There also exist flexible word classes, those that can have two or more different functions. The word classes of interest in this paper are presented in Tab. 1. Whereas the only propositional function verbs can have is the head of the predicate phrase, other word classes are defined by their predominant usage and may have functions different from those indicated in the table.

Tab. 1: Word classes

Word class       P  R  r  p
Verbs            V  -  -  -
Nouns            -  N  -  -
Adjectives       -  -  a  -
Manner adverbs   -  -  -  m
Heads            H  H  -  -
Nominals         -  N  N  -
Modifiers        -  -  M  M
Non-verbs        -  Λ  Λ  Λ
Contentives      C  C  C  C

3 Only manner adverbs are considered because other adverbs typically modify the whole sentence and not just the head of the predicate phrase.

The word classes in Tab. 1 are combined to create different PoS systems. Seven main PoS system types are introduced in Hengeveld (1992), three flexible and four rigid. A PoS system is flexible if it has at least one flexible word class, and it is rigid if all of its word classes are rigid. The 13 types of Hengeveld et al. (2004) are obtained by the insertion of six intermediate types, one between each pair of the consecutive main types. Flexible intermediate types are characterized by the presence of two word classes with overlapping propositional functions, whereas rigid intermediate types have one small, closed word class. The classification in (2004) does not cover all possible PoS system types. Other main types are reported in Hengeveld & van Lier (2010a) as attested and Hengeveld & van Lier (2010b) reveals that the way word classes can be combined to form a PoS system can be even more intricate. Tab. 2 shows the Hengeveld et al. (2004) and Hengeveld & van Lier (2010a,b) descriptions of the PoS systems in the 20 languages considered here. In eight of those languages, at least one description shows a flexible PoS system—these languages are grouped together in the beginning of the table. Twelve PoS systems are rigid in both descriptions. In Hengeveld & van Lier (2010b), there is a minor uncertainty regarding the PoS systems of Kayardild and Berbice Dutch. Depending on how the latter is treated, seven or eight languages have the same descriptions in Hengeveld et al. (2004) and Hengeveld & van Lier (2010a,b), three in the flexible PoS system group, and four or five in the rigid group. It should be mentioned that Hengeveld & van Lier (2010a) describes the PoS system of West Greenlandic as having N, V, and H as the only word classes. However, intermediate PoS systems are not considered there and no attention is paid to small and closed word classes. The original source, Sadock (2003), seems to confirm that a  small and closed class of adjectives is present in this PoS system, as reported in Hengeveld et al. (2004).
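The flexible/rigid distinction can be stated compactly. A minimal sketch under the definitions just given (a system is rigid exactly when every word class fills one and only one slot; the mapping format is an illustration, not the authors' notation):

    def system_type(classes):
        """classes maps each word class to the set of slots (P, R, r, p) it fills."""
        return "rigid" if all(len(f) == 1 for f in classes.values()) else "flexible"

    print(system_type({"V": {"P"}, "N": {"R"}, "a": {"r"}, "m": {"p"}}))  # rigid
    print(system_type({"V": {"P"}, "Λ": {"R", "r", "p"}}))                # flexible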


Tab. 2: The descriptions of PoS systems in 20 languages (parentheses indicate small, closed word classes). For each language the table gives the word classes filling the P, R, r, and p slots according to Hengeveld et al. (2004) and according to Hengeveld & van Lier (2010b), or (2010a) for Hungarian, Tagalog, West Greenlandic, and the second description of Tuscarora; “same” marks identical descriptions. The flexible group comprises Samoan, Imbabura Quechua, Turkish, Lango, Hungarian, Tagalog, Kayardild, and West Greenlandic; the rigid group comprises Abkhaz, Basque, Georgian, Pipil, Berbice Dutch, Alamblak, Tamil, Mandarin Chinese, Garo, Nivkh, Krongo, and Tuscarora.

[The cells of Tab. 2 are too garbled in this extraction to be reproduced reliably; see Hengeveld et al. (2004) and Hengeveld & van Lier (2010a,b) for the full descriptions.]


3 Grammar Efficiency

We first introduce a formal grammar which is to be used in the description of PoS systems and, consequently, in the determination of their grammar efficiency. We assume for simplicity that both predicate and referential phrases are continuous. Then, a string of word classes used in the PoS system represents a sentence if this string can be interpreted as at least one of the following 18 orders of propositional functions:

a. PR, RP, PRr, PrR, RrP, rRP, PpR, pPR, RPp, RpP,
b. PpRr, PprR, pPRr, pPrR, RrPp, RrpP, rRPp, rRpP.

(1)

Any PoS system can be combined with different word orders, which are described here by the corresponding orders of propositional functions. The information about word order is taken here from Hengeveld et al. (2004). Of the 20 languages we consider, 18 have an identifiable basic word order, which is the only order to be modeled, even though it is not necessarily fixed and some alternate word orders may exist. For instance, the basic word order in Imbabura Quechua can be represented as rRpP.4 If the PoS system of this language is taken to be as described in Hengeveld et al. (2004), then the following formal sentences, given below together with their interpretations, can be formed.

ΛV → RP
ΛΛV → RpP | rRP
ΛΛΛV → rRpP

(2)

The sentences ΛV and ΛΛΛV are unambiguous, each having one interpretation only, but ΛΛV is ambiguous because it can be interpreted in two different ways, RpP or rRP, as indicated in (2). In the Hengeveld & van Lier (2010b) description, the sentences and their interpretations are:

ΛV → RP
ΛΛV → RpP | rRP
ΛmV → RpP
ΛΛΛV → rRpP
ΛΛmV → rRpP

(3)

It should be noted that when grammar efficiency is discussed, it is irrelevant whether a word class is small/closed or large/open because a small/closed class can still be used as frequently as a large/open one.5 This is why the small and closed class of manner adverbs in Imbabura Quechua is represented in (3) simply as m.

4 The order pPrR is also possible in Imbabura Quechua and is accompanied with a  referential-phrase marker. This is modeled in Vulanović (2012), but is not considered here. 5 Another fact irrelevant for grammar efficiency is whether a word class is derived or not. All derived word classes are indicated in Hengeveld & van Lier (2010b), but this is omitted in the language representation here.


Generally speaking, the PoS system and the basic word order constitute a grammar in which sentences are interpreted according to what word classes are used to fulfill which propositional functions. In order to make a fine differentiation between various grammatical structures, sentences are analyzed in each grammar, as in the approach of regulated rewriting (Dassow and Pǎun 1989). Each word-class symbol is rewritten as any propositional function it can fulfill,6 keeping the continuity of both predicate and referential phrases (1) in mind. The information about the permitted orders of propositional functions is only used after all possible sentence-analyses are obtained. This information serves as a regulation that may eliminate some of the analyses. To illustrate this, let us consider Hengeveld et al.'s (2004) description of Imbabura Quechua and the sentences in (2). Their initial analyses are as follows:

ΛV → *RP* | r- | pP-
ΛΛV → RrP | *RpP* | *rRP* | p-
ΛΛΛV → RrpP | Rp- | *rRpP* | p-    (4)

The final analyses are marked here with asterisks. They are obtained after the prescribed order of propositional functions is applied. Note how this eliminates the RrP analysis of the sentence ΛΛV, but still leaves this sentence ambiguous. Some of the analyses are only attempted and cannot be completed successfully; this is indicated by a hyphen. For instance, when Λ in ΛV is interpreted as r, the analysis has to be abandoned because r needs to be followed by R, and V does not render this function. Let AA denote the total number of analysis attempts, whether they are finished successfully or not, and let US be the total number of unambiguous sentences. We define absolute grammar efficiency, AGE, as in (5).

AGE = US / AA

(5)

This is the same definition as in Vulanović (2013) and is based on the following points. Greater amount of ambiguity should imply smaller efficiency, all other things being equal (Frazier 1985: 135, Hawkins 2004: 38). This is why AGE is directly proportional to US. Since smaller values of AA indicate simpler analyses of sentences, AGE is inversely proportional to AA.

6 It is assumed that the analysis is carried out from left to right, one word-class symbol at a time, and without knowing the sentence length in advance. This is not intended to model how the human mind works.


The quotient US/AA replaces the so-called parsing ratio from Vulanović (2003, 2007), which is more complicated to calculate. In the example considered above, see (4), US = 2 and AA = 11, giving AGE = 2/11 = 0.182 for the Hengeveld et al. (2004) description of the PoS system in Imbabura Quechua. For the Hengeveld & van Lier (2010b) description, we have AGE = 4/18 = 2/9 = 0.222. The counts US = 4 and AA = 18 come from the fact that there are two more sentences in (3) than in (2), and they are analyzed as follows:

ΛmV → *RpP* | r- | p-
ΛΛmV → RrpP | Rp- | *rRpP* | p-

(6)

This adds two more unambiguous sentences and seven more analysis attempts to the counts in (4). It can be immediately noted that AGE = 1 for all rigid PoS systems. This is so because in a rigid PoS system there are no ambiguous sentences and every sentence has only one analysis attempt, which is successful. As for the eight flexible languages, their values of absolute grammar efficiency are given in Tab. 3.

Tab. 3: AGE of the eight flexible languages in the sample

Language | Basic word order | AGE, Hengeveld et al. (2004) | AGE, Hengeveld & van Lier (2010b/a)
Samoan | PpRr | 2/32 = 1/16 = 0.062 | 4/43 = 0.093
Imbabura Quechua | rRpP | 2/11 = 0.182 | 4/18 = 2/9 = 0.222
Turkish | rRpP | 7/25 = 0.280 | same
Lango | RrPp | 2/3 = 0.667 | same
Hungarian | rRpP | 1 | 4/8 = 1/2 = 0.500
Tagalog | PRr, PrR | 1/10 = 0.100 | same
Kayardild | rR[p]P, [p]PrR | 1 | 16/44 = 4/11 = 0.364
West Greenlandic | RrP | 1 | 8/12 = 2/3 = 0.667

The two descriptions are identical for Turkish, Lango, and Tagalog. This leaves only five flexible languages that have different AGE values corresponding to their different descriptions. Hungarian, Kayardild, and West Greenlandic are described in Hengeveld et al. (2004) as rigid languages and this is why AGE = 1 in these three cases. Of the 20 languages, Tagalog and Kayardild are the only ones for which the basic word order cannot be identified. This is why two orders of propositional functions are considered in these languages. For Kayardild, the modifier of the


predicate phrase is bracketed because this propositional function is missing in the Hengeveld et al. (2004) description, but present in the Hengeveld & van Lier (2010b) description.

Let l denote the number of propositional functions in a PoS system and let k be the number of word classes used. All grammars sharing the same value of l, as well as the same value of k, belong to the same family of grammars, Γ(l,k). In the case of each of the five languages in Tab. 3 which are described differently in Hengeveld et al. (2004) and Hengeveld & van Lier (2010a,b) (Samoan, Imbabura Quechua, Hungarian, Kayardild, and West Greenlandic), the two PoS systems belong to different families of grammars. In this situation, relative, rather than absolute, grammar efficiency may be more appropriate for comparing two PoS systems. Relative grammar efficiency, RGE, shows how efficient a grammar is in comparison to the most efficient grammar in the same family of grammars. By using RGE, we use the same yardstick to measure the efficiency of grammars belonging to different families (Vulanović 2003, 2007, 2013). A formula for calculating relative grammar efficiency can be obtained by scaling the formula in (5). It is convenient to represent the scaling coefficient as c · l/k (cf. Vulanović 2007).

RGE = c · AGE · l/k (7)

The ratio l/k in the formula in (7) is in agreement with the following point. A grammar is more efficient if it conveys more information with fewer grammatical devices. In our model, propositional functions represent the information conveyed by word classes. This is why RGE is directly proportional to l and inversely proportional to k. The order of propositional functions is another grammatical device used, but it is already incorporated in AGE through the counts US and AA. The coefficient c in (7) depends on l and k. If l ≥ k, c is defined so that RGE = 1 for the grammar G*(l,k) which has the greatest AGE value in Γ(l,k), while all its sentences are unambiguous. In all other cases, i.e., when either l < k or G*(l,k) does not exist, the coefficient is defined to be c = 1. In this way, RGE ≤ 1, with the value RGE = 1 indicating maximally efficient grammars. The Hengeveld et al. (2004) description of the Samoan PoS system has l = 4 and k = 1. It is impossible to avoid ambiguity in the Γ(4,1) family of grammars, thus G*(4,1) does not exist and c = 1 for this family. We also have c = 1 in the Hengeveld & van Lier (2010a) description of the West Greenlandic PoS system because in this grammar l = 3 < k = 4. Moreover, if l = k, the family Γ(l,k) contains a grammar with the rigid PoS system, which is the grammar G*(l,k) having the greatest AGE value, AGE = 1. Therefore, c is defined as c = 1 when l = k. This is why c = 1 in both descriptions of Lango, as well as of Kayardild, and in the Hengeveld


et al. (2004) descriptions of Hungarian and West Greenlandic. For the remaining PoS systems in Tab. 3, we need to find G*(4,2), G*(4,3), and G*(3,1). This has already been done in Vulanović (2013) for l = 4 and k = 2, 3. The greatest value of AGE in Γ(4,2) is 1/2, and it is achieved by the grammar with H and M as the only word classes and with the propositional functions ordered as pPRr or rRPp. This is also why c = 1 for the Hengeveld et al. (2004) description of Imbabura Quechua and for the Hengeveld & van Lier (2010b) description of Samoan. In Γ(4,3), the greatest value of AGE is 2/3. There are several grammars achieving this, one of them being a type 3 grammar with the PpRr order of propositional functions. This means that c should be set equal to 9/8 in Γ(4,3), i.e., for the Hengeveld & van Lier (2010b) description of Imbabura Quechua, for the Hengeveld & van Lier (2010a) description of Hungarian, and for both descriptions of Turkish. Finally, we need G*(3,1) for both descriptions of Tagalog. This grammar has the PoS system of Tagalog, but only one fixed order of propositional functions, which may be any of the four possible orders. In this grammar, AGE = 1/5, giving c = 5/3 in this case. All RGE values of the PoS systems of interest are presented in Tab. 4. Their differences are calculated by subtracting the RGE values for the Hengeveld & van Lier (2010a,b) description from the corresponding values for the Hengeveld et al. (2004) description. There are five non-zero differences and all of them are positive. For the remaining 12 languages in Tab. 2, the RGE values are all equal to 1, so that 12 more differences in the whole sample are equal to 0.

Tab. 4: RGE of the eight flexible languages in the sample

Language | RGE, Hengeveld et al. (2004) | RGE, Hengeveld & van Lier (2010b/a) | Difference
Samoan | 1/4 | 8/43 | 0.064
Imbabura Quechua | 4/11 | 1/3 | 0.030
Turkish | 7/25 | same | 0
Lango | 2/3 | same | 0
Hungarian | 1 | 3/4 | 0.250
Tagalog | 1/2 | same | 0
Kayardild | 1 | 4/11 | 0.636
West Greenlandic | 1 | 1/2 | 0.500
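The arithmetic behind Tab. 3 and Tab. 4 is easy to verify. The sketch below (ours, not the authors') recomputes a few RGE values from formulas (5) and (7); the counts US and AA, the values of l and k, and the coefficients c are copied from the text, and exact fractions are used so that the output matches the tables.

    from fractions import Fraction as F

    def age(us, aa):
        # Absolute grammar efficiency (5): unambiguous sentences over analysis attempts.
        return F(us, aa)

    def rge(age_value, l, k, c=F(1)):
        # Relative grammar efficiency (7): RGE = c * AGE * l/k.
        return c * age_value * F(l, k)

    # Imbabura Quechua, Hengeveld et al. (2004): US = 2, AA = 11, l = 4, k = 2, c = 1.
    print(rge(age(2, 11), 4, 2))            # 4/11, as in Tab. 4
    # Imbabura Quechua, Hengeveld & van Lier (2010b): US = 4, AA = 18, l = 4, k = 3, c = 9/8.
    print(rge(age(4, 18), 4, 3, F(9, 8)))   # 1/3, as in Tab. 4
    # Tagalog, both descriptions: AGE = 1/10, l = 3, k = 1, c = 5/3.
    print(rge(F(1, 10), 3, 1, F(5, 3)))     # 1/2, as in Tab. 4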


4 Statistical Analysis

The data set consists of paired relative-grammar-efficiency values for each language in the sample. The idealized structures, presented in the Hengeveld et al. (2004) description of PoS systems, are expected to be simpler and, because of that, more efficient. This can be observed in each pair of different RGE values. Therefore, we want to test the null hypothesis

H0: The median of grammar-efficiency values in the Hengeveld et al. (2004) description is equal to the median of grammar-efficiency values in the Hengeveld & van Lier (2010a,b) description,

against the alternative hypothesis

H1: The median of grammar-efficiency values in the Hengeveld et al. (2004) description is greater than the median of grammar-efficiency values in the Hengeveld & van Lier (2010a,b) description.

In order to use a parametric test, we need either normally distributed data or a sufficiently large sample size (at least 30). Since neither condition is satisfied in our case, we have to use a nonparametric test. We first considered the Wilcoxon signed-ranks test, but the distribution of the data differences is not symmetrical, so this test cannot be justified. Instead, we used the one-tailed sign test, as described in Triola (1995), with a significance level of α = 0.05. In this test, the ties, represented by zeros, do not count. We only need the number of non-zero differences, which is n = 5 in our case. The test statistic, x, is the number of times the less frequent sign occurs. Since all non-zero differences are positive, the number of negative differences is x = 0. The SPSS program cannot run the test with n = 5, but according to Table A-7 in Triola (1995: A-15), the critical value for n = 5 is 0. Therefore, the data support the rejection of H0.

Another possibility is to follow the recommendation for one-tailed sign tests (Triola 1995: 668), which is to count the 15 ties as negative differences in support of H0. When this is done, we have n = 20, and in that case the same table, Table A-7, indicates 5 as the critical value, which is equal to the number of positive differences, x = 5. Alternatively, SPSS can be used with n = 20, and it produces p = 0.041. In either case, H0 should be rejected.

Based on the above results, we can conclude that the different descriptions of PoS systems produce RGE values which are significantly different at the α = 0.05 level.
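The sign test used here reduces to an exact binomial test, so these p-values can be reproduced without SPSS or printed tables. A minimal sketch (ours), assuming SciPy ≥ 1.7 for scipy.stats.binomtest:

    from scipy.stats import binomtest

    # Ties dropped: n = 5 non-zero differences, x = 0 negative signs.
    print(binomtest(k=0, n=5, p=0.5, alternative='less').pvalue)   # 0.03125 < 0.05
    # Ties counted as negative differences: n = 20, x = 5 positive signs.
    print(binomtest(k=5, n=20, p=0.5, alternative='less').pvalue)  # ~0.021, one-tailed
    print(binomtest(k=5, n=20, p=0.5).pvalue)                      # ~0.041, two-sided

The two-sided value corresponds to what SPSS reports by default for this test.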


5 Conclusion

In this paper, we represent the grammatical structure of parts-of-speech systems by a formal model and calculate grammar efficiency for each structure. Any modeling involves a certain degree of simplification. Generally speaking, grammar efficiency (or, equivalently, complexity) can only be analyzed locally, in a restricted area of grammar. According to Miestamo (2008), overall, global language complexity is impossible to measure using our current, limited linguistic and mathematical tools. The local area of grammar we deal with here consists of simple intransitive sentences which only carry the information about up to four propositional functions that are of interest in Hengeveld's (1992) approach to parts-of-speech systems. This captures the essence of parts-of-speech systems. Any other grammatical categories or grammatical markers are excluded from the model considered here, although parts-of-speech systems may use such grammatical devices for disambiguation (cf. Vulanović 2012). The model is exclusively focused on the differences between the more idealized description of parts-of-speech systems in Hengeveld et al. (2004) and the less idealized description in Hengeveld & van Lier (2010a,b).7 These differences mainly come from the omission of some parts of speech in Hengeveld et al. (2004).

We calculate the grammar efficiency of simple intransitive sentences for 20 languages whose parts-of-speech systems are described in the two different ways, and we use the data to analyze statistically whether there is a significant difference between the two kinds of description. We answer this question in the affirmative, in spite of the fact that in some cases (rigid parts-of-speech systems) grammar efficiency remains the same even though the two descriptions are different. In conclusion, linguists should be careful when idealizing the structure of parts-of-speech systems, or any linguistic structure in general.

7 However, it should be pointed out that although the structures of parts-of-speech systems are described more realistically in Hengeveld & van Lier (2010a,b) than in Hengeveld et al. (2004), some idealization may still be present.


References

Dassow, J., & Pǎun, G. (1989). Regulated Rewriting in Formal Language Theory. New York: Springer.
Frazier, L. (1985). Syntactic Complexity. In D. R. Dowty, L. Karttunen & A. M. Zwicky (Eds.), Natural Language Parsing: Psychological, Computational, and Theoretical Perspectives (pp. 129–189). Cambridge: Cambridge University Press.
Hawkins, J. A. (2004). Efficiency and Complexity in Grammars. Oxford / New York: Oxford University Press.
Hengeveld, K. (1992). Parts of Speech. In M. Fortescue, P. Harder & L. Kristoffersen (Eds.), Layered Structure and Reference in Functional Perspective (pp. 29–55). Amsterdam/Philadelphia: John Benjamins.
Hengeveld, K. (2013). Parts-of-speech System as a Basic Typological Determinant. In J. Rijkhoff & E. van Lier (Eds.), Flexible Word Classes: Typological Studies of Underspecified Parts of Speech (pp. 31–55). Oxford: Oxford University Press.
Hengeveld, K., Rijkhoff, J., & Siewierska, A. (2004). Parts-of-speech Systems and Word Order. Journal of Linguistics, 40, 527–570.
Hengeveld, K., & van Lier, E. (2010a). An Implicational Map of Parts of Speech. Linguistic Discovery, 8, 129–156.
Hengeveld, K., & van Lier, E. (2010b). Parts of Speech and Dependent Clauses in Functional Discourse Grammar. In U. Ansaldo, J. Don & R. Pfau (Eds.), Parts of Speech: Empirical and Theoretical Advances (pp. 253–285). Amsterdam/Philadelphia: John Benjamins.
Miestamo, M. (2008). Grammatical Complexity in a Cross-Linguistic Perspective. In M. Miestamo, K. Sinnemäki & F. Karlsson (Eds.), Language Complexity: Typology, Contact, Change. Studies in Language Companion Series 94 (pp. 23–41). Amsterdam: Benjamins.
Rijkhoff, J. (2007). Word Classes. Language and Linguistics Compass, 1, 709–726.
Sadock, J. M. (2003). A Grammar of Kalaallisut (West Greenlandic Inuttut). Languages of the World / Materials 162. München: LINCOM Europa.
Triola, M. F. (1995). Elementary Statistics (6th ed.). Reading, MA: Addison-Wesley.
Vulanović, R. (2003). Grammar Efficiency and Complexity. Grammars, 6, 127–144.
Vulanović, R. (2007). On Measuring Language Complexity as Relative to the Conveyed Linguistic Information. SKY Journal of Linguistics, 20, 399–427.
Vulanović, R. (2008). A Mathematical Analysis of Parts-of-speech Systems. Glottometrics, 17, 51–65.
Vulanović, R. (2010). Word Order, Marking, and a Two-Dimensional Classification of Parts-of-speech System Types. Journal of Quantitative Linguistics, 17, 229–252.
Vulanović, R. (2012). Efficiency of Grammatical Markers in Flexible Parts-of-speech Systems. In S. Naumann, P. Grzybek, R. Vulanović & G. Altmann (Eds.), Synergetic Linguistics: Text and Language as Dynamic Systems (pp. 241–256). Wien: Praesens.
Vulanović, R. (2013). Efficiency of Word Order in Flexible Parts-of-speech Systems. In R. Köhler & G. Altmann (Eds.), Issues in Quantitative Linguistics, Vol. 3. Studies in Quantitative Linguistics 13 (pp. 150–167). Lüdenscheid: RAM-Verlag.
Vulanović, R., & Köhler, R. (2009). Word Order, Marking, and Parts-of-speech Systems. Journal of Quantitative Linguistics, 16, 289–306.
Vulanović, R., & Miller, B. (2010). Grammar Efficiency of Parts-of-speech Systems. Glottotheory, 3(2), 65–80.

Yanru Wang1 – Xinying Chen2

Structural Complexity of Simplified Chinese Characters

Abstract: In this paper, we studied, from a synergetic perspective, the relationship between the structural complexity and frequency of simplified Chinese characters. We measured the structural complexity of simplified Chinese characters by both the number of strokes and the number of components. Then, we tested whether the relationship between the structural complexity of simplified Chinese characters and their frequency fits Zipf's law by analyzing the 3061 most frequent simplified Chinese characters from the Chinese Character Frequency Dictionary. The results show that the relationship between the structural complexity of simplified Chinese characters, under both measurements, and the frequency of those characters abides by the Zipf-Mandelbrot law.

Keywords: Zipf's law, structural complexity, Chinese characters, components, strokes, frequency

1 Introduction

Existing studies of Chinese morphology discuss the morphemic combination of characters into words (Packard 2000) and the placement of aspectual markers (Stallings 1975). However, one topic in this area is rarely explored, namely the structural complexity of Chinese characters. Little theoretical research has addressed this question, and most studies of the structural complexity of Chinese characters have been driven by interest in natural language processing applications and the teaching of Chinese (Bunke & Wang 1997). Although there are a few exceptions which have tried to discuss the structural complexity of Chinese characters from a theoretical point of view, the lack of sufficient data and solid theoretical standpoints is a problem common to all of them (Wang 2007). This study addresses the question of how to measure the structural complexity of simplified Chinese characters by using the methods that Köhler and colleagues (2005) and Wang (2011) established and tested for morphological research on inflected languages.

1 Xi'an Jiaotong University & East China Normal University; [email protected]
2 Xi'an Jiaotong University & Goethe University; [email protected]


Synergetic linguistics, as proposed by Köhler and colleagues (2005), regards language as a self-organizing and self-regulating dynamic system, and moreover provides a linguistic framework built on that foundation. Köhler (1986) built and tested a synergetic-linguistic model of a lexical subsystem of the German language. The model described, and in a way explained, the relationships between different linguistic features, such as the relationships between structural complexity, number of meanings, and frequency of language units, and it has been proved applicable to many languages. For Chinese, Wang's (2011) work proved that the relationship between Chinese words' polysemy and word length fits this model. Wang (2014) also studied the relationship between word length, polysemy, polytextuality, and frequency of Chinese words according to this model. The results further proved the model's applicability to the Chinese lexical subsystem. On the basis of this model, which has been proved applicable to many other linguistic features of Chinese, we speculated that more complex characters would be less frequent, due to the minimization of effort in language production (similar to the principle of least effort proposed by Zipf 1949), and that the relation between the frequency and complexity of characters should abide by Zipf's law.

However, there is still no agreement among Chinese linguists on the question of whether the number of strokes (referred to as NS) or the number of components (referred to as NC) is the better measurement of the structural complexity of Chinese characters. Before we discuss this issue further, we will first give some explanation of the concepts of stroke and component. A stroke is commonly known as the basic unit of the structure of Chinese characters: a continuous line that people write in one motion. The definition of component is not yet agreed upon, even among Chinese scholars. In this paper, we use the definition of Fu (1991) and Pan (2002) for its clarity. According to Fu (1991) and Pan (2002), a component is a structural unit that is larger than a stroke and no larger than a Chinese character. It is a relatively independent part that is detachable and is either a stroke or a structural block made up of strokes. Take the Chinese character 蜜 mi 'honey made by bees', for example: it is made up of the three components 宀, 必, and 虫.

The question of whether NS or NC should be the better measurement of the structural complexity of Chinese characters was put forward as early as the 1950s by Du (1954), who himself made contradictory statements regarding the question of whether strokes or components should be the basic units of Chinese characters. It was not until the 1990s that Chinese linguists developed a systematized theory of the structural units of Chinese characters. Su (1994) proposed that, in order to analyze the structure of modern Chinese characters, scholars should set up a new theory, in which there are three levels in the structure of Chinese characters (stroke, component, and complete Chinese character), among which the component is the core. However, traditional Chinese linguists have not pointed


out which of the three units is more suitable for measuring the structural complexity of Chinese characters. In quantitative linguistics, Bohn (2002) proposed using NS to measure the structural complexity of Chinese characters, while Altmann (2004) proposed a universal procedure for measuring script complexity by assigning numerical values to different types of elementary units and connections and then adding up all the values of a given sign. Liu and Huang (2012) argued that, when it comes to measuring the structural complexity of Chinese characters, both Bohn's proposal and Altmann's proposal can serve certain research purposes. However, neither of the two methods provides an interpretation of the structural complexity of Chinese characters from the perspectives of linguistics and philology. Although Altmann's method is universally applicable to different scripts, it cannot describe the inner structure of Chinese characters as well as NS and NC do. Therefore, we used NS and NC as the measurements of structural complexity in this study. Moreover, our goals are twofold: to test our hypothesis that the relationship between the frequency and structural complexity of Chinese characters fits Zipf's law, and to gain insight into measuring the structural complexity of characters by comparing the results from the two measurements.
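To fix ideas, the two candidate measures can be expressed in a few lines of code. The component split and the per-component stroke counts below are given for the 蜜 example only and are standard dictionary values; in practice they would be looked up in the reference materials described in the next section.

    # Toy illustration of the two complexity measures for 蜜 mi 'honey':
    # its components are 宀, 必, and 虫, with 3, 5, and 6 strokes respectively.
    components = {"蜜": ["宀", "必", "虫"]}
    strokes = {"宀": 3, "必": 5, "虫": 6}

    def nc(char):
        # Structural complexity as the number of components.
        return len(components[char])

    def ns(char):
        # Structural complexity as the total number of strokes.
        return sum(strokes[c] for c in components[char])

    print(nc("蜜"), ns("蜜"))  # 3 14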

2 Data and Materials

Our data are derived from The Chinese Character Frequency Dictionary,1 which was compiled on the basis of the People's Daily, a famous newspaper in China. The size of the corpus is over 179 million Chinese characters. The dictionary contains the 3061 most frequent simplified Chinese characters and their frequencies in the corpus, which together account for 99.43% of the whole corpus. Other materials we used as standards to measure the complexity of the 3061 simplified Chinese characters include the Dictionary of Chinese Character Information,2 which presents basic information, namely the pinyin, number of strokes, and components of each given Chinese character, for commonly used simplified

1 The Chinese Character Frequency Dictionary was downloaded from the website of the International R & D Center for Chinese Education: http://nlp.blcu.edu.cn/downloads/download-resources/50.html (Retrieved 2013-1-30).
2 The Dictionary of Chinese Character Information was downloaded from the website of the International R & D Center for Chinese Education: http://nlp.blcu.edu.cn/downloads/download-resources/25.html (Retrieved 2013-1-30).


Chinese characters; the Table of Basic Components of Chinese Characters3, which lists 560 basic components of Chinese characters; and the Modern Commonly Used Standardized Chinese Dictionary (Zhang 2006), which also presents basic information for characters in the dictionary, namely the origin, simplification approach, pinyin, number of strokes, and components.

3 Experiments and Results

In his two books, The Psycho-Biology of Language (Zipf 1935) and Human Behavior and the Principle of Least Effort (Zipf 1949), George Kingsley Zipf proposed that the word frequency distribution follows a power law, which is now well known as Zipf's law. The law originally takes the form below.

f = C/r (1)

In this equation, 'f' and 'r' represent word frequency and the rank order of that frequency, respectively. 'C' is a constant based on the corpus in question. Later, to address a problem that some researchers had found, namely that observed word frequency distributions deviate slightly from the original Zipf distribution at both ends, a modified function was proposed (Mandelbrot 1953). Thus came the famous Zipf-Mandelbrot law below, which can describe empirical data in linguistics more accurately.

f = C(r + a)^(−b) (2)

Although Zipf's law was first discovered in linguistics, it has been proved applicable in many other scientific fields, such as bibliometrics, economics, sociology, and informatics. In linguistics, this law has been tested by many scholars and proved applicable to various languages, such as Korean (Choi 2000), Greek (Hatzigeorgiu et al. 2001), Spanish, French, Irish, and Latin (Ha et al. 2006), and Indian languages (Jayaram & Vidya 2008). Moreover, it was tested on the ancient language Meroitic, used in northern Sudan more than 15 centuries ago (Smith 2007). Zipf's law has also been tested on the Chinese language. Wang and colleagues (2009) used Dream of the Red Chamber (红楼梦), Selected Works of Mao Tse-tung (毛泽东选集) and Selected Works of Deng Xiaoping (邓小平文选) as their research corpus and proved the applicability of Zipf's distribution law

3 The Table of Basic Components of Chinese Characters was issued by the State Language Work Committee in 1997.


in Chinese. Guan and colleagues (1995) used Chinese word frequency statistics shared on the Internet and found that modern Chinese is in keeping with Zipf's law at the level of characters, words, and so on. Recently, Chen and colleagues (2012) investigated Chinese literature from the Tang Dynasty (A.D. 618–A.D. 907) to the present at both the word and the character level, and found that although Chinese has changed dramatically over time, Chinese word and character frequencies have always abided by the Zipf-Mandelbrot and Menzerath-Altmann laws. Unlike previous studies, which mainly focused on the word or complete-character level, we look into the inner structure of simplified Chinese characters. To test the hypothesis stated in the first section, we carried out linear fitting experiments using the equation of the Zipf-Mandelbrot law, as below, with 'f' being the frequency of Chinese characters and 'N' being the NS or NC of each given Chinese character.

f = C(N + a)^(−b) (3)

3.1 Experiments

The experiments were carried out in the following steps. First, we calculated the NS and NC of all 3061 simplified Chinese characters according to the materials discussed in Section 2. Second, we added up the frequencies of the Chinese characters that share the same NS or NC, and then calculated the average frequency by dividing this sum by the number of Chinese characters (NCC) sharing the same NS or NC. Third, we carried out linear fitting experiments using the data obtained in the first two steps. Finally, we examined the results of Step 3 and discussed whether the relationship between the structural complexity of simplified Chinese characters and their frequency fits Zipf's law, drawing conclusions thereby.
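The following sketch shows what Steps 2 and 3 might look like in code, assuming the 3061 (character, frequency, NS) triples have already been assembled from the dictionaries described in Section 2. The data values shown are illustrative placeholders, and scipy.optimize.curve_fit is our choice of fitting routine; the paper does not state which software was used.

    import numpy as np
    from collections import defaultdict
    from scipy.optimize import curve_fit

    # Hypothetical input: (character, relative frequency, NS) triples; the three
    # entries shown are placeholders standing in for all 3061 characters.
    data = [("一", 0.0072, 1), ("人", 0.0051, 2), ("大", 0.0016, 3)]

    # Step 2: average frequency over all characters sharing the same NS.
    by_ns = defaultdict(list)
    for _char, freq, ns in data:
        by_ns[ns].append(freq)
    ns_vals = np.array(sorted(by_ns), dtype=float)
    avg_freq = np.array([np.mean(by_ns[ns]) for ns in sorted(by_ns)])

    # Step 3: fit the Zipf-Mandelbrot law (3), f = C * (N + a)^(-b).
    def zipf_mandelbrot(N, C, a, b):
        return C * (N + a) ** (-b)

    # The bounds keep N + a positive for N >= 1 throughout the search.
    (C, a, b), _ = curve_fit(zipf_mandelbrot, ns_vals, avg_freq,
                             p0=(0.01, 0.0, 1.0),
                             bounds=([0.0, -0.99, 0.0], [1.0, 10.0, 5.0]))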

3.2 Results

Tab. 1 and 2 show the data from the Chinese characters that we used for the linear fitting experiments.


Tab. 1: Summary of the number of strokes of Chinese characters

NS | NCC | Examples | Sum of frequency | Average frequency
1 | 2 | 一 yi (the number 'one'), 乙 yi | 0.0104818889 | 0.0052409444
2 | 19 | 人 ren (human being), 了 le, 力 li | 0.0270823814 | 0.0014253885
3 | 51 | 大 da (big, old, etc.), 上 shang, 工 gong | 0.0535283278 | 0.0010495751
4 | 113 | 不 bu (no, not), 中 zhong, 为 wei | 0.0924586324 | 0.0008182180
5 | 145 | 发 fa (to give out), 业 ye, 民 min | 0.0989700908 | 0.0006825524
6 | 237 | 年 nian (year, annually), 有 you, 在 zai | 0.1514154083 | 0.0006388836
7 | 307 | 来 lai (to come), 这 zhe, 作 zuo | 0.1035224004 | 0.0003372065
8 | 379 | 国 guo (nation, country), 的 de, 和 he | 0.1611448260 | 0.0004251842
9 | 368 | 是 shi (to be), 要 yao, 政 zheng | 0.1006057580 | 0.0002733852
10 | 343 | 家 jia (family, relatives), 部 bu, 展 zhan | 0.0675444718 | 0.0001969227
11 | 290 | 基 ji (foundation), 得 de, 理 li | 0.0437326729 | 0.0001508023
12 | 276 | 提 ti (to lift, to propose), 等 deng, 就 jiu | 0.0366750136 | 0.0001328805
13 | 177 | 新 xin (new, innovative), 意 yi, 解 jie | 0.0215748566 | 0.0001218918
14 | 124 | 道 dao (road, rules, laws), 管 guan, 赛 sai | 0.0111661176 | 0.0000900493
15 | 100 | 德 de (morality, virtue), 增 zeng, 题 ti | 0.0084463093 | 0.0000844631
16 | 55 | 器 qi (appliances), 融 rong, 整 zheng | 0.0033965338 | 0.0000617552
17 | 33 | 藏 cang (to hide), 繁 fan, 疑 yi | 0.0012677566 | 0.0000384169
18 | 11 | 翻 fan (to turn over), 覆 fu, 藤 teng | 0.0002244518 | 0.0000204047
19 | 13 | 警 jing (alert, police), 疆 jiang, 攀 pan | 0.0005720697 | 0.0000440054
20 | 11 | 籍 ji, 灌 guan, 耀 yao (honor, to dazzle) | 0.0002580577 | 0.0000234598
21 | 4 | 露 lou, 霸 ba, 髓 sui (marrow) | 0.0001694056 | 0.0000423514
22 | 2 | 囊 nang, 镶 xiang (to inlay) | 0.0000231615 | 0.0000115808
23 | 1 | 罐 guan (pottery for containing things) | 0.0000151569 | 0.0000151569

Note: NS stands for the number of strokes; NCC stands for the number of Chinese characters sharing the same NS.

Tab. 2: Summary of the number of components of Chinese characters

NC | NCC | Examples | Sum of frequency | Average frequency
1 | 187 | 人 ren (human being), 一 yi, 中 zhong | 0.2008274347 | 0.0010739435
2 | 969 | 国 guo (nation, country), 的 de, 和 he | 0.4164280467 | 0.0004297503
3 | 1186 | 发 fa (to give out), 在 zai, 是 shi | 0.2922298614 | 0.0002463995
4 | 534 | 高 gao (high), 能 neng, 说 shuo | 0.0682191084 | 0.0001277511
5 | 150 | 题 ti (the subject, to sign), 领 ling, 解 jie | 0.0153812537 | 0.0001025417
6 | 32 | 歌 ge (to sing, song), 疑 yi, 衡 heng | 0.0010030980 | 0.0000313468
7 | 3 | 疆 jiang (border, limit), 凝 ning, 颤 chan | 0.0001869466 | 0.0000623155

Note: NC stands for the number of components and NCC stands for the number of Chinese characters sharing the same NC.

According to Tab. 1 and 2, NS ranges from 1 to 23 and NC from 1 to 7. The most complex Chinese character in terms of NS is 罐 guan (pottery used for containing things or drawing water), with 23 strokes. The most complex simplified Chinese characters in terms of NC are 疆 jiang (region, area, or border, limit, to draw boundaries), 凝 ning (to change from gas to liquid or from liquid to solid, to stay attentive and focused on something), and 颤 chan (to vibrate, the vibration of something), with 7 components each.

As shown in Fig. 1 and 2, the sum of the frequencies of Chinese characters first increases and then decreases as NS or NC increases, which seems to contradict our hypothesis that more complex characters are less frequent. This may be due to the fact that characters of different structural complexity are not evenly distributed: as shown in Fig. 3 and 4, as structural complexity increases, the number of characters follows a downward parabola. This tendency may override Zipf's law, so that the sum of the frequencies of Chinese characters also follows a downward parabola as NS or NC increases.

Fig. 1: The sum of frequency of Chinese characters that share the same NS


Fig. 2: The sum of frequency of Chinese characters that share the same NC

Fig. 3: The number of Chinese characters that share the same NS

Fig. 4: The number of Chinese characters that share the same NC


In order to reduce the influence of these downward parabolas, we used the average frequency of characters with the same NS or NC, instead of the sum of the frequencies, in the experiments. The fitting results for NS and NC are a = −0.5251, b = 1.0339, C = 0.0024 and a = 0.2626, b = 1.3949, C = 0.0015, respectively, with significant coefficients of determination R² = 0.9787 and R² = 0.9804. Both fitting results are satisfactory. The results are shown in Fig. 5 and 6.

Fig. 5: The relationship between NS and the average frequency of Chinese characters that share the same NS

Fig. 6: The relationship between NC and the average frequency of Chinese characters that share the same NC
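As a quick plausibility check (our own, not part of the original study), the reported NS parameters can be plugged back into equation (3):

    # Reported Zipf-Mandelbrot fit for NS: C = 0.0024, a = -0.5251, b = 1.0339.
    C, a, b = 0.0024, -0.5251, 1.0339
    for N in (1, 5, 10):
        print(N, C * (N + a) ** (-b))
    # N = 1 yields ~0.0052, very close to the observed average in Tab. 1 (0.0052);
    # N = 5 and N = 10 agree only to within roughly 20-30%, which is consistent
    # with R^2 = 0.9787 on these averaged data rather than a perfect fit.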


4 Discussion and Conclusion

As the results in the previous section show, the relationship between structural complexity and frequency abides by Zipf's law, with the structural complexity measured both by NS and by NC, and frequency being the average frequency of characters sharing the same structural complexity. This proves once again the universal applicability of Zipf's law.

As we mentioned in the first section, no consensus has yet been reached on the best measurement of the structural complexity of Chinese characters. In this study, we used two measurements, NS and NC. Both measurements turned out to fit Zipf's law with a significant coefficient of determination. Although the results still need to be tested with larger corpora or texts including more genres, according to our results both measurements are suitable for linguistic, and especially quantitative linguistic, studies. Although our study did not show a distinct advantage of the stroke measurement over the component measurement, taking NS as the structural complexity measurement of simplified Chinese characters still has clear advantages in linguistic studies. As stated by Altmann (2004), the complexity of things is not an inherent property of the things themselves, but a property of how people interpret their structure. Although both NS and NC can reflect the inner structure of Chinese characters as perceived by Chinese people, NS is a more practical measurement of complexity for two reasons. Firstly, the definition of a stroke is clear and noncontroversial, while the definition of a component is not. The definition of the stroke as the basic unit of the structure of Chinese characters, a continuous line people write in one motion, is commonly agreed on by linguists, whereas the definition of a component is quite controversial: Ban & Zhang (2004) listed as many as eight of the most representative definitions of component. Secondly, the standards for distinguishing a stroke are quite straightforward due to its clear definition, while the standards for splitting a character into components are not so explicit. Even the two most authoritative component standards issued by the Chinese State Language Committee are considered impractical by Wang & Huang (2013) and to have an unacceptably large number of components.

In conclusion, the relationship between the structural complexity of simplified Chinese characters, measured both by NS and by NC, and their frequency abides by Zipf's law. According to our research, both the number of strokes and the number of components are suitable measurements of the complexity of simplified Chinese characters. Future research still needs to address the issue of whether NS or NC is better for use in linguistic research.


References

Altmann, G. (2004). Script Complexity. Glottometrics, 8, 68–74.
Ban, J. Q., & Zhang, Y. J. [班吉庆, & 张亚军] (2004). Definitions of Chinese Character Component [汉字部件的定义]. Journal of Yangzhou University: Humanities and Social Sciences [扬州大学学报: 人文社会科学版], 8(4), 62–65.
Bohn, H. (2002). A Study on Chinese Writing and Language [Untersuchungen zur chinesischen Sprache und Schrift]. In R. Köhler (Ed.), Corpus Studies in Quantitative and Systems Theoretical Linguistics [Korpuslinguistische Untersuchungen zur Quantitativen und Systemtheoretischen Linguistik].
Bunke, H., & Wang, P. S. (Eds.). (1997). Handbook of Character Recognition and Document Image Analysis. Singapore: World Scientific.
Chen, Q., Guo, J., & Liu, Y. (2012). A Statistical Study on Chinese Word and Character Usage in Literatures from the Tang Dynasty to the Present. Journal of Quantitative Linguistics, 19(3), 232–248.
Choi, S. W. (2000). Some Statistical Properties and Zipf's Law in Korean Text Corpus. Journal of Quantitative Linguistics, 7(1), 19–30.
Du, D. Y. [杜定友] (1954). The Strange Organization of Square-Shaped Characters [方块字的怪组织]. Studies of the Chinese Language [中国语文], (12), p. 27.
Fu, Y. H. [傅永和] (1991). Components of Chinese Characters [汉字的部件]. Construction of Chinese Language [语文建设], (12), 12–16.
Guan, Y., Wang, X. L., & Zhang, K. [关毅, 王晓龙, & 张凯] (1995). Frequency-frequency Rank Relation of Language Unit in Computational Language Model of Modern Chinese [现代汉语计算语言模型中语言单位的频度-频级关系]. Journal of Chinese Information Processing [中文信息学报], 13(2).
Ha, L. Q., Stewart, D. W., Hanna, P. J., & Smith, F. J. (2006). Zipf and Type-token Rules for the English, Spanish, Irish and Latin Languages. Web Journal of Formal, Computational and Cognitive Linguistics, 1(8), 1–12.
Hatzigeorgiu, N., Mikros, G., & Carayannis, G. (2001). Word Length, Word Frequencies and Zipf's Law in the Greek Language. Journal of Quantitative Linguistics, 8(3), 175–185.
Jayaram, B. D., & Vidya, M. N. (2008). Zipf's Law for Indian Languages. Journal of Quantitative Linguistics, 15(4), 293–317.
Köhler, R. (1986). Synergetic Linguistics: Structure and Dynamics of Lexicon [Zur linguistischen Synergetik: Struktur und Dynamik der Lexik]. Bochum: Brockmeyer.
Köhler, R. (2005). Synergetic Linguistics. In R. Köhler, G. Altmann & R. G. Piotrowski (Eds.), Quantitative Linguistics. An International Handbook. Berlin: de Gruyter.
Liu, H. T., & Huang, W. [刘海涛, & 黄伟] (2012). Quantitative Linguistics: State of the Art, Theories and Methods [计量语言学的现状, 理论与方法]. Journal of Zhejiang University (Humanities and Social Sciences) [浙江大学学报 (人文社会科学版)], 42(2), 178–192.
Mandelbrot, B. (1953). An Informational Theory of the Statistical Structure of Language. In B. Jackson (Ed.), Communication Theory. Woburn, MA: Butterworth.
Packard, J. (Ed.). (2000). The Morphology of Chinese: A Linguistic and Cognitive Approach. Cambridge: Cambridge University Press.
Pan, D. F. [潘得孚] (2002). On the Systemization of the Split of Chinese Characters [论汉字拆分的系统性]. The Culture of Chinese Characters [汉字文化], (4).


Smith, R. D. (2007). Investigation of the Zipf-plot of the Extinct Meroitic Language. Glottometrics, 15, 53–61.
Stallings, W. (1975). The Morphology of Chinese Characters: A Survey of Models and Applications. Computers and the Humanities, 9(1), 13–24.
Su, P. C. [苏培成] (1994). Outline of Modern Chinese Characterology [现代汉字学纲要]. Peking: Peking University Press [北京: 北京大学出版社].
Wang, G. A. (Ed.). (2007). A Handbook for 1,000 Basic Chinese Characters. Hong Kong: The Chinese University Press.
Wang, L. (2011). Polysemy and Word Length in Chinese. Glottometrics, 22, 73–84.
Wang, L. (2014). Synergetic Studies on Some Properties of Lexical Structures in Chinese. Journal of Quantitative Linguistics, 21(2), 177–197.
Wang, D. P., & Huang, W. L. [王道平, & 黄文丽] (2013). Thoughts about Two Chinese Character Component Standards [关于两个汉字部件规范的一点思考]. Journal of Chinese Information Processing [中文信息学报], 27(2), 74–78.
Wang, Y., Liu, Y. F., & Chen, Q. H. [王洋, 刘宇凡, & 陈清华] (2009). Zipf's Word Frequency Distribution in Chinese Literature Works [汉语言文学作品中词频的Zipf分布]. Journal of Beijing Normal University (Natural Science) [北京师范大学学报(自然科学版)], 45(4), 424–427.
Zhang, W. Y. [张万有] (2006). The Modern Commonly Used Standardized Chinese Dictionary [现代常用汉字规范字典]. Xi'an: Shaanxi People's Education Press [西安: 陕西人民教育出版社].
Zipf, G. K. (1935). The Psycho-Biology of Language: An Introduction to Dynamic Philology. Cambridge, MA: M.I.T. Press.
Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Cambridge, MA: Addison-Wesley.

Aris Xanthos1 – Guillaume Guex2

On the Robust Measurement of Inflectional Diversity

Abstract: Lexical diversity measures are notoriously sensitive to variations of sample size, and recent approaches to this issue typically involve the computation of the average variety of lexical units in random subsamples of fixed size. This methodology has been further extended to measures of inflectional diversity such as the average number of wordforms per lexeme, also known as the mean size of paradigm (MSP) index. In this contribution we argue that, while random sampling can indeed be used to increase the robustness of inflectional diversity measures, using a fixed subsample size is only justified under the hypothesis that the corpora being compared have the same degree of lexematic diversity. In the more general case where they may have differing degrees of lexematic diversity, a more sophisticated strategy can and should be adopted. A novel approach to the measurement of inflectional diversity is proposed, aiming to cope not only with variations of sample size, but also with variations of lexematic diversity. The robustness of this new method is empirically assessed, and the results show that while there is still room for improvement, the proposed methodology considerably attenuates the impact of lexematic diversity discrepancies on the measurement of inflectional diversity.

Keywords: inflectional diversity, mean size of paradigm, MSP, RMSP, lexical diversity, robustness, random sampling

1 University of Lausanne; [email protected]
2 University of Lausanne; [email protected]

1 Introduction

1.1 Lexical Diversity, Sample Size, and Random Sampling

The measurement of lexical diversity is one of the most studied topics in quantitative linguistics. The basic ingredient of all diversity measures is variety, namely the number V of distinct lexical units in a text sample. It is well known that V is critically dependent on the number N of tokens in the sample, so that samples of differing sizes cannot be directly compared based on this index. Many studies


have tried to circumvent this issue by using instead the type-token ratio TTR := V/N. However, TTR is also dependent on N in a non-linear fashion, and the same holds for the various transforms of TTR that have been proposed by Guiraud (1954), Herdan (1960), and several others (see e.g. Tweedie & Baayen 1998 and references cited therein).

Many recent approaches to diversity measurement rely on a different way of compensating for sample size variations, based on an idea formulated seventy years ago by Johnson (1944): computing and reporting the average TTR (or, equivalently, variety) in a number of fixed-size subsamples drawn from the sample under consideration. In Johnson's original proposal (sometimes called mean segmental TTR), subsamples are defined as contiguous, non-overlapping sequences of lsub tokens (1 ≤ lsub ≤ N). Consequently, the number nsub of subsamples is determined by the integer division ⌊N/lsub⌋. Furthermore, when lsub is not a factor of N, adopting this sampling scheme implies discarding a "residue" of at most lsub − 1 tokens.

The constraint that subsamples should consist of contiguous tokens has usually been relaxed in later studies, as illustrated by Dubrocard (1988), where the N tokens composing the sample are randomly assigned to the subsamples, regardless of their position in the text. Malvern & Richards (1997) have further advocated a sampling procedure where each subsample is built by drawing tokens without replacement in the text—similarly to Johnson's or Dubrocard's method—but a given token may occur in any number of subsamples (including 0). The consequence of this change in design is that the number nsub of subsamples becomes an actual parameter, whose value may be set to an arbitrarily large number, irrespective of subsample size lsub.

Malvern & Richards (1997) proceed with the specification of a sophisticated approach that has become the current de facto standard for measuring lexical diversity. This approach, called VOCD, relies on the calculation of the average TTR in subsamples of increasing size (35, 36, … 50 tokens), in order to build a so-called "empirical" TTR curve. A curve-fitting procedure is then applied to find the "theoretical" curve which matches the empirical one most closely, among a family of curves generated by the variation of a single parameter in a mathematical model of the relationship between sample size and TTR. The parameter value generating the curve with the best fit is eventually reported as the measured diversity.

The usefulness of the VOCD algorithm has been seriously challenged in a recent contribution by McCarthy & Jarvis (2007). These authors convincingly argue (i) that the curve-fitting procedure underlying VOCD has no other use than smoothing the fluctuations induced by random sampling; and (ii) that a better way of achieving this effect is to calculate analytically the expected TTR in all possible subsamples of a given size—a calculation whose details (based on the


hypergeometric law) have already been specified by Serant (1988), and were essentially ignored for the next two decades.
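Point (ii) is easy to make concrete. Under sampling without replacement, a type with frequency f is absent from a subsample of n tokens with probability C(N−f, n)/C(N, n), so the expected variety can be computed exactly. The sketch below is our own formulation of this standard hypergeometric result, not a reproduction of Serant's exposition:

    from math import comb

    def expected_variety(freqs, n):
        # Expected number of distinct types in a random subsample of n tokens,
        # drawn without replacement from a sample with type frequencies `freqs`.
        N = sum(freqs)
        return sum(1 - comb(N - f, n) / comb(N, n) for f in freqs)

    freqs = [10, 5, 3, 1, 1]                    # toy frequency spectrum, N = 20
    print(expected_variety(freqs, 10))          # expected variety at n = 10
    print(expected_variety(freqs, 10) / 10)     # the corresponding expected TTR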

1.2 Inflectional Diversity

The notion of inflectional diversity relies on the distinction between inflected wordforms, or simply forms (such as walk, walked, and walking), and lexemes or lemmas, i.e. the abstract lexical categories to which related wordforms belong (such as the verb conventionally referred to using the infinitive to walk). In what follows, we will denote the number of distinct wordforms in a sample by F, and we will call this number the sample's wordform variety. Similarly, the number of distinct lexemes will be denoted by L and called lexematic variety. Both quantities capture distinct but interrelated aspects of lexical diversity.

The measurement of inflectional diversity has a much shorter history than that of its lexical counterpart. In particular, many studies have simply used the average number of wordforms per lexeme, also known as the mean size of paradigm1 (see Xanthos & Gillis 2010 and references cited therein), defined as MSP := F/L, i.e. the ratio of wordform variety to lexematic variety. However, being a type/type ratio, MSP is easily shown to inherit its components' dependence on sample size. As such, it cannot be used either for directly comparing samples of differing sizes.

To the best of our knowledge, there have been only two proposals for the measurement of inflectional diversity that explicitly take into account the issue of dependence on sample size. The first is based on VOCD (see Section 1.1) and due to Malvern, Richards, Chipere, & Durán (2004). Based on the observation that VOCD consistently returns slightly lesser values when applied to lexemes than to wordforms, Malvern and colleagues propose to use the difference between these two indices as a measure of inflectional diversity (which they call ID). Xanthos & Gillis have argued that in spite of its promises, this measure suffers from several shortcomings, chief among which are that "the unit in which ID is expressed has no meaningful interpretation" (2010: 179) and that

in the context of an increase in lexical diversity..., ID is liable to detect spurious increases in inflectional diversity—increases that are mere side-effects of the subtractive definition of the measure (2010: 180).

1 A lexeme’s paradigm is the set of wordforms belonging to this lexeme.


On these grounds, Xanthos & Gillis have put forward an alternative measure which is easier to compute and, arguably, to interpret. Building on the idea of using random sampling to deal with the dependence on sample size, they define the normalized MSP as the average MSP computed in nsub subsamples of lsub tokens drawn randomly from the original sample.2 They provide empirical evidence showing that using random sampling significantly increases the measurement's robustness with regard to variations of sample size, while preserving its ability to detect variations of inflectional diversity.

It should be noted that, as far as we know, the problem of analytically calculating the expected MSP in all possible subsamples of a given size has not yet been solved. Our own preliminary investigations have given us no reason to believe that it has a solution as simple and elegant as what Serant (1988) has offered for lexical variety.
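A minimal sketch of normalized MSP as just defined, assuming each token is represented as a (lexeme, wordform) pair; the names and data layout are ours:

    import random

    def msp(tokens):
        # Raw MSP: wordform variety F over lexematic variety L. Counting types
        # as (lexeme, wordform) pairs keeps homophonous forms of different
        # lexemes distinct.
        forms = set(tokens)
        lexemes = {lex for lex, _ in tokens}
        return len(forms) / len(lexemes)

    def nmsp(tokens, lsub, nsub=1000):
        # Average MSP over nsub subsamples of lsub tokens, each drawn without
        # replacement; a token may still occur in any number of subsamples.
        return sum(msp(random.sample(tokens, lsub)) for _ in range(nsub)) / nsub

    tokens = [("walk", "walk"), ("walk", "walked"), ("walk", "walking"),
              ("dog", "dog"), ("dog", "dogs"), ("dog", "dog")]
    print(msp(tokens))           # 5 wordform types / 2 lexemes = 2.5
    print(nmsp(tokens, 4, 200))  # normalized MSP for subsamples of 4 tokens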

1.3 Normalized MSP and Lexematic Diversity

While normalized MSP, as defined above, appears to be robust with regard to sample size variations, the same does not hold for variations of lexical diversity. Xanthos & Gillis briefly touch upon the issue of the relation between normalized MSP and lexematic diversity:

[…] given that sample size remains constant, any increase in the diversity of lemmas is matched by a corresponding decrease in the average frequency of lemmas. As more distinct lemmas occur, each of them has less frequent occurrences, which means less space for deploying the variety of its inflected wordforms. Rarer inflections are thus less likely to appear in the sample, and on average a lemma will tend to have a smaller number of distinct wordforms. Overall, a decrease in inflectional diversity should occur as a result of the increase in lexical diversity (2010: 179).

In the present contribution, we wish to take this line of reasoning one step further and argue that a sound measure of inflectional diversity should not only be robust with regard to variations of sample size but also with regard to variations of lexematic diversity. Indeed, if normalized MSP reports spurious decreases in inflectional diversity when lexematic diversity increases, it does not fare any better than ID and its own spurious increases (cf. Section 1.2).

2 The same sampling scheme as Malvern & Richards (1997) is used (cf. Section 1.1).


The first contribution of this study is to introduce an algorithm for computing MSP in such a fashion that variations in both sample size and degree of lexical diversity are taken into account and compensated for; we optimistically propose to call the resulting measure of inflectional diversity robust MSP, or RMSP. Secondly, we offer an empirical assessment of the extent to which this new index is less dependent on variations of lexematic diversity than standard normalized MSP (which will henceforth be abbreviated as NMSP). To that effect, we describe a presumably novel method for generating artificial text samples using a probabilistic model whose degree of lexematic diversity can be controlled without modifying its degree of inflectional diversity.

The remainder of this contribution is organized as follows. The next section begins with the justification and specification of the algorithm used for computing the new RMSP index. Then we describe the method that we have designed for controlling the degree of lexematic diversity of artificially generated text samples. We proceed with the description of our experimental setup, including the source data used for our experiments and the way in which they are preprocessed. In Section 3, we show the results obtained by NMSP and RMSP, focusing in particular on their relative dependence on lexematic variety. These results are then discussed in Section 4, and our main findings are briefly summarized in Section 5.

2 Method

2.1 The RMSP Algorithm

The normalized MSP (NMSP) algorithm attempts to compensate for the dependence of MSP on sample size. It takes as input a set of text samples and computes for each sample the average MSP over nsub subsamples of size lsub. The main constraint is that lsub must be set to a fixed value less than or equal to the size l of the smallest sample in the dataset (Xanthos & Gillis 2010). Normalized versions of lexematic (or wordform) variety (or TTR) can be calculated in the same way, which will be exploited shortly for computing the robust MSP (RMSP) index.

The RMSP algorithm can be thought of as a variant of NMSP where a second layer of normalization is added, in order to compensate not only for the dependence of MSP on sample size, but also on lexematic diversity. Indeed, as noted in Section 1.3 above, setting the size of subsamples to a fixed value leads to an underestimation of MSP in samples that have a greater degree of lexematic diversity. In these samples, each lexeme type will have fewer occurrences on average, which in turn


means that it will tend to have fewer distinct inflected forms—a faithful scale model of the dependency of variety on sample size.

The basic idea underlying the RMSP algorithm is to counterbalance this underestimation issue by adjusting the subsample size lsub separately for each sample, in such a fashion that samples with a smaller degree of lexematic diversity (relative to other samples in the dataset) are assigned a smaller subsample size. In particular, the algorithm attempts to find, for each sample, the subsample size that ensures that lexemes have the same number of tokens on average in all subsamples of all samples; in other words, it seeks to minimize the variance of average lexeme frequency or, equivalently, of its reciprocal, lexematic TTR.

To that effect, a maximal subsample size lmax is first chosen, with the constraint that it must be less than or equal to the size l of the smallest sample in the set. Then the normalized lexematic TTR (henceforth NLTTR) of each sample is computed with a fixed subsample size of lmax tokens. The maximal NLTTR value obtained this way determines the target value (NLTTRtarget) that the algorithm consequently tries to reach for each (other) sample in the dataset. In particular, for each sample, the algorithm searches for the subsample size 2 ≤ lsub ≤ lmax that is optimal in the sense that the resulting NLTTR value is as close as possible to NLTTRtarget. Finally, the NMSP of this sample is computed with the optimal subsample size lsub that has just been found, and the result is reported as the value of the RMSP index for this sample. The algorithm can be described more formally as in Fig. 1 below.

RMSP algorithm
Input:  set S of text samples with size at least l;
        maximum subsample size lmax ≤ l
Output: RMSP(s, S, lmax) value for each sample s ∈ S

– NLTTRtarget ← max over s ∈ S of NLTTR(s, lmax)
– for each s ∈ S do:
  – llow ← 2, lhigh ← lmax
  – lsub ← lhigh
  – while NLTTR(s, lsub) ≠ NLTTRtarget and llow ≠ lhigh do:
    – lsub ← integer((llow + lhigh) / 2)
    – if NLTTR(s, lsub) < NLTTRtarget, set lhigh to lsub
    – else if NLTTR(s, lsub) > NLTTRtarget, set llow to lsub
  – RMSP(s, S, lmax) ← NMSP(s, lsub)

Fig. 1: Algorithm for robust MSP (RMSP) computation
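Reusing random, msp, and nmsp from the sketch in Section 1.2, Fig. 1 can be implemented along the following lines. This is our reading of the pseudocode: since NLTTR is estimated by random sampling, exact equality with the target rarely occurs, and the binary search is tightened slightly (advancing llow past the midpoint) so that termination is guaranteed.

    def nlttr(tokens, lsub, nsub=1000):
        # Normalized lexematic TTR: average lexeme variety per token over
        # nsub random subsamples of lsub tokens.
        total = 0.0
        for _ in range(nsub):
            sub = random.sample(tokens, lsub)
            total += len({lex for lex, _ in sub}) / lsub
        return total / nsub

    def rmsp(samples, lmax, nsub=1000):
        # For each sample, tune the subsample size so that its NLTTR approaches
        # the maximal NLTTR observed at lmax, then report NMSP at that size.
        target = max(nlttr(s, lmax, nsub) for s in samples)
        values = []
        for s in samples:
            low, high = 2, lmax
            lsub = high
            while low < high:
                lsub = (low + high) // 2
                val = nlttr(s, lsub, nsub)
                if val < target:
                    high = lsub        # NLTTR too low: try smaller subsamples
                elif val > target:
                    low = lsub + 1     # NLTTR too high: try larger subsamples
                else:
                    break
            values.append(nmsp(s, lsub, nsub))
        return values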

The following difference between NMSP and RMSP should be stressed. The NMSP value computed for a given sample depends only on the chosen subsample size lsub, so that it can be directly compared with any other NMSP value obtained with


the same subsample size. By contrast, the RMSP value of a sample depends not only on the maximum subsample size lmax but also on the set of samples with which this sample is compared—or to be precise, on the maximal NLTTR value obtained with a sample of this set for subsample size lmax (NLTTRtarget). Consequently, in order to compare this RMSP value with that of a new sample, the following conditions must be met: (i) the new sample must be of size at least lmax and (ii) its NLTTR for subsample size lmax must be at most NLTTRtarget; if so, the new sample can be processed separately in the same way as each sample of the original dataset. Otherwise, the algorithm must be run again on the entire dataset consisting of the new sample and the old one(s) with which it should be compared.

2.2 Sample Generation

In order to evaluate the gain in robustness brought about by the RMSP algorithm, we have designed a method for generating artificial text samples whose degree of lexematic diversity can be controlled without altering their degree of inflectional diversity. This method relies on an L × F contingency table, where each row corresponds to a lexeme type, each column corresponds to a wordform type, and each cell gives the count of a pair (lexeme, wordform).3 Normalizing over the table's grand total yields a joint probability model that can be used to generate a text sample of size l by drawing l pairs (lexeme, wordform) with replacement. In what follows, it will be useful to refer to L, F, and F/L as the model's theoretical lexematic variety, wordform variety, and MSP, respectively.

The model's theoretical lexematic variety can be reduced by aggregating two lexeme types (rows) in the contingency table. Let f and g be the wordform frequency distributions of any two lexemes, ordered by decreasing frequency. By placing the additional constraint that f and g be proportional, we ensure that the aggregated lexeme, defined as the vector sum of f and g, is also proportional to f and g.

In order to substantially decrease the lexematic variety L of the model, we perform nagg > 1 aggregations at a time. Now, given that L will be reduced by nagg after nagg aggregations, in order for the theoretical MSP to remain constant, the theoretical wordform variety F should be decreased by nagg · MSP = nagg(MSP − 1) + nagg. The first wordform type of all aggregated lexeme types will contribute to the reduction of F by nagg, so the number of wordform types minus 1 in the aggregated lexeme types should be nagg(MSP − 1). This can be achieved as follows. First, randomly

3 In practice, these counts are typically derived from an existing text, as described in the next section.


pick lexeme types among those that have more than one wordform type,4 until the wordform "surplus" (i.e. the number of wordform types in the selected lexemes minus the number of selected lexemes) reaches nagg(MSP − 1); then, complete the nagg aggregations by randomly selecting lexeme types among those that have only one wordform type. We call the process of performing nagg lexeme aggregations as described above an aggregation round. After an aggregation round, the modified contingency table can be normalized to build a new joint probability model, which in turn can be used to generate new samples. The process can be repeated as long as there remain enough lexeme types with proportional wordform distributions to aggregate.
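A sketch of the generation scheme with NumPy (our own naming; lexemes and wordforms are represented by integer indices into the contingency table):

    import numpy as np

    def generate_sample(counts, size, rng=None):
        # Draw `size` (lexeme, wordform) pairs with replacement from the joint
        # probability model obtained by normalizing the L x F table `counts`.
        rng = rng or np.random.default_rng()
        probs = counts.flatten() / counts.sum()
        idx = rng.choice(counts.size, size=size, p=probs)
        lex, wf = np.unravel_index(idx, counts.shape)
        return list(zip(lex.tolist(), wf.tolist()))

    # Toy table: rows are lexemes, columns are wordforms. Rows 0 and 1 have
    # proportional wordform distributions, so summing them yields an aggregated
    # lexeme that is still proportional to both.
    counts = np.array([[6.0, 3.0, 0.0],
                       [2.0, 1.0, 0.0],
                       [0.0, 0.0, 4.0]])
    sample = generate_sample(counts, size=500)
    aggregated = np.vstack([counts[0] + counts[1], counts[2]])  # one aggregation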

2.3 Experimental Design

As described in the previous section, our empirical assessment of NMSP and RMSP is based on a probabilistic mechanism for sample generation. The parameters of this mechanism could in principle be themselves generated according to some theoretical model. However, we have rather chosen to estimate them on the basis of natural language data, in order to preserve some degree of resemblance between our experimental design and the "naturalistic" conditions in which the measurement of inflectional diversity is likely to take place.

The data in question are taken from the Project Gutenberg eBook of Eduard Bernstein's Sozialismus einst und jetzt (2008). A German text was chosen on the grounds that its degree of inflectional diversity would in principle be relatively high (at least when compared to English, whose inflection is quite limited), so that there would actually be something to measure for our indices. For the same reason, we decided to focus exclusively on the subsystem of verb inflection in this corpus. Bernstein's text was automatically tokenized, lemmatized, and annotated with part-of-speech (POS) tags using TreeTagger (Schmid 1994). Orange Textable (Xanthos 2014) was then used to parse the output of TreeTagger and discard all tokens but verbs. The result is a list of N = 8106 verb tokens corresponding to F = 2012 wordform types and L = 1078 lexeme types, hence a (raw) MSP of 1.87 forms per lexeme. (Note that homophonous wordforms belonging to different lexemes are treated as distinct wordform types, e.g. gehabt, which can be the past participle of haben 'to have' or gehaben 'to behave'.)

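The resulting raw MSP can be computed directly from the extracted tokens. A minimal sketch, assuming the verb tokens are available as (lemma, wordform) pairs (names are ours):

def raw_msp(tokens):
    # tokens: list of (lemma, wordform) pairs, one per verb token.
    # Wordform types are counted as (lemma, wordform) pairs, so that
    # homophonous forms of distinct lexemes remain distinct (cf. gehabt).
    lexeme_types = {lemma for lemma, _ in tokens}
    wordform_types = set(tokens)
    return len(wordform_types) / len(lexeme_types)

With the 8106 verb tokens of Bernstein's text, this yields 2012 / 1078 ≈ 1.87 forms per lexeme.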


Five rounds of 50 lexeme aggregations were made, preserving the theoretical MSP. At each step (starting with no aggregation), 100 text samples of size 500, 1000, 1500, 2000, and 2500 were produced, for a total of 3000 text samples. NMSP was computed with nsub = 1000 subsamples of size lsub = 100, 200, 300, and 400. RMSP was computed with nsub = 1000 subsamples and maximum size lmax = 100, 200, 300, and 400.

3 Results As shown in Fig. 2, while lexeme aggregation reduces the model’s theoretical lexematic and wordform variety by more than 20% (from 1078 to 828 lexeme types and from 2012 to 1540 wordform types), it causes only a slight decrease in theoretical MSP (from 1.866 to 1.860, i.e. less than 0.5%).

Fig. 2 – Left: Theoretical values of lexeme (dashed) and wordform (solid) types vs aggregation rounds. Right: Theoretical MSP vs aggregation rounds

Fig. 3 confirms the impact of sample size on the lexematic and inflectional diversity of generated samples, as measured by their raw (i.e. not normalized) lexematic TTR and MSP. More importantly, the figure shows that lexeme aggregation influences both the lexematic TTR and the MSP of generated samples. In particular, the latter increases as the former decreases, notably for larger sample sizes: the MSP increase ranges between 3% for samples of 500 tokens and 7.6% for samples of 2500 tokens. One should not be surprised that the raw MSP increases with aggregation rounds although the theoretical MSP remains approximately constant; indeed, the predicted effect of lexeme aggregation on the average MSP of samples of fixed size is exactly the same as the predicted effect of lexeme aggregation on NMSP for a given subsample size.


Fig. 3 – Left: Raw lexematic TTR vs aggregation rounds. Right: Raw MSP vs aggregation rounds. On both figures, light to dark represents samples from size 500 to 2500

The normalization performed by the NMSP and RMSP algorithms effectively lessens the dependence of diversity measurement on sample size, as indicated by the overlap of curves in Fig. 4 (obtained with lsub, lmax = 100). The figure also shows that the reported RMSP is systematically lower than the corresponding NMSP. Finally, it can be seen that both measures are affected by lexeme aggregation, although not to the same extent.

Fig. 4: NMSP (dashed) and RMSP (solid) vs aggregation rounds (lsub, lmax = 100). Light to dark represents samples from size 500 to 2500.

Fig. 5 shows the behavior of NMSP and RMSP for lsub, lmax = 100, 200, 300, and 400 tokens (aggregating the results observed for all sample sizes). While both measures increase with lexeme aggregations for all values of lsub and lmax, the increase is consistently smaller for RMSP than for NMSP. The visual impression is confirmed by the results of Spearman's correlation test assessing the degree of dependence of NMSP and RMSP on the number of aggregation rounds. With the exception of RMSP with lmax = 100, both diversity measures always have a significant correlation with the number of aggregation rounds (cf. Tab. 1). However, the correlation itself is consistently smaller for RMSP.

Fig. 5: NMSP (dashed) and RMSP (solid) vs aggregation rounds for different subsample sizes. Results for different sample lengths have been aggregated

Tab. 1: Spearman's correlation between aggregation rounds and NMSP/RMSP

Subsample size    NMSP              RMSP
100               0.090 (p≈9e-7)    0.190 (p≈0)
200               0.318 (p≈0)       0.179 (p≈0)
300               0.414 (p≈0)       0.241 (p≈0)
400               0.484 (p≈0)       0.285 (p≈0)
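The reported test can be reproduced with standard tools; the following sketch uses SciPy on made-up illustrative data (the actual per-sample values are not published here):

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
rounds = np.repeat(np.arange(6), 100)   # 6 aggregation steps x 100 samples
values = 1.5 + 0.005 * rounds + rng.normal(0, 0.05, rounds.size)

rho, p = spearmanr(rounds, values)      # H0: no monotonic dependence
print(f"rho = {rho:.3f}, p = {p:.2g}")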


4 Discussion and Conclusion

In this contribution, we have argued that while the resampling scheme underlying the normalized MSP (NMSP) measure of inflectional diversity proposed by Xanthos & Gillis (2010) effectively reduces the dependence of the measure on sample size, a more sophisticated approach is needed when dealing with data samples whose degree of lexematic diversity is heterogeneous. We have introduced a novel algorithm called robust MSP (RMSP), which relies on the idea that what should be normalized is not merely the number of tokens per subsample, but the number of tokens per lexeme in subsamples. To that effect, rather than setting a fixed subsample size for all samples in the considered dataset, the RMSP approach sets the size of subsamples separately for each sample, in such a fashion that the variance of average lexeme frequency over all subsamples is minimized.

In order to evaluate the gain in robustness brought about by the RMSP algorithm, we have developed a method for generating artificial text samples (based on lexeme and wordform frequencies observed in a real text) whose degree of lexematic diversity can be controlled without altering their degree of inflectional diversity. These data have enabled us to show that raw MSP is dependent not only on sample size, but also on variations of lexematic diversity. Applying the NMSP algorithm to the generated samples confirms that while it is much less dependent on sample size than raw MSP, it is also affected by variations of lexematic diversity. Finally, although RMSP is also dependent on lexematic diversity, it proves more robust than NMSP with regard to lexematic diversity fluctuations.

When the samples under consideration are homogeneous from the point of view of their degree of lexematic diversity, RMSP essentially reduces to NMSP (with a slight computational overhead). Otherwise, the RMSP algorithm attempts to compensate for lexematic diversity fluctuations by discarding (through resampling) even more tokens than the standard NMSP algorithm. All other things being equal, discarding more tokens means discarding more types, which explains why the reported values of RMSP are typically lower than those of NMSP. Thus, while RMSP is in principle more widely applicable than NMSP (since it can handle data that display variations of lexematic diversity), it also gets closer to the extreme and absurd case where diversity is evaluated on the basis of a single token. A priority for future research will be to determine the conditions under which the RMSP approach might lead to an information loss so severe that it ultimately fails to provide a meaningful evaluation of inflectional diversity.


References

Bernstein, E. (2008). Der Sozialismus einst und jetzt. Streitfragen des Sozialismus in Vergangenheit und Gegenwart. N. H. Langkau & I. Knoll (Eds.). Project Gutenberg. Retrieved December 4, 2011, from http://www.gutenberg.org/files/24523/24523-8.txt
Dubrocard, M. (1988). Evaluation de l'étendue du lexique. Quelques essais de simulation. In P. Thoiron, D. Labbe & D. Serant (Eds.), Etudes sur la richesse et la structure lexicale. Vocabulary Structure and Lexical Richness (pp. 43–66). Paris/Genève: Champion/Slatkine.
Guiraud, H. (1954). Les Caractères Statistiques du Vocabulaire. Paris: Presses Universitaires de France.
Herdan, G. (1960). Type-Token Mathematics: A Handbook of Mathematical Linguistics. The Hague: Mouton & Co.
Johnson, W. (1944). Studies in Language Behaviour: I. A Program Approach. Psychological Monographs, 56, 1–15.
Malvern, D., & Richards, B. (1997). A New Measure of Lexical Diversity. In A. Ryan & A. Wray (Eds.), Evolving Models of Language (pp. 58–71). Clevedon, UK: Multilingual Matters.
Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical Diversity and Language Development: Quantification and Assessment. Basingstoke: Palgrave MacMillan.
McCarthy, P. M., & Jarvis, S. (2007). vocd: A Theoretical and Empirical Evaluation. Language Testing, 24(4), 459–488.
Serant, D. (1988). A propos des modèles de raccourcissements de textes. In P. Thoiron, D. Labbe & D. Serant (Eds.), Etudes sur la richesse et la structure lexicale. Vocabulary Structure and Lexical Richness (pp. 43–66). Paris/Genève: Champion/Slatkine.
Schmid, H. (1994). Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing.
Tweedie, F. J., & Baayen, R. H. (1998). How Variable May a Constant Be? Measures of Lexical Richness in Perspective. Computers and the Humanities, 32(5), 323–352.
Xanthos, A. (2014). Textable: programmation visuelle pour l'analyse de données textuelles. In Actes des 12èmes Journées internationales d'analyse statistique des données textuelles (JADT 2014) (pp. 691–703).
Xanthos, A., & Gillis, S. (2010). Quantifying the Development of Inflectional Diversity. First Language, 30(2), 175–198.

Makoto Yamazaki

The influence of Word Unit and Sentence Length on the Ratio of Parts of Speech in Japanese Texts

Abstract: This study uses data from a large-scale corpus to provide a supplementary examination of the degree to which the prior research that established Kabashima's law concerning the ratio of parts of speech still applies. Moreover, it uses a written language corpus to demonstrate how the ratio of parts of speech changes depending on sentence length and on the type of word unit used to measure texts. As a result, the following four points were identified. 1) Kabashima's law does not apply particularly well when there is a high ratio of nouns. 2) When the word units are different, the ratio of parts of speech changes. 3) The ratio of parts of speech changes with sentence length, and this change takes a variety of forms depending on the register of the text. 4) Characteristics of texts that had not previously been visible became clear when particles and auxiliary verbs were included in the ratio of parts of speech.

Keywords: ratio of parts of speech, Kabashima's law, word unit, sentence length, Japanese, corpus

1 Introduction

Quantitative research on the ratio of parts of speech in Japanese texts has examined the distribution of vocabulary, and it has been an important research theme since the 1950s. However, there has not been any significant progress since the initial research conducted by Ohno (1956) and Kabashima (1954, 1955). The lack of progress between the description of the ratio of parts of speech by Mizutani (1983) and the description in the Keiryo Kokugogaku Jiten (Encyclopedia of Japanese Quantitative Linguistics) (2009: 95−96) shows the stagnation of research in this field.

Problems concerning the ratio of parts of speech can be divided into two types: cases where measurements are made based on the number of tokens and cases where they are made based on the number of types. Ohno (1956) conducted his analysis according to the number of types, and Kabashima (1954, 1955) conducted his based on the number of tokens. Each of them discovered a law that governs the ratio of parts of speech, and their laws are called Ohno's law and Kabashima's law, respectively. Ohno's law and Kabashima's law differ in that the former is concerned with Classical Japanese whereas the latter is concerned with Modern Japanese. However, both of these laws have demonstrated that the ratios of parts of speech differ depending on the type of text (register), and that they have certain tendencies.

This study has conducted a supplementary examination of Kabashima's law using a large-scale corpus; in addition, it has studied the previously insufficiently examined question of how the type of word unit and sentence length are related to the ratio of parts of speech. Japanese sentence length is frequently measured by the number of characters, but this study measures it using the word units that are used in word counts and corpus annotation. In concrete terms, these word units refer to the short unit word and the long unit word that are used in the Balanced Corpus of Contemporary Written Japanese (hereafter abbreviated as BCCWJ). This study investigates how the difference between these two types of word units changes the ratio of parts of speech. In addition, it reports results that include particles and auxiliary verbs (parts of speech that were not included in Kabashima's research).

2 Prior Research

2.1 Kabashima's Law

Kabashima's research (1954, 1955) showed that the ratio of parts of speech changes depending on the genre of text but does not change depending on sentence length. However, Kabashima (1955) also stated that there was a positive correlation between the ratio of nouns and sentence length. Tab. 1 shows the data that Kabashima (1955) used in his study. Kabashima formalized Tab. 1 and derived the following relational formulas:

(1)  a. Ad = 45.67 − 0.60 N
     b. log I = 11.57 − 6.5 log N
     c. V = 100 − (N + Ad + I)

These relationships arise when the ratios of N, V, Ad, and I are compared; the letters represent the noun group, verb group, adjective and adverb group, and conjunction and interjection group, respectively. A graphical representation of this relationship is shown in Fig. 1.
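By way of illustration, formulas (1a)–(1c) can be evaluated as follows (the sketch is ours; reading log as log10 is an assumption, but it is consistent with the magnitudes in Tab. 1):

import math

def kabashima(n):
    # Estimate the ratios of the remaining part-of-speech groups from the
    # noun-group ratio N (all quantities are percentages).
    ad = 45.67 - 0.60 * n                     # (1a)
    i = 10 ** (11.57 - 6.5 * math.log10(n))   # (1b)
    v = 100 - (n + ad + i)                    # (1c)
    return ad, i, v

print(kabashima(65.6))  # newspapers: Ad is estimated at about 6.3

Note that (1a) turns negative for N above roughly 76, a defect taken up in Section 4.1 below.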


Tab. 1: Ratios of parts of speech

Text Genre                 N     V     Ad    I
Ordinary conversation      41.6  26.3  20.1  12.0
Conversation in novels     45.2  31.4  18.0   5.4
Philosophy books           48.6  31.4  17.0   3.0
Narrative part of novels   49.4  32.3  15.4   2.9
Science books              52.6  30.8  14.5   2.0
Tanka*                     54.4  31.7  13.7   0.2
Haiku**                    60.2  29.4  10.4   0.0
Newspapers                 65.6  28.6   5.0   0.8

N: Noun group, V: Verb group, Ad: Adjective and Adverb group, I: Interjection and Conjunction group
*A tanka is a Japanese short poem with 31 syllables.
**A haiku is a Japanese short poem with 17 syllables.

Fig. 1: Kabashima’s law (1955:56)

2.2 Problematic Points of Kabashima's Law

Kabashima's study suffers from three problematic points. The first is that it is not clear what type of word unit was used in his series of studies. This was likely due to the fact that the concept of word units was not recognized at that time. The second is that the relationship between sentence length and the ratio of parts of speech was not elucidated for anything but the ratio of nouns. The third is that when he conducted his survey, he excluded particles and auxiliary verbs. These are so-called functional words, and they make up a large proportion of the number of tokens in Japanese text. In other words, Kabashima's analysis only targeted content words, and it cannot be said that he handled the entirety of the text in a strict sense. Taking these problematic points into account, this study has posed and concretely examined the following four research questions:

(2)  a. What happens to Kabashima's law when it is examined using large-scale corpus data?
     b. Is the ratio of parts of speech also different when the word units differ?
     c. What type of relationship is there between sentence length and the ratio of parts of speech?
     d. What would the ratio of parts of speech be like when auxiliary verbs and particles are included?

2.3 Parts of Speech and Word Units in Japanese

The parts of speech that are typically used in Japanese are noun, pronoun, verb, i-adjective, na-adjective, adnominal, adverb, conjunction, interjection, particle, auxiliary verb, prefix, and suffix. This study also uses these parts of speech. However, when conducting the supplementary examination of Kabashima's law, these parts of speech are used after grouping.

In Japanese, it is not customary to use spaces as word boundaries, and thus the concept of a word is ambiguous. Therefore, since the 1950s, artificial word units have been invented for use in word counts. The first word unit was called an α-unit. Since then, every time a word count has been conducted, different word units have been created. Currently, there are two primary types of word unit that are used in areas such as corpus annotation. These two types are the short unit word and the long unit word (hereafter referred to as SUW and LUW respectively). The SUW is a word unit that focuses on the morphological aspects of the language and is restricted to being at most a linear combination of morphemes. On the other hand, the LUW is a word unit that focuses on the syntactic aspects of language, and there are no limitations on the number of combinations. Roughly speaking, the SUW corresponds to the dictionary entry and the LUW corresponds to the content word part of the bunsetsu, which is the syntactic constituent of a sentence according to Japanese school grammar. For example, the Japanese sentence Watashi wa nihongo bunpo ni tuite benkyo shinakereba naranai ('I have to study Japanese grammar') counts 14 words using SUW and 6 words using LUW. This example is somewhat extreme, but there are many sentences where the number of words differs depending on whether it is counted with SUW or LUW. More about the SUW and LUW is described in Maekawa et al. (2014), and the rules governing the SUW and LUW can be found in Ogura et al. (2010).

3 Data

The corpus used in the present study is the BCCWJ. The BCCWJ is the first Japanese balanced corpus; it was constructed by the National Institute for Japanese Language and Linguistics. It was completed in 2011 and is currently being widely used for research. The BCCWJ is comprised of three sub-corpora and 13 registers. The notion of register that is being referred to here is a concept similar to genre and could also be called a text type. This study conducts an analysis by considering the BCCWJ's register as the genre of the text. Fig. 2 shows the structure of the BCCWJ, and Tab. 2 shows the number of words in each of the BCCWJ's registers when counted as both SUW and LUW.

Fig. 2: Structure of the BCCWJ
– Publication subcorpus: 35 million words (SUW); books, magazines, and newspapers; 2001 to 2005
– Library subcorpus: 30 million words (SUW); books; 1986 to 2005
– Special-purpose subcorpus: 35 million words (SUW); white papers, textbooks, PR papers, best sellers, Q&A bulletin boards, blogs, verse, law, proceedings

Tab. 2: Number of words in the BCCWJ, by register and S/LUW

Register               SUW           LUW
Library books          30.307.625    25.031.768
Best sellers            3.737.668     3.182.019
Q&A bulletin boards    10.235.490     8.592.375
Laws                    1.079.083       706.250
Proceedings             5.102.439     4.007.806
PR papers               3.750.468     2.303.793
Textbooks                 924.940       742.685
Verse                     223.181       200.866
White papers            4.880.892     3.098.691
Blogs                  10.125.783     8.217.870
Published books        28.450.509    22.767.324
Magazines               4.424.573     3.461.009
Newspapers              1.369.772       997.074
Total                 104.612.423    83.309.530

4 Results

4.1 Verification of Kabashima's Law

Fig. 3 shows a fitting of Kabashima's law when words are measured in SUW, while Fig. 4 shows a fitting of Kabashima's law when words are measured in LUW. In order to match the conditions of Kabashima's study, these figures exclude particles and auxiliary verbs. Looking at the results, one can see that the fit is not particularly good in either case, whether counting with SUW or with LUW. This is because all of the actual values of the verb group lie below the estimated values and all of the actual values of the adjective and adverb group lie above the estimated values. However, the deviation seems to be almost uniform, so it may be possible to eliminate it by adjusting the parameters. A different problematic aspect of Kabashima's formulation is that when the value of the noun group exceeds 77, the estimated value of the adverb and adjective group becomes negative. Since there are no cases where the actual values are negative, Kabashima's law also requires revision on this point. In fact, the highest ratio of nouns in this study was 79 when measured using SUW and 85.6 when using LUW. When these values are simply entered into Kabashima's formulas, the estimated values become negative.


Fig. 3: Kabashima’s law by SUW

Fig. 4: Kabashima’s law by LUW

4.2 Word Units and the Ratio of Parts of Speech

Tab. 3 shows the ratio of each part of speech. The following two differences in the measurement occur when using SUW and LUW:

(3)  a. The ratio with LUW is smaller than that with SUW: nouns and suffixes.
     b. The ratio with LUW is larger than that with SUW: particles and auxiliary verbs.

Since consecutive nouns that are treated as separate in SUW are recognized as compound words in LUW, the number of nouns when calculated by LUW becomes relatively low. Thus, their ratio with LUW is smaller than that with SUW. The same situation also applies to suffixes. This is because almost all of the suffixes are included in compound words when using LUW; thus, their number decreases.


On the other hand, the particle ni tuite ('about, concerning') is treated as a single-word compound case particle when using LUW. When using SUW, however, it is treated as one verb and two particles. Consequently, the number of particles decreases when using LUW, but in reality the ratio of particles is higher with LUW than with SUW. The reason why the ratio of particles and auxiliary verbs is higher for LUW than for SUW remains unknown.

Tab. 3: Ratio difference between SUW and LUW

PoS              SUW    LUW
Noun             0.350  0.293
Pronoun          0.014  0.017
Verb             0.135  0.127
i-adjective      0.015  0.016
na-adjective     0.013  0.021
Adnominal        0.010  0.012
Adverb           0.017  0.027
Conjunction      0.005  0.008
Interjection     0.002  0.002
Particle         0.300  0.334
Auxiliary verb   0.098  0.143
Prefix           0.008  0.000
Suffix           0.032  0.000
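The token-based computation underlying Tab. 3 is straightforward; a minimal sketch (ours), given a list of (word, PoS) token pairs:

from collections import Counter

def pos_ratios(tagged_tokens):
    # Ratio of each part of speech, counting tokens rather than types.
    counts = Counter(pos for _, pos in tagged_tokens)
    total = sum(counts.values())
    return {pos: count / total for pos, count in counts.items()}

print(pos_ratios([("hon", "Noun"), ("o", "Particle"), ("yomu", "Verb")]))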

4.3 Sentence Length and the Ratio of Parts of Speech

Does the ratio of parts of speech change depending on sentence length? In Kabashima's study (1955), a positive correlation was identified only between sentence length and the noun group, and there is no reference to parts of speech other than nouns. This study investigated the correlation between sentence length and the ratio of parts of speech for all of the parts of speech using both SUW and LUW.

Sentence length was measured by the number of SUWs and LUWs included in a section enclosed by the sentence tag of the BCCWJ. However, almost all of these tags were automatically annotated by computers, so some items not normally considered sentences were included, such as newspaper headings, book chapter titles, and captions for tables and figures. In order to exclude these types of abnormal sentences, only the portions enclosed by tags that ended with the symbols "。", ".", "!", "?", and "」" were considered normal sentences, and the subject of this study was limited to these. There are a total of 4.374.273 normal sentences in the entire BCCWJ, which corresponds to approximately 80% of what is enclosed by a sentence tag.

Tab. 4 shows the correlation coefficients between sentence length (from 1 to 100) and the ratio of each of the parts of speech for normal sentences. Sentence length was limited to the range from 1 to 100 because the values of every part of speech are comparatively stable in this range. When a strong correlation is defined as exceeding ±0.7, the parts of speech with a positive correlation with sentence length were nouns and suffixes when using SUW, and na-adjectives and adnominals when using LUW. The parts of speech with a negative correlation with sentence length were pronouns, i-adjectives, and auxiliary verbs when using SUW, and i-adjectives when using LUW. Moreover, it was determined that verbs and interjections are not affected by sentence length.

Tab. 4: Correlation coefficients between sentence length and ratios of parts of speech

PoS              SUW        LUW
Noun              0.948 *    0.004
Pronoun          −0.752 *   −0.549 *
Verb             −0.142      0.036
i-adjective      −0.777 *   −0.760 *
na-adjective     −0.369 *    0.766 *
Adverb           −0.549 *   −0.262 *
Adnominal         0.092      0.770 *
Conjunction       0.395 *    0.229
Interjection     −0.235     −0.249
Particle          0.040      0.410 *
Auxiliary verb   −0.712 *   −0.252
Prefix            0.671 *   −0.174
Suffix            0.904 *   −0.034

* = p-value < 0.01
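A sketch of this procedure (ours) is given below; since Tab. 4 does not name the coefficient, Pearson's r over the mean ratios at each length is assumed here:

import numpy as np

SENTENCE_END = ("。", ".", "!", "?", "」")

def is_normal(sentence_text):
    # Filter described above: keep only sentences ending in one of the
    # listed symbols, excluding headings, titles, and captions.
    return sentence_text.rstrip().endswith(SENTENCE_END)

def length_ratio_correlation(sentences, pos, max_len=100):
    # sentences: list of sentences, each a list of (word, pos) pairs.
    lengths = np.arange(1, max_len + 1)
    means = []
    for l in lengths:
        group = [s for s in sentences if len(s) == l]
        ratios = [sum(1 for _, p in s if p == pos) / l for s in group]
        means.append(np.mean(ratios) if ratios else np.nan)
    means = np.asarray(means)
    ok = ~np.isnan(means)
    return np.corrcoef(lengths[ok], means[ok])[0, 1]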

Fig. 5 and Fig. 6 indicate how the ratio of each part of speech changes as sentence length increases, using SUW and LUW respectively, with the entire BCCWJ as the target of analysis. In order to keep them compact, the figures only include significant parts of speech and sentence lengths of 50 or less, but the trends hardly change for higher values. Although the values are unstable until sentence length reaches approximately 10, certain interesting facts come to light when observations are made for values greater than 10. One such fact is that the ratio of nouns overtakes the ratio of particles in the case of SUW; this happens when sentence length reaches 26. This change is not observed in the case of LUW. Focusing on this phenomenon, an investigation of the relationship between the ratios of nouns (N) and particles (P) in every register of the BCCWJ according to sentence length was conducted. The relationships between these two can be logically classified into the following five categories (a sketch of this classification follows the list):

(4)  Type A: Parallel
     A1: N is higher than P
     A2: P is higher than N
     A3: N is equal to P

(5)  Type B: Crossing
     B1: N overtakes P
     B2: P overtakes N
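A minimal sketch of this classification (ours; the tolerance used for "equal" is an illustrative assumption, as the paper does not state an explicit criterion):

import numpy as np

def classify_np(noun_ratios, particle_ratios, tol=0.005):
    # noun_ratios and particle_ratios: the N and P curves of one register,
    # indexed by increasing sentence length.
    n = np.asarray(noun_ratios)
    p = np.asarray(particle_ratios)
    if np.all(np.abs(n - p) <= tol):
        return "A3"                        # parallel, N equal to P
    if np.all(n > p):
        return "A1"                        # parallel, N above P
    if np.all(p > n):
        return "A2"                        # parallel, P above N
    return "B1" if n[0] < p[0] else "B2"   # crossing: N or P overtakes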

Fig. 5: Sentence length and parts of speech (SUW)


Fig. 6: Sentence length and parts of speech (LUW)

Tab. 5 shows which type each register belongs to, and Tab. 6 organizes these types by register. In Tab. 5, there are four blank cells: types A3 and B2 for SUW, and types B1 and B2 for LUW. Further investigation will be necessary to determine whether these blank cells are a product of coincidence or whether they inevitably arise from the characteristics of the word units. In Tab. 6, it can be observed that there are many cases where the types differ between SUW and LUW. The changes in types can be summarized in the following way, looking primarily at the SUW. Type A1 is seen in 7 registers when using SUW, but six of these change into different types when using LUW. Conversely, type A2 is seen in 4 registers when using SUW, and they all remain type A2 when using LUW. Type B1 is seen in 2 registers when using SUW, but both change into different types when using LUW. By combining the types when SUW is used and the types when LUW is used, the 13 registers can be classified into 5 categories. Moreover, it has been confirmed that the relationship between the ratio of verbs and the ratio of auxiliary verbs also differs depending on the register. Thus, it is likely possible to make classifications into even smaller categories by combining these.

Tab. 5: Types of texts according to N and P ratios

Type | SUW | LUW
A1 (N>P) | Laws, PR papers, textbooks, white papers, blogs, magazines, newspapers | Laws
A2 (P>N) | Best sellers, Q&A bulletin boards, Proceedings, Verse | Library books, Best sellers, Q&A bulletin boards, Proceedings, Textbooks, Verse, Blogs, Published books, Magazines
A3 (N=P) | – | PR papers, White papers, Newspapers
B1 (N x P) | Library books, Published books | –
B2 (P x N) | – | –

Tab. 6: Types of N and P ratios

Register               SUW   LUW
Library books          B1    A2
Best sellers           A2    A2
Q&A bulletin boards    A2    A2
Laws                   A1    A1
Proceedings            A2    A2
PR papers              A1    A3
Textbooks              A1    A2
Verse                  A2    A2
White papers           A1    A3
Blogs                  A1    A2
Published books        B1    A2
Magazines              A1    A2
Newspapers             A1    A3

5 Conclusion and Future Challenges

The following conclusions were reached about the four research questions posed in this study:

(6)  a. Kabashima's law does not apply particularly well. It is also necessary to revise its parameters.
     b. The ratio of parts of speech changes depending on the word units. A change was observed in nouns, suffixes, particles, and auxiliary verbs.
     c. The ratio of parts of speech changes depending on sentence length. In particular, the ratio of nouns and that of particles undergo significant changes.
     d. The relationship between the ratio of nouns and that of functional words (particularly particles) changes depending on the register.

A few of the issues that need to be investigated in the future are as follows: making observations in cases where the parts of speech are grouped in a manner similar to Kabashima's, attempting the classification of registers using parts of speech other than nouns and particles, and exploring the possibility of a revised edition of Kabashima's law that takes sentence length into account.

Acknowledgments

This paper is one outcome of the collaborative research project "Foundations of Corpus Japanese Linguistics" conducted from 2009 to 2015 at the National Institute for Japanese Language and Linguistics. Texts included in the registers of Library books and Published books within the BCCWJ were compiled with the support of MEXT KAKENHI Grant Number 18061007.

References

Kabashima, T. (1954). On the Ratio of Parts of Speech in Present-day Japanese and the Cause of its Fluctuation. Studies in the Japanese Language (Kokugogaku), 18, 15–20.
Kabashima, T. (1955). Regulations of Classified Parts of Speech. Kokugo Kokubun, 24(6), 55–57.
Maekawa, K. et al. (2014). Balanced Corpus of Contemporary Written Japanese. Language Resources and Evaluation, 48(2), 345–371.
Mizutani, S. (1983). Vocabulary. Tokyo: Asakura Publishing Co.
Mathematical Linguistic Society of Japan. (2009). Encyclopedia of Japanese Quantitative Linguistics [Keiryo Kokugogaku Jiten]. Tokyo: Asakura Publishing Co.
Ogura, H. et al. (2010). Regulations for the Morphological Information of the "Balanced Corpus of Contemporary Written Japanese" (4th ed., Book 1 & 2) [Gendai Nihongo Kaki kotoba Koopasu Keitairon Jouhou Kiteishu (dai 4 han (jou & ge))]. National Institute for Japanese Language and Linguistics.
Ohno, S. (1956). Studies on the Basic Vocabulary of Japanese: In the Japanese Classical Literature. Studies in the Japanese Language (Kokugogaku), 24, 24–46.

Index of Names Adamczyk, E. 189, 193 Adams, J. Q. 158 Altmann, G. 12–14, 25, 27–28, 30, 31, 33, 36, 40, 63–64, 72–73, 87, 89, 96, 101, 125–126, 133–134, 137–138, 151, 181, 186, 190, 193–195, 203, 207–209, 213–214, 227, 231, 233, 238–239 Andreev, S. 1, 12 Andres, J. 14, 20, 25, 28, 30, 37, 40 Argamon, S. 12, 87, 153, 163, 164 Ash, S. 167, 178 Baayen, R. H. 242, 253 Baker, P. 183, 186, 193, 196 Bakhtin, M. 1, 12 Ban, J. Q. 238–239 Benešová, M. 13, 25, 27, 30, 40, 72 Bernstein, E. 248, 253 Best, K. H. 186, 193, 214, 259, 266 Boberg, Ch. 167, 178 Bohn, H. 231, 239 Bolasco, S. 207, 214 Brainerd, B. 51, 61 Brugman, H. 172, 178 Brunner, K. 186, 194, 196 Brynjólfsson, E. 193 Buk, S. 72, 127, 134, 138 Bunke, H. 229, 239 Bunshō no Kakikata 15, 25 Burger, J. D. 75–77, 86 Burrows, J. F. 1, 12, 154–155, 163–164 Bush, G. W. 162 Campbell, A. 178, 186–187, 194, 196 Can, F. 1, 12, 30 Carayannis, G. 239 Carpenter, R. H. 159, 163 Clinton, B. 162 Cortelazzo, M. A. 214 Craig, H. 153, 163 Cramer, I. 30, 40 Čebanov, S. G. 186, 194 Čech, R. 12, 72, 151 Dalby, D. 125, 138 Dassow, J. 221, 227 Davydov, A. 125, 138 Dębowski, Ł. 41–43, 48

de Saussure 181, 194 Diané, M. 138 Dickinson, B. 77, 87 Dubrocard, M. 242, 253 Du, D. Y. 230, 239 Dumais, S. T. 163 Durán, P. 243, 253 Eder, M. 103, 113 Ehret, K. 94, 101 Erat, E. 193 Evanini, K. 179 Favreau, J. 159 Ferrer i Cancho, R. 127, 138 Figueredo, G. 104, 113 Fišerová, E. 25, 40 Ford, G. 162 Fourier, J. 205 Frazier, L. 93, 101, 221, 227 Fruehwald, J. 165, 168, 171, 172, 173, 174, 175, 178, 179 Fucks, W. 186, 194 Fujie-Winter, K. 25 Furnas, G. 160, 163 Fu, Y. H. 230, 239 Galle, M. 207, 214 Garfield, J. A. 158 Giacofci, M. 207, 214 Gillis, S. 243–245, 252–253 Gomez, L. M. 163 Gordon, E. 175–176, 178 Gries, S. Th. 103, 113 Grzybek, P. 51, 61, 72–73, 87, 138, 194, 227 Guan, Y. 233, 239 Guillot, G. 58, 61 Guiraud, H. 48, 242, 253 Gunkel, L. 183, 194 Guo, J. 239 Habein, Y. 25 Hadamitzky, W. 25 Hagiwara, H. 151 Halliday, M. A. K. 104–105, 107–108, 113 Ha, L. Q. 232, 239 Hamilton, A. 159 Hanna, P. J. 239 Harrison, W. H. 158



Hatzigeorgiu, N. 232, 239 Hawkins, J. A. 93, 101, 221, 227 Hay, J. 176, 178–179 Hengeveld, K. 215–227 Herdan, G. 42, 48, 242, 253 Heydel, M. 103, 113 Hilberg, W. 41–43, 47–48 Hoffman, D. R. 158, 164 Hoover, D. L. 1, 12, 155, 164 Hota, S. R. 1, 12 Howard, A. D. 158, 164 Hřebíček, L. 13–14, 25, 30, 39–40, 193, 214 Huang, W. L. 86, 231, 238–240 Hu, W. 77, 87 Chen, Q. H. 239, 240 Chen, X. 229, 233 Chipere, N. 243, 253 Choi, D. 12 Choi, S. W. 232, 239 Chvosteková, M. 25, 40 Isahara, H. 151 Jàanɛ̀, B. M. 130, 138 Jakobson, R. 183, 194 Jarvis, S. 242, 253 Jayaram, B. D. 232, 239 Ji, M. 113 Johnson, Ch. H. 5 Johnson, W. 242, 253 Juola, P. 1, 94, 101, 153, 164 Kabashima, T. 255–258, 260–262, 266, 267 Kamada, T. 19, 25 Kantε, S. 125 Kennedy, J. F. 159, 161, 163 Kenney, M. 178 Ke, S.-W. 112–113 Kinney, A. F. 153, 163 Klecka, W. R. 6, 12 Klein, D. 164 Kobayashi, S. 139, 151 Köhler, R. 12, 30, 38, 40, 48, 63–65, 67, 71–73, 87, 89–91, 93–94, 96–97, 100–101, 138, 150–151, 186, 193–195, 207, 214–216, 227, 229–230, 239 Koike, S. 17, 25 Koizumi, M. 139, 150–151 Koppel, M. 12, 153, 163–164 Koso, A. 139, 151

Krott, A. 186, 194 Kubáček, L. 25 Labbé, D. 153–154, 156, 159–164 Labov, W. 76, 87, 167, 170, 172–174, 178–179 Lambert-Lacroix, S. 214 Landauer, T. K. 163 Lass, R. 182–183, 194 Lay, T. C. 1, 12 Liu, Y. 76, 87, 89, 101, 231, 239, 240 Lotman, Y. M. 1–2, 12 Louchard, G. 43, 49 Lukin, A. 104, 108, 113 Maclagan, M. 175–176, 178 Mačutek J. 12, 63–65, 69–73, 151 Madison, J. 159 Maekawa, K. 259, 267 Malvern, D. 242–244, 253 Mańczak, W. 193–194 Mandelbrot, B. 41, 49, 138, 229, 232–233, 239 Manning C. D. 156, 164 Mansfield, K. 107–108, 113 Marot, G. 214 Martynenko, G. Y. 1, 12 Matthiessen, C. M. I. M. 104–105, 108, 113 McCarthy, P. M. 242, 253 McMenamin, G. R. 1, 12 Merriam, T. 124, 154, 164 Miestamo, M. 101, 226–227 Migliorini, B. 203, 214 Mikros, G. K. 6, 12, 64, 69, 70, 72–73, 75, 79, 82, 87, 239 Milička, J. 64, 72–73, 137–138 Miller, B. 41, 49, 77, 87, 216, 227 Mitchell T. M. 155, 157, 164 Miyaoka, Y. 151 Mizutani, S. 255, 267 Monbushō Kyōkashokyoku Chōsaka Kokugo Chōsashitsu 17, 25 Montemurro, M. A. 127–128, 138 Motalová, T. 14, 18, 25 Müller, G. 183, 194 Muraoka, S. 140, 151 Naumann, S. 1, 12, 61, 63–65, 67, 72–73, 89, 91, 93, 96, 100–101, 227 Nitsū, N. 16, 25 Noda, H. 140, 151 Oakes, M. P. 103, 113

Index of Names  Obama, B. 158–159 Ogino, T. 140–142, 151 Ogura, H. 259, 267 Ohno, S. 255–256, 267 Packard, J. 229, 239 Pagano, A. 104, 108, 113 Pan, D. F. 230, 239 Patton, J. M. 1, 12 Pǎun, G. 221, 227 Pennebaker, J. W. 1, 12, 77, 87, 163 Picard, F. 214 Piotrowski, R. G. 1, 12, 72–73, 138, 193–194, 203, 209, 213–214, 239 Popescu, J. I. 12, 207, 214 Prün, C. 182–184, 187, 194 Pushkin, A. 182 Quirk, R. 183, 186–187, 194, 196 Rao, D. 75–76, 81, 87–88 Reagan, R. 162 Ribāsu, J. 25 Richards, B. 242–244, 253 Rijkhoff, J. 215–216, 227 Rosenfelder, I. 172, 175, 178–179 Rothe, U. 30, 181, 184, 194 Rousset, F. 58, 61 Rovenchak, A. 65, 70, 72–73, 125–127, 133–134, 138 Rudman, J. 1, 12 Russel, A. 172, 178 Rybicki, J. 103, 113 Sadock, J. M. 218, 227 Saeki, T. 139, 151 Sanada, H. 63, 73, 139, 142, 151 Sanders, R. 56, 61 Satō, F. 25 Savoy, J. 153–154, 164 Sawa, T. 150–151 Sebastiani, F. 157, 164 Seltzer, R. V. 159, 163 Serant, D. 243–244, 253 Seyfarth, S. 177–179 Sherrod, P. H. 209 Schler, J. 163–164 Schmid, H. 248, 253 Schreiber, H. 152 Schütze, H. 156, 164 Siewierska, A. 215, 227


Singer, Y. 164 Smith, F. J. 239 Smith, R. D. 232, 240 Sneller, B. 165, 177–179 Solé, R. V. 127, 138 Sommerfeldt, K. E. 140, 152 Sorensen, T. 159 Soshi, T. 151 Spáčilová, L. 14, 18, 25 Spahn, M. 18, 25 Stallings, W. 229, 240 Steiner, E. 106, 113 Steiner, P. C. 182–184, 187, 192, 194 Stewart, D. W. 239 Stewart, L. 1, 12 Strycharczuk, P. 168, 179 Su, P. C. 230, 240 Szmrescanyi, B. 94 Szpankowski, W. 43, 49 Švarný, O. 17, 25 Tamaoka, K. 151 Tanaka, H. 152 Taylor, Z. 158 Tesnière, L. 140, 152 Tokunaga, T. 139, 150, 152 Toutanova, K. 154, 158, 164 Trevisani, M. 203–204, 207–208, 213–214 Triola, M. F. 225, 227 Tuzzi, A. 203–204, 207–208, 213–214 Tweedie, F. J. 242, 253 van Erven, T. 43, 49 van Lier, E. 215–220, 222–227 Vrbková, J. 25 Vulanović, R. 61, 215–217, 220–224, 226–227 Vydrine, V. 125, 130, 132, 138 Wang, D. P. 238, 240 Wang, G. A. 229, 240 Wang, L. 229–230, 240 Wang, Y. 232, 240 Wannisch, T. 183 Ward, J. H., Jr. 108–111 Warren, P. 176, 179 Washington, G. 87, 88, 158, 159 Welte, W. 183, 194 Wimmer, G. 65, 73, 134, 138, 185–186, 195, 207, 214


 Index of Names

Wrenn C. L. 183, 186–187, 194, 196 Wulff, S. 103, 113 Xanthos, A. 243–245, 248, 252–253 Yallop, C. 106, 113 Yoneyama, T. 25 Yoshimoto, B. 13, 25 Zhang, Y. J. 75, 87, 89, 101, 232, 238–240

Zhao, Y. 154–156, 164 Zifonun, G. 183, 194 Zigdon, I. 12 Zipf, G. K. 41–42, 48–51, 61, 90, 92, 125–128, 138, 193, 195, 229–233, 235, 238–240 Zobel, J. 154–156, 164

Subject Index Abbreviation 75–76, 78, 86, 129 Absolute frequency 127, 155, 207 Absolute grammar efficiency (AGE) 221–224 Acoustic space 166 Aggregation round 248–251 Algorithm 28, 30, 75, 77–79, 82–84, 86, 94, 132, 210–211, 242, 245–247, 250, 252 Allophone 165, 171, 174, 178 Allophonically different 167, 172 Allophonic split 165, 172, 177 Altmann-Fitter 96, 190, 193 Analysis attempt (AA) 221–223 Anaphora 99 Aphasia 39 Aspectual marker 229 Associative semantic potential (ASP) 115–116, 118 Author 1, 9, 13–14, 16, 30, 37, 56, 64–65, 68, 71, 75–79, 81–83, 112, 123, 128, 153–163, 203, 208, 242 Author Multilevel N-gram Profile (AMNP) 75, 79, 82–83, 85–86 Authorship attribution 153 Average frequency 233–234, 237–238, 244 Balanced Corpus of Contemporary Written Japanese 256, 267 Barron’s inequality 47 Bayes model 155, 157 Belletristic 64, 66, 68 Bimodal distribution 166 Blogs 64, 66, 68, 259, 265 Brand acceptance 75 Business news 96 Case syncretism 183 Clade structure 109 Classical Japanese 256 Clause 105–106, 108, 141–143, 227 Closed-class attribution problem 153 Cluster analysis 103, 107, 112–213, 214 Clustering 78, 103–104, 107, 109–112, 203, 207, 209–210, 214 Coarticulation 168–170 Code 41, 43–45, 48–49 Complement 139–150 Complete character 233

Complexity 28, 39, 89–100, 150, 182, 186, 189–191, 216, 226, 229–231, 233, 235, 238 Component 1, 11, 13–18, 23, 92, 105, 229–232, 234–235, 238–240, 243 ––experiential component 105 ––logical component 105 Computer science articles 96 consequent segmenting of the text 29 Conservation principle (CP) 38–39 ––of grammar 39 ––of information complexity 39 ––of lexical stability 39 ––of semantic saturation 39 ––of syntactic structure 39 Consonant 54, 65, 130–131, 133, 137, 168 Constituent 27, 31, 33–35, 89–92, 94, 96–97, 99, 258 Constructs 31, 33–35, 86, 94 ––language constructs 31 Content-bearing word 154, 160–161 Coordinate 6 Core vocabulary 126–128, 137 Corpus 17, 77–79, 81–82, 85, 89–91, 93–97, 99–100, 103–104, 106–108, 125–129, 137, 151, 154, 158, 160, 162–163, 165, 172, 175–176, 181, 203–206, 209, 211, 214, 231–232, 248, 255–259 ––linguistics corpus 17 ––translational corpus 103, 113 Correlation coefficient 68, 263 Criterion of ‘islands’ 18 Cross-linguistic comparison 104 Decreasing trend 205, 211 Deep case 139–140, 142–144, 146, 149 Degree of freedom 148 Delta rule 154 Dendrogram 107–109, 111–112 Dependent clause 106, 141–143, 227 Derivation of lows 115 Descriptive statistics 65, 79–80, 103 Diachronic 165, 168–169, 171, 205 Dictionary 19, 25, 38, 121–123, 129–137, 158, 231–232, 258 Dictionary of Pushkin’s Language 121, 124



Diphthong 31, 65, 172, 174–176 Direct 6, 15, 19, 108, 139, 156 ––speech 6, 15, 19, 108 Direct object – indirect object (DO-IO) 139, 150 Discourse 51, 78, 80, 106, 153, 203, 209, 211, 213 Discriminant model 1 Dissipative stochastic mathematical models 115 Distance-based model 154 Distribution 34–35, 42, 47, 64, 96, 98–100, 115–116, 118, 120–124, 156, 181–182, 184, 186–187, 189, 191, 193 ––age-polysemic 115–116, 118–119, 122–123 ––polysemic 115, 118, 124 ––switch 41, 43–44, 46–47 ––universal 41–42, 47 Diversification effects 181–182, 184–185, 191–192 Dramas 64, 66, 68 Dynamic system 30, 230 Dynamic system 61 Effect of duration 178 ELAN 172, 178 Element 7, 30, 51, 86, 105, 185, 209 Ellipse 16, 76, 78, 139 Emoticon 80–81, 86 Emotion 4, 78 Emphatic line 6 Empirical investigation 96, 181 Enjambment 5 Entropy 48–49, 138, 164 Ethnicity 75 Euclidean distance 108, 210 Evolution 115, 203–207 Facebook 75 Factor 1, 51, 64, 68, 71, 76, 91, 94, 139, 170 ––biological 76 ––socio-psychological 76 FAVE 172, 175, 179 Fiction 96 Final position of feature 58 Fixed effect 207–208 Fixed word order 182 Flexible word class 217–218 Forensics 76, 81

Formal 1–2, 5–6, 9–11, 101, 184, 216, 220, 226–227, 239 ––grammar 220 ––model 226 Formula 28, 36–37, 43, 118–119, 156, 208, 217, 223 ––complete 28 ––truncated 28, 37 frequency 5, 15, 20, 23, 28, 33–37, 51, 52, 63–64, 76, 78, 82, 86, 103–107, 121, 125, 127, 129, 132, 138–139, 143, 154–158, 160–162, 181–189, 191–193, 203–205, 207, 209, 213, 229–239, 244, 246–247, 252 ––analysis 130, 154 Gaussian proces 207 Gender 51, 75–86, 182, 196 ––identification 75–76, 87 ––keywords 75 Genre 1, 64–65, 67–68, 71, 75, 256, 259 ––language 75 Geography 51–52, 56–58 Grammar efficiency 215–217, 220–223, 226–227 Grammatical 17, 99, 105, 107–109, 112, 206–207, 215, 221, 226, 227 ––category 206–207 ––functions 17, 99, 105, 107–109, 112 ––structure 215, 221, 226 Grapheme 30, 32–33, 35, 65 Graphological unit 111, 112 Greek 69–70, 75, 79, 85–87, 232, 239 ––Modern 69–70, 75, 79, 85–87 Greek Twitter Corpus (GTC) 76, 79–80 Hash tag 77, 80–81 Herdan’s laws 42 Hilberg’s conjecture 41–43, 47–48 Hiragana character 16–17 History of words 203, 209, 214 Hungarian Szeged corpus 91, 96 Hyper-Pascal distribution 89, 91–93, 96–100 Hyperpoisson distribution 181, 186, 190–191 Hypotactical relation 108 Character 2, 4, 10, 14–19, 21–24, 32, 47, 75–77, 81–82, 87, 229–231, 233, 235, 238–240 Chi-square test 191

Subject Index 

Chronological corpora 203 Chronotope 1, 4, 11 Chunk 75, 86 Ideational metafunction 105 Ideophones 132 Increasing trend 205, 213 Indirect object – direct object (IO-DO) 139, 150 Inflectional 181–187, 191–193, 196, 199, 241, 243–245, 247–249, 252 ––diversity 241, 243, 253 ––paradigms 181–183, 185–187, 191, 193, 196, 199 Information 15, 17, 28, 37, 39, 41–43, 45–48, 51, 64, 67, 75–78, 82, 85, 89, 94, 100, 105–106, 139, 150, 153–154, 160, 162–163, 167, 182, 186, 204, 206, 216, 220–221, 223, 226, 231–232, 252 ––mutual 41–43, 45–47 ––textual 82 Intercomma 14, 16, 20, 21, 22 Internet 75, 108, 233 Interpersonal metafunction 105 Intransitive sentence 215, 226 Inversion 6, 8 Island segmentation method 18 Isogloss 58–59 Kabashima’s law 255–258, 260–261, 266–267 Kana script 14, 24 Kanji character 14, 16, 24 Kinship term 78 Köhler’s complexity 89–90, 94 Kolmogorov complexity 94 Kullback-Leibler Divergence 154–156 Labbé’s Intertextual Measure 156 Language 13–24, 27, 31, 38–39, 41–44, 47–48, 51–52, 56, 58, 60, 63–65, 69, 71, 75–76, 82, 89, 92–93, 97, 100, 103–106, 112, 115, 117–122, 125, 133, 137, 139, 150–151, 165, 178, 181–186, 192, 203–205, 208–209, 215, 220, 225–227, 229–230, 232, 248, 255, 258 ––Balkan Romance 53 ––complexity 93, 226 ––Latin 53, 76, 232, 239 ––level 13–15, 19, 21, 23–24, 137


––Modern Romanian 53 ––natural 41–44, 47–48, 115, 215, 229, 248 ––Romanian 51–54, 61 ––sign 115, 124 ––system 27, 38–39, 89, 103–104, 106, 182 ––unit 13, 15–16, 18, 20, 24, 64, 71, 230 ––variation 52, 58 Langue 138, 181 Laplace smoothing 156 Large-scale 165–166, 171, 175, 255–256, 258 ––corpus 255–256, 258 Leaving-one-out methodology 159 Legal documents 96 Lemma 43, 204, 210 Lemmatization 206–207 Lempel-Ziv code 43–45 Length 5, 14, 23, 31, 33–35, 41–47, 59, 63–65, 68–72, 76–77, 89–90, 93–94, 103, 118, 120, 125, 129, 133–137, 139, 157–158, 186, 193, 207, 221, 230, 251, 255–258, 262–267 ––constituent 35 ––construct 35 ––sentence 63, 221, 255–258, 262–264, 267 ––text 42–43, 68, 71, 207 ––word 63–65, 69, 71–72, 133, 139, 186, 230 Lexematic 241, 243–247, 249–250, 252 ––diversity 241, 244–247, 252 ––variety 243, 245, 247 Lexeme 184, 192, 241, 243, 245–250, 252 Lexicogrammar 105–106, 112 Lidstone’s law 156, 158 Linguistic ––formalisms 89 ––quantitative 12, 25, 38, 40, 48, 72–73, 87, 89, 101, 124, 138, 164, 193–195, 214, 227, 239–240, 255, 267 ––sign 115, 117–118, 120 ––synergetic 27, 38, 40, 61, 89–90, 230, 239 Linguistic Inquiry and Word Count text analysis program 77 Listener 39, 105 Locus 1, 4, 9, 10, 11 Loess-smoothed data 171 Long unit word 256, 258 Mahalanobis distance 8–10 Machine learning approach 157


 Subject Index

Main clause 141–143 Manding language 125, 137 Maninka ––consonants 130 ––vowels 129 Markov chains 43, 47 Maximizing Compactness 90 Meaning 17, 30–31, 57, 81, 104–106, 115–118, 123, 140–141, 150, 154, 167, 177, 184–185, 192, 230 Mean size of paradigm index (MSP) 241, 243–250, 252 Median 225 Menzerath-Altmann Law (MAL) 13–17, 19–20, 24–25, 27–31, 33, 36–40, 72–73, 125–126, 133 Meta-functional Profile 105 Methodology 13, 28–29, 37, 75, 81, 86, 104, 107, 159, 241 Minimal pair 167 Minimizing Complexity 90 Modalization 105 Model identification 115 Model unigram 41, 44, 47–48 Modern Japanese 256 Modifier 217, 222 Modulation 105 Mora 126, 133–137 Morpheme 30–33, 181, 184, 258 ––zero 32 Morphemic combination 229 Morphology 92–93, 181–182, 184, 229 Motif 63–65, 68–73 Multidimensional scaling 51–52, 57–58 Multivariate 6, 103, 107 ––analysis 107 ––statistics 103 Musical compositions 63 Negative 186 ––binomial distribution 186 ––difference 225 NEGRA corpus for German 90 Neurobiology 76 Newspaper articles 96, 127, 151 N-gram 75, 79, 82, 87 Nko 125–130, 132, 137–138 ––periodicals corpus 126, 128

––writing system 125 NLREG 65, 209 Nonparametric test 225 Non-zero difference 224–225 Normalized MSP (NMSP) 244–246, 248–252 Note 63, 170 ––duration 63 Number of components (NC) 229–231, 233–238 Number of strokes (NS) 229–238 Object 31, 59–60, 76, 139, 141–143, 150–151 Ohno’s law 256 Olomouc Speech Corpus 31 One-tailed sign test 225 Open-set problem 153 Orange Textable 248 Origins of New Zealand English Corpus (ONZE Corpus) 165, 175–176, 179 Palatalization 31, 52–53 Paragraph 14–15, 19, 31–33, 35–37, 103–104, 108–109, 111–112 Parametric test 225 Paratactical relation 108 Parole 181, 193 Parsing ratio 222 Participant 105 Part-Of-Speech ––features 6 ––system (PoS) 215–225, 262–263 ––tag (POS) 153–154, 160, 162 Pattern 51–52, 54–55, 58–60, 64, 75–76, 78, 86, 98, 103–104, 111–112, 125–126, 128, 132–133, 137, 140, 174, 181–184, 199, 203, 205, 207–209, 213–214 ––behavioral 75 Peak-and-valley trajectory 205 Philadelphia Neighborhood Corpus (PNC) 165, 172–175, 179 Phoneme 30, 38, 51, 65, 125–126, 128–132, 137–138, 165–167, 171, 175–176 Phonemically different 167 Phonological rule 168, 170, 175 Phonology 138, 179 Piotrowski-Altmann law 203 Plot 2–3, 85, 240 Poetic 1–4, 8–11 ––space 1, 3, 10–11

Subject Index 

––time 1–2, 4, 8–9 Poisson distribution 97–98, 186 Polysemy 115–124, 230 Pooled variance 148 Post-hoc test 7, 85 Postposition 139–151 Power law 41, 44, 48, 232 Predicate phrase 217, 223 Primary loci 3, 4, 7 Principle 14, 18, 23, 38–39, 43, 92, 156, 159, 182, 230, 248, 252 ––conservation 38–39 ––economization 38 Principle of least effort 230 Process 13, 39, 41, 52, 55–56, 105, 116–117, 145, 174–176, 184, 194, 206–207, 209, 213, 248 Profiling 75, 83, 106, 153 Project Gutenberg eBook 248 Pronoun 78, 141, 154–155, 162, 182, 262–263 Propositional function 217–218, 220–224, 226 Prose 2, 12, 64, 66, 68 Proximity 103, 107, 112–113, 120 Punctuation marks 16–17, 20–24, 97, 98, 100, 127 Quantificational study 58 Quantitative 1–2, 28–32, 38–39, 51–53, 57, 85, 89, 137, 181–182, 231, 238, 241 ––analysis 32, 137 ––linguistics 38–39, 181–182, 231, 241 ––methods 52 Random 41–42, 47–48, 72, 75, 79, 82, 86, 116, 119–120, 207–208, 241–242, 244 ––effects 208 ––Forests 75, 79, 82, 86 ––sampling 241 ––text 41, 72 Rank 41, 95, 108, 125, 127–128, 232 Rank-frequency 41, 127–128 ––curve 127 ––dependence 127–128 ––distribution 41, 127 Rate of change 165–166, 168–178 Ratio of parts of speech 255–256, 258, 261–262, 266–267 Real 2, 9–11, 72, 120, 153, 162, 165, 252

 277

––space 11 ––time 11, 120 Referential phrase 220–221 Rejet enjambment 5 Relative 13, 122, 141, 154–155, 162, 207, 213, 217, 223, 225, 227, 245 ––frequency 154, 162, 207, 213 ––grammar efficiency (RGE) 223–225 ––pronouns 141 Rigid word class 217 Robust MSP (RMSP) 241, 245–252 Robustness 241, 244, 247, 252 Romanian Online Dialect Atlas (RODA) 52, 55–58, 61 Same-language comparison 103 Sample size 156, 225, 241–246, 249–250, 252 Scaling coefficient 223 Scientific papers 64, 66, 68 ––in physics 64, 66, 68 ––in the humanities 64, 66, 68 Secondary loci 5, 9 Second-order loci 3 Segmentation 13–14, 17–18, 24, 27, 29–33, 35–37 Self regulation 150 ––model 150 Semanticity 14 Semiotics 138 Sentence 6, 14–17, 19–20, 30, 63, 72–73, 93, 96–97, 99, 103–104, 108–109, 111–112, 139–146, 148, 150–151, 215, 217, 220–223, 226, 255–259, 262–267 ––length 221, 255–258, 262–265, 267 ––silent 17 Sermons 64, 66, 68 Short unit word 256, 258 Schwa 55–56 Sign 31, 44, 115–120, 122–215, 225, 231, 235 Silent sentence 17 Similarity 10, 38, 103–104, 110, 112, 175–176 Simply form 243 Smileys 75–76 Smoothing 122, 156, 158, 214, 242 Sociolinguistics 76, 86, 214 Sound 31, 166, 172, 244 ––speech 31


 Subject Index

Source text 103–104, 107–108, 110–112 SOV language 150 Speakers 31, 39, 125, 167–170, 172, 175, 177 Spearman’s correlation test 251 Sport reports 64, 67–68 Square-shaped graphic field criterion 17 Statistical 20, 28, 30, 37–38, 41, 47, 51, 58, 61, 73, 76, 78, 82–85, 88, 101, 103–104, 113, 125–126, 129, 138, 148, 164, 182, 215, 225, 239 ––analysis 215 ––method 28, 103, 104 ––test 82 Stroke 14–19, 229–232, 234–235, 238 Structural complexity 93–94, 229–231, 233, 235, 238 Student essays 96 Stylometry 12, 103, 113 Subject 39, 58, 60, 89, 105, 116, 139–146, 148–149, 151, 175, 235, 248, 263 Subordinate sentences 6 Subphonemic vowel 165 Subsample 77, 241–242, 246–247, 249, 251–252 Support Vector Machine (AVM) 75–77, 79, 82–86 Surface case 139–144, 146, 149 Surparagraph 14–15 SUSANNE corpus for English 90 Syllable 30–33, 35, 63–65, 125–126, 128, 132–137, 257 Syllable co-occurrence 128 Synergetic-linguistic model 89, 230 Synergetic linguistics 27, 38, 89–90, 230 Synergy 12 Syntactic 5–6, 8, 39, 51, 89–95, 101, 151, 205, 217, 227, 258 ––complexity 89–94 ––feature 5 ––pause 5–6 ––pauses 5–6, 8 ––slots 217 Synthetic languages 181 System 13–14, 27, 29–32, 38–39, 53, 57, 89, 92–93, 103–106, 122, 125, 154, 162, 181, 182, 215–216, 218, 220–224, 230 ––of modality 105

––of mood 105 ––of polarity 105 ––of transitivity 105 Systemic functional linguistics (SFL) 104–106, 108, 113 Target text 103–104, 107–112 Taxis 105 Temporal pattern 203, 207–209, 213–214 Text ––analysis 18, 77, 103, 106, 112 ––corpus 125, 181 ––Corpus 239 ––in natural language 41–44, 47–48 ––Modern Greek 69–70 ––random 41, 72 ––Ukrainian 63–64 ––unigram 41–44, 47–48 Textual metafunction 105 The General Online Dialect Atlas (GODA) 56 The Mantel test 58 The Mantel test 57 Theme 4, 11, 106, 139–141, 146, 149, 255 Theoretical lexematic variety 247 The principle of minimizing 38 ––decoding effort (MinD) 38 ––memory (MinM) 38 ––production effort (MinP) 38 Token 63–65, 73, 94, 103, 128, 156–158, 161, 171, 173–175, 177, 186, 206, 239, 241–242, 244, 246, 248–250, 252, 255–256, 258 Tonal pattern 125–126, 128, 132, 137 Topical word 154, 162–163 Topos 1 Transcription 31, 35 ––phonetic 31 Transient equilibrium 39 Transitivity 105 Tree-banks 89 Treetagger 248 T-statistics 148 T-test 139, 148 TüBa-D/Z Treebank 99 Tweet 75–81, 85–87 Twitter 75–81, 86–88

Subject Index 

Type 2, 5, 18, 28, 63–65, 73, 77, 90, 96, 103, 105–106, 112, 127, 129, 132, 137, 139, 145, 165, 186, 188, 192, 206–207, 209, 215–216, 218, 224, 227, 231, 239, 242–243, 245, 247–249, 252–253, 255–259, 262, 264–266 Type-token ratio (TTR) 64–68, 71, 242, 245–246, 249, 250 Typologically Balanced Ukrainian Text Database 64, 72 Unambiguous sentences (US) 153, 158, 162, 221–223 Unification 90, 92, 184–185, 194 Unigram 41–44, 47–48, 77 Unit 13, 15–21, 23–24, 27–33, 35, 39, 63–64, 71–72, 89–92, 94, 101, 104–105, 111–112, 118–123, 125, 127, 132–135, 186, 193, 230–231, 238–239, 241, 243, 255–258, 261, 265, 266 Unordered integer partition 181, 184–185, 187 Urn model 181, 185, 186 Valency 89, 101, 139–142, 151–152 Valency database 140–142 Valeur 181 Variety 19, 51, 241–247, 249, 255 Verb valency 89, 101, 151 Verification 13, 28, 153, 260 VOCD algorithm 242 Vowel 31, 52–53, 55–56, 65, 129–134, 165–168, 170–179 Ward method 108 wavelet-based model 207

 279

wavelet coefficients 208 Webster’s Collegiate Dictionary 121 Wilcoxon signed-ranks test 225 word 10, 14, 16–17, 24–25, 30–33, 35, 38, 41, 51, 53–55, 58, 63–65, 68–72, 75, 77–80, 82–86, 90, 92–94, 96, 99, 103–104, 106, 108–109, 111–112, 119–127, 129, 132–137, 139–140, 153–163, 167–168, 172–173, 177, 181–183, 185–187, 199, 203–209, 211, 213, 217–224, 229–230, 232–233, 246, 255–262, 265–267 ––birth 119 ––classes 217–221, 223–224 ––distribution 41, 123 ––frequent 78–79, 82, 127, 153–155, 162 ––function 75, 155, 213 ––functional 82, 153–155, 162, 258, 267 ––graphical 32–33, 35 ––order 92–93, 139–140, 182, 217, 220–222 ––orders 217, 220 ––unit 16, 255–256, 258 ––units 255–258, 265–266 word-class symbol 221 wordform 243, 245, 247–249, 252 ––variety 243, 247, 249 word-syllable-mora level 133 word-token 158, 206 Wrenn C. L. 183, 186–187, 194, 196 Zipf-Mandelbrot law 229, 232–233 Zipf’s law 41, 126–229, 231–233, 235, 238 Z score 155 α-unit 258

Authors’ Addresses Benešová, Martina Department of General Linguistics Faculty of Arts, Palacky University, Olomouc e-mail: [email protected] Birjukov, Denis Department of Asian studies Faculty of Arts, Olomouc Debowski, Łukasz Institute of Computer Science Polish Academy of Sciences e-mail: [email protected] Embleton, Sheila York University, Toronto, Canada e-mail: [email protected] Figueredo, Giacomo P. Federal University of Ouro Preto, Brazil e-mail: [email protected] Faltýnek, Dan Department of General Linguistics Faculty of Arts, Palacky University, Olomouc e-mail: [email protected] Guex, Guillaume University of Lausanne Anthropole, CH–1015 Lausanne e-mail: [email protected] Hrubik-Vulanović, Tatjana Department of Mathematical Sciences Kent State University at Stark 6000 Frank Ave NW, North Canton, Ohio 44720, USA e-mail: [email protected].



Chen, Xinying School of Foreign Studies, Xi’an Jiaotong University & Text Technology Lab Department of Computer Science and Mathematics, Goethe University No 28 Xianning West Road, Xi’an Jiaotong University, Xi’an, 710049, Shaanxi, China e-mail: [email protected] Lukin, Annabelle Macquarie University, Australia e-mail: [email protected] Mačutek, Ján Department of Applied Mathematics and Statistics Comenius University Mlynská dolina, SK-84248 Bratislava, Slovakia e-mail: [email protected] Mikros, George K. Department of Italian Language and Literature School of Philosophy e-mail: [email protected] Naumann, Sven Universität Trier 54286 Trier, Germany e-mail: [email protected] Pagano, Adriana S. Faculdade de Letras Federal University of Minas Gerais Av. Antonio Carlos, 6627 – Pampulha, Belo Horizonte MG, 31270-901, Brazil e-mail: [email protected] Perifanos, Kostas Department of Linguistics National and Kapodistrian University of Athens, Greece e-mail: [email protected]


Poddubnyy, Vasiliy National Research Tomsk State University Tomsk, Russia e-mail: [email protected] Polikarpov, Anatoly Lomonosov Moscow State University Moscow, Russia e-mail: [email protected] Rovenchak, Andrij Ivan Franko National University of Lviv, Ukraine e-mail: [email protected] Sanada, Haruko Faculty of Economics Rissho University (Tokyo) 4-2-16, Osaki, Shinagawaku, Tokyo 141-8602, Japan e-mail: [email protected] Savoy, Jacques Computer Science Dept. University of Neuchatel rue Emile Argand 11, 2000 Neuchatel (Switzerland) e-mail: [email protected] Sneller, Betsy University of Pennsylvania e-mail: [email protected] Steiner, Petra C. Institut für Informationswissenschaft und Sprachtechnologie Universität Hildesheim e-mail: [email protected] Tuzzi, Arjuna Dept. FISPPA, Sociology buildings University of Padua via M. Cesarotti 10/12 35123 Padova, Italy e-mail: [email protected]




Köhler, Reinhard Universität Trier e-mail: [email protected] Uritescu, Dorin York University, Toronto, Canada Vulanović, Relja Department of Mathematical Sciences Kent State University at Stark 6000 Frank Ave NW, North Canton, Ohio 44720, USA e-mail: [email protected] Wang, Yanru School of Foreign Studies, Xi’an Jiaotong University School of Psychology and Cognitive Science, East China Normal University No 3663 Zhongshan North Road, East China Normal University, 200062, Shanghai, China e-mail: [email protected] Wheeler, Eric S. York University, Toronto, Canada e-mail: [email protected] Xanthos, Aris University of Lausanne Anthropole, CH–1015 Lausanne e-mail: [email protected] Yamazaki, Makoto National Institute for Japanese Language and Linguistics e-mail: [email protected] Zamečník, Lukáš Hadwiger Department of General Linguistics Faculty of Arts, Palacky University, Olomouc e-mail: [email protected]