Multivariate Humanities (ISBN 3030691497, 9783030691493)

This case study-based textbook in multivariate analysis for advanced students in the humanities emphasizes descriptive and exploratory methods rather than test-based mathematical statistics.


English · 456 [441] pages · 2021


Table of contents:
Preface
Global table of contents
Contents
Part I The Actors
1 Introduction: Multivariate studies in the Humanities
1.1 Preliminaries
1.1.1 Audience
1.1.2 Before you start
1.1.3 Multivariate analysis
1.1.4 Case studies: Quantification and statistical analysis
1.2 The humanities—What are they?
1.3 Qualitative and quantitative research in the humanities
1.4 Multivariate data analysis
1.5 Data: Formats and types
1.5.1 Data formats
1.5.2 Data characteristics: Measurement levels
1.5.3 Characteristics of data types
1.5.4 From one data format to another
1.6 General structure of the case study chapters
1.7 Author references
1.8 Wikipedia
1.9 Web addresses
2 Data inspection: The data are in. Now what?
2.1 Background
2.1.1 A researcher's nightmare
2.1.2 Getting the data right
2.2 Data inspection: Overview
2.2.1 The normal distribution
2.2.2 Distributions: Individual numeric variables
2.2.3 Inspecting several univariate distributions
2.2.4 Bivariate inspection
2.3 Missing data
2.3.1 Unintentionally missing
2.3.2 Systematically missing
2.3.3 Handling missing data
2.4 Outliers
2.4.1 Characteristics of outliers
2.4.2 Types of outliers
2.4.3 Detection of outliers
2.4.4 Handling outliers
2.5 Testing assumptions of statistical techniques
2.5.1 Null hypothesis testing
2.5.2 Model testing
2.6 Content summary
3 Statistical framework
3.1 Overview
3.2 Data formats
3.2.1 Matrices: The basic data format
3.2.2 Contingency tables
3.2.3 Correlations, covariances, similarities
3.2.4 Three-way arrays: Several matrices
3.2.5 Meaning of numbers in a matrix
3.3 Chapter example
3.4 Designs, statistical models, and techniques
3.4.1 Data design
3.4.2 Model
3.5 From questions to statistical techniques
3.5.1 Dependence designs versus internal structure designs
3.5.2 Analysing variables, objects, or both
3.6 Dependence designs: General linear model—glm
3.6.1 The t test
3.6.2 Analysis of variance—anova
3.6.3 Multiple regression analysis—mra
3.6.4 Discriminant analysis
3.6.5 Logistic regression
3.6.6 Advanced analysis of variance models
3.6.7 Nonlinear multivariate analysis
3.7 Internal structure designs: General description
3.8 Internal structure designs: Variables
3.8.1 Principal component analysis—pca
3.8.2 Categorical principal component analysis—CatPCA
3.8.3 Factor analysis—fa
3.8.4 Structural equation modelling—sem
3.8.5 Loglinear models
3.9 Internal structure designs: Objects, individuals, cases, etc.
3.9.1 Similarities and dissimilarities
3.9.2 Multidimensional scaling—mds
3.9.3 Cluster analysis
3.10 Internal structure designs: Objects and variables
3.10.1 Correspondence analysis: Analysis of tables
3.10.2 Multiple correspondence analysis
3.10.3 Principal component analysis for binary variables
3.11 Internal structure designs: Three-way models
3.11.1 Three-mode principal component analysis—tmpca
3.12 Hypothesis testing versus descriptive analysis
3.13 Model selection
3.14 Model evaluation
3.15 Designing tables and graphs
3.15.1 How to improve a table
3.15.2 Example of table rearrangement: a binary dataset
3.15.3 Examples of table rearrangement: contingency tables
3.15.4 How to improve graphs
3.16 Software
3.17 Overview of statistics in the case studies
4 Statistical framework extended
4.1 Contents and Keywords
4.2 Introduction
4.3 Analysis of variance designs
4.4 Binning
4.5 Biplots
4.6 Centroids
4.7 Contingency tables
4.8 Convex hulls
4.9 Deviance plots
4.10 Discriminant analysis
4.11 Distances
4.12 Inner products and projection
4.13 Joint biplots
4.14 Means plot with error bars, line graph, interaction plot
4.15 Missing rows and columns
4.16 Multiple regression
4.17 Multivariate, multiple, multigroup, multiset, and multiway
4.18 Quantification, optimal scaling, and measurement levels
4.19 Robustness
4.20 Scaling coordinates
4.21 Singular value decomposition
4.22 Structural equation modelling—sem
4.23 Supplementary points and variables
4.24 Three-mode principal component analysis (tmpca)
4.25 X² test (χ² test)
Part II The Scenes
5 Similarity data: Bible translations
5.1 Background
5.2 Research questions: Similarity of translations
5.3 Data: English and German Bible translations
5.4 Analysis methods
5.4.1 Characteristics of multidimensional scaling and cluster analysis
5.4.2 Multidimensional scaling
5.4.3 Cluster analysis
5.5 Bible translations: Statistical analysis
5.5.1 Multidimensional scaling
5.5.2 Cluster analysis
5.6 Other approaches to analysing similarities
5.7 Content summary
6 Stylometry: Authorship of the Pauline Epistles
6.1 Background
6.2 Research questions: Authorship
6.3 Data: Word frequencies in Pauline Epistles
6.4 Analysis methods
6.4.1 Choice of analysis method
6.4.2 Using correspondence analysis
6.5 The Pauline Epistles: Statistical analysis
6.5.1 Inspecting Epistle profiles
6.5.2 Inertia and dimensional fit
6.5.3 Plotting the results
6.5.4 Plotting the Epistles profiles
6.5.5 Epistles and Word categories: Biplot
6.5.6 Methodological summary
6.6 Other approaches to authorship studies
6.7 Content summary
7 Economic history: Agricultural development on Java
7.1 Background
7.2 Research questions: Historical agricultural data
7.3 Data: Agriculture development on Java
7.4 Analysis methods
7.4.1 Choice of analysis method
7.4.2 catpca: Characteristics of the method
7.5 Agricultural development on Java: Statistical analysis
7.5.1 Categorical principal component analysis in a miniature example
7.5.2 Main analysis
7.5.3 Agricultural history of Java: Further methodological remarks
7.6 Other approaches to historical data
7.7 Content summary
8 Seriation: Graves in the Münsingen-Rain burial site
8.1 Background
8.2 Research questions: A time line for graves
8.3 Data: Grave contents
8.4 Analysis methods
8.5 Münsingen-Rain graves: Statistical analysis
8.5.1 Fashion as an ordering principle
8.5.2 Seriation
8.5.3 Validation of seriation
8.5.4 Other techniques
8.6 Other approaches to seriation
8.7 Content summary
9 Complex response data: Evaluating Marian art
9.1 Background
9.2 Research questions: Appreciation of Marian art
9.3 Data: Appreciation of Marian art across styles and contents
9.4 Analysis method
9.5 Marian art: Statistical analysis
9.5.1 Basic data inspection
9.5.2 A miniature example
9.5.3 Evaluating differences in means
9.5.4 Examining consistency of relations between the response variables
9.5.5 Principal component analyses: All painting categories
9.5.6 Principal component analysis: Per painting category
9.5.7 Scale analysis: Cronbach's alpha
9.5.8 Structure of the questionnaire
9.6 Other approaches to complex response data
9.7 Content summary
10 Rating scales: Craquelure and pictorial stylometry
10.1 Background
10.2 Research questions: Linking craquelure, paintings, and judges
10.3 Data: Craquelure of European paintings
10.4 Analysis methods
10.5 Craquelure: Statistical analysis
10.5.1 Art-historical categories: Scale means
10.5.2 Scales, judges, and paintings: Three-mode component analysis
10.5.3 Separation of art-historical categories
10.6 Other approaches to pictorial stylometry
10.7 Content summary
11 Pictorial similarity: Rock art images across the world
11.1 Background
11.2 Research questions: Evaluating Rock Art
11.2.1 The Kimberley versus Algerian images
11.2.2 The Zimbabwean, Indian, and Algerian images
11.2.3 The Kimberley, Arnhem Land, and Pilbara images
11.2.4 General considerations
11.3 Data: Characteristics of Barry's rock art images
11.4 Analysis methods
11.4.1 Comparison of proportions
11.4.2 Principal component analyses for binary variables
11.5 Rock art: Statistical analysis
11.5.1 Comparing rock art from Algeria and from the Kimberley
11.5.2 Comparing rock art from Zimbabwe, India, and Algeria
11.5.3 Comparing rock art images from within Australia
11.5.4 Further analytical considerations
11.6 Other approaches to analysing rock art images
11.7 Content summary
12 Questionnaires: Public views on deaccessioning
12.1 Background
12.2 Research questions: Public views on deaccessioning
12.3 Data: Public views about deaccessioning
12.3.1 Questionnaire respondents
12.3.2 Questionnaire structure
12.3.3 Type of data design
12.4 Analysis methods
12.5 Public views on deaccessioning: Statistical analysis
12.5.1 Item distributions
12.5.2 Item means
12.5.3 Item correlations
12.5.4 Measurement models: Preliminaries
12.5.5 Measurement models: Confirmatory factor analysis
12.5.6 Measurement models: Deaccessioning data
12.5.7 Item loadings
12.5.8 Interpretation
12.6 Other approaches in deaccessioning studies
12.7 Content summary
13 Stylometry: The Royal Book of Oz - Baum or Thompson?
13.1 Background
13.2 Research questions: Competitive authorship
13.3 Data: Occurrence of function words
13.3.1 Preprocessing
13.3.2 Dataset
13.4 Analysis methods
13.4.1 Significance testing
13.4.2 Distributions and graphics
13.4.3 Principal component analysis and graphics
13.4.4 Cluster analysis
13.5 Wizard of Oz: Statistical analyses
13.5.1 Principal component analysis
13.5.2 Cluster analysis
13.6 Other approaches in authorship studies
13.7 Content summary
14 Linguistics: Accentual prose rhythm in mediæval Latin
14.1 Background
14.2 Research questions: Accentual prose rhythm in mediæval Latin
14.3 Data: Janson's data tables
14.4 Analysis methods
14.4.1 Contingency tables
14.4.2 Ordinal principal component analysis
14.5 Accentual prose rhythm: Statistical analysis
14.5.1 Internal structure of individual authors' cadences
14.5.2 Similarities in accentual prose rhythm
14.6 Content summary
15 Linguistics: Chronology of Plato's works
15.1 Background
15.2 Research questions: Plato's chronology
15.3 Data: Kaluscha's clausulae data
15.4 Analysis methods
15.5 Plato's chronology: Statistical analysis
15.5.1 Text similarities
15.5.2 Clausulae and texts
15.6 Other approaches to Plato's chronology
15.7 Content summary
16 Binary judgments: Reading preferences
16.1 Background
16.2 Research questions: Binary variables
16.3 Data: Reading preferences
16.4 Analysis methods
16.4.1 Loglinear modelling
16.4.2 Multiple correspondence analysis
16.4.3 Supplementary variables
16.5 Reading preferences: Statistical analysis
16.5.1 Co-occurrence of quality and popular reading
16.5.2 Complexity of the relations
16.5.3 Multiple correspondence analysis
16.5.4 Supplementary background variables
16.6 Other approaches to binary judgments
16.7 Content summary
17 Music appreciation: The Chopin Preludes
17.1 Background
17.2 Research questions: Appreciation and musical knowledge
17.3 Data: Semantic differential scales
17.3.1 Musical database: Semantic differential scales
17.3.2 Design and data collection
17.3.3 Data format: Students, Preludes, and Scales
17.4 Analysis methods
17.4.1 Two-way multivariate analysis of variance
17.4.2 Tucker's three-mode model
17.4.3 Joint biplots
17.5 The Chopin preludes: Statistical analysis
17.5.1 Individual differences
17.5.2 Three-mode principal component analysis—tmpca
17.5.3 Scale components
17.5.4 Prelude components
17.5.5 Joint biplot—Consensus
17.5.6 Preludes as characterised by the scales
17.5.7 Circle of fifths: Judgments and keys
17.5.8 Joint biplot—Individual differences
17.6 Other approaches to evaluating music appreciation
17.7 Content summary
18 Musical stylometry: Characterisation of music
18.1 Background
18.2 Research questions: Differences in musical style
18.3 Data: Melodic intervals and pitch
18.3.1 Musical database
18.3.2 Musical styles: Features
18.3.3 Design
18.4 Analysis methods
18.4.1 Binary logistic regression
18.4.2 Multinomial logistic regression
18.4.3 Discriminant analysis
18.5 Characterisation of music: Statistical analysis
18.5.1 Data description
18.5.2 Data inspection
18.5.3 Preliminaries for the analyses
18.5.4 Predicting Bach versus Haydn + Beethoven
18.5.5 Genre: Heterogeneity of the Bach pieces
18.5.6 Genre: Discriminating between Bach pieces
18.6 Other approaches to analysing musical styles
18.7 Content summary
Part III The Finale
19 Final Musings
Appendix A Discipline-orientated statistics books
Statistical Glossary
References
Index


Quantitative Methods in the Humanities and Social Sciences

Pieter M. Kroonenberg

Multivariate Humanities

Quantitative Methods in the Humanities and Social Sciences

Series Editors:
Thomas DeFanti, Calit2, University of California San Diego, La Jolla, CA, USA
Anthony Grafton, Princeton University, Princeton, NJ, USA
Thomas E. Levy, Calit2, University of California San Diego, La Jolla, CA, USA
Lev Manovich, The Graduate Center, CUNY, New York, NY, USA
Alyn Rockwood, KAUST, Boulder, CO, USA

Quantitative Methods in the Humanities and Social Sciences is a book series designed to foster research-based conversation with all parts of the university campus – from buildings of ivy-covered stone to technologically savvy walls of glass. Scholarship from international researchers and the esteemed editorial board represents the far-reaching applications of computational analysis, statistical models, computer-based programs, and other quantitative methods. Methods are integrated in a dialogue that is sensitive to the broader context of humanistic study and social science research. Scholars, including among others historians, archaeologists, new media specialists, classicists and linguists, promote this interdisciplinary approach. These texts teach new methodological approaches for contemporary research. Each volume exposes readers to a particular research method. Researchers and students then benefit from exposure to subtleties of the larger project or corpus of work in which the quantitative methods come to fruition.

Editorial Board:
Thomas DeFanti, University of California, San Diego & University of Illinois at Chicago
Anthony Grafton, Princeton University
Thomas E. Levy, University of California, San Diego
Lev Manovich, The Graduate Center, CUNY
Alyn Rockwood, King Abdullah University of Science and Technology

Publishing Editor for the series at Springer: Laura Briskman, [email protected]

More information about this series at http://www.springer.com/series/11748

Pieter M. Kroonenberg

Multivariate Humanities


Pieter M. Kroonenberg
Leiden University
Leiden, Zuid-Holland, The Netherlands

ISSN 2199-0956    ISSN 2199-0964 (electronic)
Quantitative Methods in the Humanities and Social Sciences
ISBN 978-3-030-69149-3    ISBN 978-3-030-69150-9 (eBook)
https://doi.org/10.1007/978-3-030-69150-9

© Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

To Ineke

A biplot breakfast

Preface

Why this book?

Rather than stating this myself, I leave it to voices in the humanities to explain the need for a book such as the present one.

Just 15% of students in England study mathematics beyond GCSE level. However, many of this non-mathematics studying majority find that they need mathematical skills for the advanced study of other subjects, including humanities and social science subjects at school or university or in their job. [...] Without mathematical and, in particular, statistical skills whole areas of the social sciences and humanities are inaccessible to research students and future academics. (Canning, 2014, p. vii)

My position is that multivariate analysis is to be thought of as nothing more than the analysis of tables of data. If it is worth putting together a table of data it is worth exploring it by multivariate methods. (Wright, 1989, p. 1; quoted in Baxter, 1994)

On the other hand, Thomas (1978, p. 231) produces a cautionary note in his paper The awful truth about statistics in archaeology:

There is a rapidly growing clutch of statistically-sophisticated archeologists who seem perched, rotating factor analyses in hand, prepared to pounce on the first clump of unmanipulated data that should have the misfortune of stumbling into their path.

In writing this book it has been my explicit aim to present a link between the data collected to tackle specific research questions in the humanities, and appropriate multivariate statistical techniques to answer such questions. The central part of the book consists of case studies from different disciplines in the humanities. They are meant to encourage researchers to look differently at their data, and to consider various possibilities for analysis. At the same time it would be wise to take heed of the cautionary remarks in Thomas’s (1978) paper. Thus, the general idea of the text is to provide guidance for researchers in making informed decisions on which approaches may be useful to answer their research questions using the data at hand. This book is not meant to teach them how to perform all the analysis methods presented here, but instead to show the kind of analyses that are available given the data and the research questions. My hope is that this approach will be useful in practical work. However, to quote Warwick (2004, p. 378), it should be realised that ‘the use of digital (and statistical) resources can only be truly meaningful when combined with old-fashioned critical judgement’—and expert knowledge, I would like to add.


What is new?

Readers may wonder what new insights can be gained by using multivariate methods. Obviously there is no easy answer, as I am not an expert in all the fields touched upon in this book. Primarily I would be happy if results from applying multivariate methods are not contradictory to what the experts say their data tell them. The power of analysing many data in one go and presenting them in a coherent fashion, especially via well-designed tables and graphs, has given me, as a nonexpert, insights into various fields of endeavour which otherwise I could probably only have acquired via painstaking, detailed, time-consuming studies of each field separately. At the same time I hope and expect that quantitative methods, in general, and multivariate techniques, in particular, will put powerful tools in the hands of experts in the humanities so that they can both expand their views and present their research results to colleagues and novices. Moreover, they may discover new facts and anomalies which otherwise might have gone unnoticed. Apart from this, the book is primarily intended as a demonstration of analytic tools available for use in many fields of the humanities. The case studies are demonstrations of how statistical problems arising from a research project may be tackled. By seeing these tools applied in different disciplines, researchers may discover how they may be applied in their own research projects.

How to use this book

The case study chapters all have a similar setup. First, the background and substantive research questions are introduced, as well as the data of the study. Next, the statistical methods used in the chapter are discussed at an elementary level, as well as the reasons for choosing them. Finally, the focal points of the chapters are the results of the analyses, which are presented at a medium statistical level. To support the analyses in the case studies there are four introductory methodological chapters, so that the emphasis in the case study chapters could be on the application rather than the mathematical side of things. The fourth chapter contains more detailed explanations for readers who want to delve deeper into the statistical side of the methods. Moreover, for a brief explanation of statistical terms a Glossary is provided at the end of the book. The emphasis in the text is on descriptive and exploratory methods, amply illustrated with many different kinds of graphics, rather than on test-based and mathematical statistical methods. It should be emphasised that the methods used are rather standard and have been found useful in many contexts. The main exceptions are methods for categorical data, and methods in which the data have been measured more than once. At times the same explanations will be given in several chapters to allow for (semi)independent reading, but there will also be many cross-references to other places in the text for additional explanation and information. However, researchers wishing to carry out meaningful analyses of their data should turn to statistical consultants and/or specialists in the various statistical techniques described.

Audience

The level of the exposition is aimed at readers with at least a graduate degree in the humanities and a basic course in methodology and statistics. I am assuming they understand such subjects as elementary descriptive analysis and hypothesis testing. For them I hope to show that many multivariate statistical techniques can be used to make more sense of their data. On the other hand, I have also tried to make the text attractive for individuals who are primarily interested in the case studies themselves, and want to see what it is all about. I would advise them to read the first chapter, skim (or skip) the next three methodological chapters, and then start with the case studies that interest them, turning only to the methodology if they feel the need. In reading the case studies they should read the background, research questions, and the data descriptions. Then (again) skim or skip the methodological sections, and continue with the content summary at the end of the chapter. The outcomes of the analyses should not be considered fundamental contributions to the subjects studied. After all, I am a statistician, and not an expert on all the areas touched upon in the case studies.

Plagiarism?

Writing books such as this one is a daunting task, because so many methodology books have already been written. Copying texts from other authors who have presented topics so lucidly is a continuous temptation. The following quote from Crowder and Hand (1990, p. 4) was too beautiful and too close to the heart of the matter not to plagiarise: ‘There is very little “original” material here. Our inclination for barefaced plagiarization from as many sources as possible has been tempered only by the need to write a coherent account’.

Another quote which is close to my heart:

The development of statistical ideas and thinking in the 20th century is possibly one of that century’s more important intellectual achievements; the mathematics of statistics can be ugly and sometimes pointless but I think the justification for statistics ultimately lies in applicability to data analysis (...). Not everyone would agree with this. (Baxter, 2008)

Which is duly noted.


Origin

This book has its origins at the Netherlands Institute for Advanced Study in the Humanities and Social Sciences in Wassenaar, The Netherlands, where I stayed during the academic year 2003–2004. A collection of scholars from both the social sciences and the humanities worked and studied there in truly monastic tranquillity. Everyone was expected to give a presentation about their research. Given that my own research dealt with methods for data analysis but that most of the other scholars were working in substantive areas, the idea came to me to show what my kind of experience in analysing data could offer to their fields of endeavour. This resulted in a lecture, Multivariate Analysis: Power to the Humanities, with examples concerning the dating of graves from the artefacts they contained, searching for the relative time sequence of the works of Plato, and investigating the authorship of a book from the Wizard of Oz series. I presented this lecture on several subsequent occasions to various audiences. The idea grew to expand this lecture into a book. However, I felt that I would then need more different kinds of data. To this end I wrote to authors of humanities papers asking whether I could use their data for the purpose of a book which at the time was still more of a daydream than anything else. Many reacted favourably to my request, and these datasets form the core of the present book. I am extremely grateful to the authors for their willingness to share their data with me. At the outset I decided that I would reanalyse the data, not in competition with the original publications, but as an example of how one could answer research questions. Several times this resulted in a co-authorship, which is acknowledged in several chapters in this book. Nevertheless, I decided to take complete responsibility for the text presented here to ensure uniformity and purpose.

Acknowledgements

Many people have contributed to the realisation of this book, and I would like to acknowledge them, in alphabetical order. As co-authors and/or data providers: Michael Barry, Zachary Bleemer, Leonard Brandwood, Spike Bucklow, Dorien Herremans, Tore Janson, Takashi Murakami, Donald Polzella, Marilena Vecco, and Jeroen Vermunt. As critical readers and/or correctors: José Binongo, Laura Brinkman, Rachel Chrastil, Pieter de Coninck, Hugh Craig, Twyla Gibson, Ruud Halbertsma, Casper de Jonge, Annemarie Kets-Vree, Paul Keyser, Els Koeneman, Jo McDonald, Cory Mckay, Kaarle Nordenstreng, Adriaan Rademaker, Ralph Rippe, Ineke Smit, June Ross, Meg Southwell, Paul Taçon, Harold Tarrant, Holger Thesleff, Paul Vierthaler, Peter White, and Meredith Wilson.

Leiden, The Netherlands

Pieter M. Kroonenberg [email protected]

Global table of contents

Opening
Dedication
Preface

PART 1. The Actors

1. Introduction: Multivariate studies in the Humanities
2. Data inspection: The data are in. Now what?
3. Statistical framework
4. Statistical framework extended

PART 2. The Scenes

Theology / Bible studies

5. Similarity data: Bible translations (co-author: Zachary Bleemer)
6. Stylometry: Authorship of the Pauline Epistles

History & Archeology

7. Economic history: Agriculture development on Java
8. Seriation: Graves in the Münsingen-Rain burial site

Arts

9. Complex response data: Evaluating Marian art (co-author: Donald Polzella)
10. Rating scales: Craquelure and pictorial stylometry (co-author: Spike Bucklow)
11. Pictorial similarity: Rock art images across the world
12. Questionnaires: Public views on deaccessioning (co-author: Marilena Vecco)


Linguistics

13. Stylometry: The Royal Book of Oz: Baum or Thompson?
14. Linguistics: Accentual prose rhythm in mediæval Latin
15. Linguistics: Chronology of Plato’s works
16. Binary judgements: Reading preferences

Music

17. Music appreciation: The Chopin Preludes (co-author: Takashi Murakami)
18. Musical stylometry: Characterisation of music (co-author: Dorien Herremans)

PART 3. The Finale

19. Final Musings

Statistical glossary
References
Subject & Author index


Part I

The Actors

Chapter 1

Introduction: Multivariate studies in the Humanities

Abstract Main topics. Multivariate techniques are quantitative statistical procedures designed to analyse data consisting of more (multi) than one variable. Their relation to qualitative analysis is briefly touched upon. Types and formats of quantitative data are sketched. Some general information about the structure of the book is supplied.

Keywords Multivariate analysis · Data types · Data formats · Research questions · Humanities disciplines · Case studies · Qualitative-quantitative

1.1 Preliminaries

In this book I intend to show through examples what quantitative techniques, especially multivariate data analysis and statistics, have to offer when we are dealing with questions which lend themselves to being expressed and analysed through numbers, i.e. data. Some of these questions can be dealt with via techniques which can be found in elementary statistics books for the humanities (and other disciplines, of course), such as Van Peer, Hakemulder, and Zyngier (2012). However, in many cases such elementary techniques cannot really answer the questions posed, or handle the volume of data and the number of variables available. It is in such cases that multivariate data analysis comes to the fore.

Multivariate data analysis is the approach to analysing collections of data consisting of several, even many, variables. Generally there are not only many variables but also many objects, individuals, paintings, literary works, etc. which have scores on these variables. Multivariate analysis in the humanities is unlike a drunk's use of a lamp post; it is used for both support and illumination.1

1 The original quote about the drunkard and his lamp post was probably coined by Andrew Lang; for its history see: https://quoteinvestigator.com/2014/01/15/stats-drunk/

This book is particularly concerned with demonstrating the value of using multivariate statistical and data analysis in humanities research. The groundwork has been laid by many quantitatively inclined scholars working in the humanities. Here I aim to show readers the potential of multivariate data analysis via fourteen case studies, and my ambition is that the presentation will assist them in their research, even if their specific problem is not treated here. I hope that the data-analytic treatments in the case studies are detailed enough to be translated to the researcher's own data-analytic questions.

The aim of the presentations is to show the analyses in more detail and with more methodological comments than was possible in the original papers from which the case studies are taken. I specifically address the questions of how such analyses are conducted, which methodological decisions have to be made, and how they assist in answering the research questions. Nevertheless, at times different techniques from those originally applied to the data will be used, to show the power of multivariate methods and the different points of view from which substantive questions can be approached. A comparable approach, directed especially at archeological researchers, can be found in Baxter (1994).

No attempt has been made to be comprehensive, either with respect to fields within the humanities or with respect to multivariate methods. The primary aim is to familiarise researchers in the humanities with the analytic options multivariate analysis has to offer.

1.1.1 Audience

The book is intended for researchers in the humanities who have had some earlier exposure to statistics, and who have collected data with many variables within their research projects. It is not meant to replace searching in their own literature for examples of analyses with data similar to their own. Instead, I want to show what can be done with multivariate data in general, and give researchers an idea of how other disciplines are dealing with similar problems.

1.1.2 Before you start

Before going into the nitty-gritty of multivariate data analysis, researchers should first think about what they have and what the task before them entails. In its simplest form, the dataset collected consists of individuals (objects, cases, paintings, graves, etc.),2 and variables on which the individuals have scores, for example, how well they read, sing, or speak French, or how tall and heavy they are. Thus we have a data matrix consisting of rows (usually individuals) and columns (usually variables). Because there is more than one variable, the dataset is multivariate.

2 I will refer in this book to individuals, persons, objects, singers, and specific objects such as paintings, words, graves, books, etc., using these words interchangeably where appropriate and without prejudice.

In addition, we have a set of research questions, say 'Is there a relationship between height and weight, and between height and quality of singing?'. We may want to make statements about the relationships between the variables, but we may also want to know whether it is possible to predict the quality of singing from someone's expertise in reading notes, number of singing lessons, and posture. The idea is, of course, that these and other variables contain information about singing quality, and we want to know to what extent we can predict this quality. It is the task of multivariate analysis to sort this out and supply us with information about such questions, using the data collected. The sort of information we may want to have is (1) what relationships there are between the variables, (2) what kinds of individuals have similar scores on the variables, and (3) whether we can get an insight into the relations between the persons and the variables. An effective way to do this is by making graphs and figures which show individuals and variables together.

Another frequently occurring data format is a two-way contingency table, which consists of two categorical variables, one being the row and the other the column variable. The table shows how often a row category and a column category occur together. The row variable may be 'eye colour' with categories blue, light, medium, and dark, and the column variable could be 'hair colour' with categories fair, red, medium, dark, and black. The two variables are, in principle, fully crossed, which means that all combinations of eye colour and hair colour may occur together. A table cell mostly contains a frequency, which tells us how often an eye colour and a hair colour occur together in the sample; for an example of a contingency table of such data see Section 3.2.2, p. 46. A visual impression of the relationship between eye colour and hair colour can be acquired via a single graph, referred to as a biplot; see Chapter 3, p. 47 for the relationship between hair and eye colour in Scotland. How to make such a graph is within the realm of mathematics, and I will not go into much detail here; that happens only in Chapter 4 (p. 106), which presents some further statistical explanations. However, I will regularly talk about interpreting such graphs, as they contain information on the research questions.

Researchers also frequently want to group or cluster objects (and sometimes variables). For example, they may want to know whether there are any naturally existing groups with specific attributes. Listening to a choir, an uninitiated person might want to know how many groups of voices constitute a choir. Alternatively, we know that singers can be grouped into different voice types, and we want to know what makes these groups different. Thus, we search for variables which can provide information about the groups, or which can tell us whether there are naturally occurring groups or man-made clusters.

For all these problems there exist both standard, much-used statistical techniques and newer, more experimental ones. In this book I will concentrate on the standard, well-tried techniques. It is useful to read the introduction to Baxter (1994), who also primarily discusses classical multivariate analysis methods. An exception is my

own speciality, which deals with three-way data, for instance, a group of (1) school children measured several (2) years on the same (3) school subjects. Personally I feel that such data are more common than many people realise. The other exception is that ample attention will be paid to techniques which deal with categorical data, and datasets containing variables of different measurement levels (see Section 1.5.2 for more information).

1.1.3 Multivariate analysis

As mentioned above, a central characteristic of the datasets treated here is that they are multivariate rather than univariate, i.e. they consist of more than one variable. The basic assumption is that the researcher wants to get an insight into the association patterns between these variables, possibly for different groups.

∗ Formats of multivariate data. To give a mini example of the format of multivariate data, let us assume that a researcher is interested in the polemics between two groups (A and B) of three theology students each, who have different opinions about the Bible. In particular, they disagree about its historical base (variables 1 and 2) and the authorship of texts (variables 3 and 4). To this end, a dataset has been collected that can be represented in a matrix with theology students as rows and the aspects of the controversy (variables) as columns. The scores on the variables were collected for each of the students' six years of academic study. Figure 1.1 depicts the structure of such data. It shows both a single matrix at year 1 (left-hand side) and six matrices, one for each year (right-hand side). Of course, not all characteristics (theology students, variables, groups, years) need to be used in every analysis, but the figure shows what the researcher's multivariate data may look like; a sketch in code of the same structure follows below.

∗ Types of multivariate questions. In this example some typical multivariate questions are: 'Is there a relationship between the theology students' opinions about historical issues and the authorship of specific Bible texts?', 'Is such a relationship the same in the two groups?', 'Given that we have additional background variables about the theology students (religion, country of origin, etc.), can we attribute differences in opinion to their various religious backgrounds?', and 'Can we see a systematic change in their opinions during their study of theology?' Such questions are inherently multivariate in the sense that more than one variable is involved in answering them. Moreover, questions of relationships between variables and changes over time generally require the kind of techniques that have been developed within the framework of multivariate analysis, for example, analysis of time series.
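For readers who think in code, the following minimal sketch shows one way such a three-way dataset could be laid out, here in Python with NumPy. The student and variable names, the group assignment, and the random scores are purely illustrative assumptions, not data from the book.

    import numpy as np

    # Illustrative only: 6 theology students x 4 variables x 6 years,
    # filled with random scores on a 7-point scale.
    rng = np.random.default_rng(0)
    scores = rng.integers(1, 8, size=(6, 4, 6))

    students = [f"student_{i}" for i in range(1, 7)]
    groups = ["A", "A", "A", "B", "B", "B"]          # hypothetical grouping
    variables = ["historical_1", "historical_2",
                 "authorship_1", "authorship_2"]

    # Single matrix at year 1 (left-hand side of Figure 1.1):
    year1 = scores[:, :, 0]            # shape (6, 4): students x variables

    # Six matrices, one per year (right-hand side of Figure 1.1):
    yearly_matrices = [scores[:, :, t] for t in range(scores.shape[2])]

Keeping the years as a third index, rather than stacking all years into one long matrix, is what later makes genuinely three-way techniques such as three-mode principal component analysis applicable.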

Fig. 1.1 Multivariate data formats: Single matrix (Left); Six matrices (Right)

1.1.4 Case studies: Quantification and statistical analysis

Starting from the assumption that researchers in the humanities have questions that require multivariate analysis, this book provides examples of substantive research questions which have been or can be tackled with multivariate methods. My aim is to provide an understanding of the data collected that would be difficult to acquire if the variables were taken one by one. It is not my aim to follow the historians described by Thomas III (2004): 'When challenged at a conference, more than one historian responded with nothing more than a mathematical equation as the answer'. It is not the mathematics which is of primary interest, but the question of how multivariate statistical methods provide answers to substantive questions. The aim of the statistical analysis of data from the humanities should be to create an understanding befitting this area of studies.

In certain areas and with certain researchers within the humanities there is opposition to quantification and digitisation in humanities research. For instance, the one-time president of the American Historical Association, Carl Bridenbaugh (1963), said that 'The finest historians will not be those who succumb to the dehumanizing methods of social sciences, whatever their uses and values, which I hasten to acknowledge. Nor will the historian worship at the shrine of that Bitch-goddess, quantification' (p. 326). Of course, this opinion was expressed more than 50 years ago, and many developments have taken place since, as has much experience gained by fruitfully applying multivariate analysis.

A similar outlook on data analysis to that underlying this book was formulated in Baxter's book on exploratory multivariate analysis in archeology (Baxter, 1994), reprinted in 2015 with a new introduction. The power and nature of his philosophy

is strongly expressed in his first chapter, which should be required reading for the audience of this book as well. Similar books, which provide an introduction into the philosophy and practice of statistics, exist in fields other than archeology; see the limited list in the Appendix (p. 375).

1.2 The humanities—What are they?

Different people have different ideas as to what fields belong to the humanities. For instance, Shalin Hai-Jew primarily counts the following fields as belonging to the discipline: the classics, literature, philosophy, religious studies, psychology, modern and ancient languages, culture studies, art, history, music, theatre, film, political science, geography, anthropology, linguistics, social work, and communications (Hai-Jew, 2017, p. viii). However, this list of fields is far too general and far-reaching to serve as an underpinning for this book, especially as some areas have their own overarching field descriptions. The social and behavioural sciences, for instance, are generally considered to consist of psychology, educational sciences, anthropology, political science, public administration, and communication studies. These disciplines will not be covered here, nor will 'others'. On the other hand, archeology will be considered part of the humanities, albeit apart from the exact-science techniques used in archeology such as carbon-14 dating and thermoluminescence. The study of languages, modern or ancient, is here grouped under the general heading of 'linguistics'. In a sense 'history' is also a general container concept, but we will present only one example from this domain.

Talking about what constitutes the humanities, Ryan Cordell states:3

We shouldn't forget, of course, that "humanities" is not itself a self-evident signifier. What "humanities" does and does not comprise differs from definition to definition and from institution to institution. For our students, as for many of us, the word humanities is opaque, vaguely signaling fields that are not the sciences. Even that broad definition is hazy. Consider anthropology, which is in some institutional structures a humanities field and in others a social science; the same sometimes goes for history.

Such remarks as Cordell's are not very useful in deciding which fields should have an example included in this book. The choice of case studies has been primarily guided by interesting examples that have come my way, or for which data discussed in published papers were available, provided they were at least vaguely within the fields mentioned above. As I cannot be an expert in all academic fields of endeavour, the background described in the publications and the assistance of the owners of the data were usually necessary for the interpretation.

3 http://ryancordell.org/teaching/how-not-to-teach-digital-humanities/; accessed 15 June 2020.

1.3 Qualitative and quantitative research in the humanities

Rachel Chrastil created a website on Quantitative Literacy and the Humanities and an accompanying curriculum. There she states:

Quantitative literacy [QL] is part of a holistic, liberal arts education. It is the rigorous and sophisticated application of relatively elementary mathematics to new situations. Much like writing, QL encompasses a wide-ranging set of skills and cuts across the undergraduate curriculum. Just as good writing skills support the sciences and mathematics, good QL skills support the humanities. We unlock new possibilities and tools for precision when we incorporate mathematical thinking into our disciplines. At the same time, robust QL is not possible without the values, nuance and range of experience encountered in the humanities. QL cannot eliminate uncertainty; mathematics alone cannot stand in for good judgment. In this era of big data, humanistic QL is an urgent need. QL does not challenge the humanities. It makes the humanities stronger and more necessary than ever.4

4 http://scalar.usc.edu/works/quantitative-literacy-and-the-humanities; accessed 15 June 2020.

This passage, in which I would have preferred a mention of statistics, is obviously not meant to vilify the qualitative research which is carried out nearly everywhere in humanities faculties, but stresses that qualitative research on its own is not enough and cannot be the exclusive paradigm in humanities research.

To get a conceptual introduction into what statistics is about, and an idea of things statistical, including multivariate statistics, it is useful to read the first chapter of Kachigan (1991, pp. 1–55). There Kachigan states, among other things, the overall objectives of statistical analysis:

Observations of the world are converted into numbers, the numbers are manipulated, and then the results are interpreted and translated back to a world that is now hopefully more orderly and understandable than prior to the data analysis. This process of drawing conclusions and understanding more about the sources of our data is the goal of statistical analysis in the broadest sense. [...] data manipulation and organisation [can be seen] as achieving one or more of three basic objectives: (1) data reduction, (2) service as an inferential measuring tool, and (3) the identification of associations or relationships between and among sets of data. (p. 3)

In this book I primarily deal with objectives (1) and (3). Chapter 3 is entirely devoted to these two objectives, and provides an overview of the general framework of multivariate humanities. Objective (2), i.e. inference, is only treated in passing, and then mostly in a descriptive sense.

1.4 Multivariate data analysis

Through multivariate data analysis researchers attempt to get an insight into a dataset by simultaneously considering the scores on several variables, with the relationships among the variables as the prime focus. This can be reflected in the way in which one set of variables predicts another, and/or how the variables are related among

themselves. In addition, there is a distinction between significance testing of a priori hypotheses on the one hand, and exploratory/descriptive data analysis on the other, although in a single study these activities are often intertwined. In this book the emphasis will be on the data-analytic, descriptive side, and significance testing will be used primarily in a descriptive setting.

The next two chapters are preliminary and methodological; they concentrate on setting the scene once the data have been collected. Chapter 2, Data inspection, outlines the quality control that has to be carried out before the multivariate analysis. Chapter 3, Statistical framework, concerns the transition from research questions to analysis design. For this the characteristics and functions of variables need to be described, so that it is easier to see what design is appropriate for what kind of data, and what statistical technique fits which design best. As it turns out, the same techniques were used in several case studies of this book, although with a different emphasis, because of the nature of the research questions and data. This is in line with the central idea in this book: content first. Some of the more technical statistical information is collected in Chapter 4, Statistical framework extended, and may be ignored on a first reading. The statistical information in this book is not sufficient for researchers to carry out analyses themselves, but is intended especially to explain some statistical concepts and terminology in more detail for the aficionados. The final Glossary offers short definitions and explanations of statistical terms, without going into detail.

Readers who really want to do their own data analysis are referred to literature dedicated specifically to the technical and practical side of multivariate data analysis, and to manuals explaining how to handle specific computer programs (such as SPSS, SAS, and Stata) and programming languages such as R. Examples of computer-oriented application guides are Husson, Lê, and Pagès (2017) and Baxter (1994); some other practical multivariate analysis textbooks, especially popular in the social and behavioural sciences, are Field (2017) (even though the examples are at times annoyingly flippant), Hair, Black, Babin, and Anderson (2018), and Tabachnick and Fidell (2013), to name a few. There are several other books with similar aims but directed at specific areas within the humanities, such as Gries (2013) and Levshina (2015) for linguistics. In her epilogue the latter writes that 'the main goal of this book has been to equip the readers with enough R skills and understanding of fundamental statistical principles to perform common procedures in exploration, visualization, analysis and interpretation of linguistic data'.

Whenever possible, and if allowed by the owners, the data used will be made available as supplementary information on the publisher's website, so that the analyses may be replicated.

1.5 Data: Formats and types

Before discussing the various multivariate data analysis techniques applicable to the data from the case studies in this book, it is good to realise that there are different formats

and types of datasets. Format refers to the way the data or data points are organised. Common formats are matrices, contingency tables, three-way arrays, correlation matrices, similarity matrices, etc. (see Sections 1.5.1 and 3.2). Type refers to the characteristics of the data content, especially the way the variables have been scored, i.e. what determines the values of the variables, what they represent, and how they can be interpreted. Examples of common types are rating scales (paintings are rated from 'very ugly' to 'very beautiful'), frequencies (number of paintings by Rembrandt), similarities (similarity of Frans Hals and Rembrandt paintings), ordinal or rank-order data (finishing order in the London Marathon), and numeric data (weight of a gold bar); see Nishisato (1994, Chapter 3) for a full discussion of such types, in particular within the context of categorical data analysis. Some of these types will be discussed in Section 1.5.3, especially their characteristics, i.e. what types of measurements are commonly used and to what extent these influence what one can do with the data. It should, however, be noted that once data have been collected, their original form is not necessarily the best one in which to handle them, so we need to know whether and how one data type can be transformed into another.

1.5.1 Data formats

Figure 1.2 presents five data formats. The basic data matrix (A) contains scores on a set of variables. Datasets may also consist of more than one matrix, for instance, sets of basic data matrices for measurements under different conditions or at different time points: three-way data (B), or sets of measurements taken from different samples of subjects: multisample data (C). Yet another format is the similarity or correlation matrix (D), which contains information on either the similarity of objects or the strengths of the relationships between variables. We may also have a set of correlation matrices (E) from different samples, or a set of similarity matrices obtained at different time points. Not depicted is a contingency table; see, however, Section 3.2.2.

Fig. 1.2 Common data formats. (A) Basic data matrix, (B) Three-way data, (C) Set of basic data matrices—multisample data, (D) Similarity or correlation matrix, (E) Set of similarity or correlation matrices.

1.5.2 Data characteristics: Measurement levels

There is a fundamental distinction between numeric data and categorical data. For the first type, virtually any computational procedure can yield interpretable results. Thus, we can interpret summary measures such as means, variances, and correlations, as well as all kinds of results from advanced statistical techniques based on them. For the second type, such summary quantities often have no meaning or interpretation. What is the substantive interpretation of a mean of 0.5 if it concerns the variable Gender, coded 0 for boys and 1 for girls? Of course, numerically we can conclude from a mean value of 0.5 that there are as many girls as boys, but a concept of average gender does not exist. There are other measures that can fulfill the summary function for categorical variables, such as the median and the mode, but the mathematical treatment of categorical variables is essentially different from that of numeric variables, as is their interpretation.

The following types of measurement levels are commonly distinguished.

• Numeric: Ratio level. A continuous variable has a ratio level if it has an absolute zero point; for example, for weight a zero indicates weightlessness. About such variables valid ratio statements can be made: for instance, the weight of a stone of 10 kg is twice that of a stone of 5 kg. Thus, the values of the variable can be interpreted in terms of ratios: twice as heavy, three times as long, etc. Counted data can mostly be treated as having a ratio level, but they are discrete: having done something four times is twice as often as having done it twice. Fractions, on the other hand, refer to partial rather than whole objects, and for some variables fractions have no real meaning: a crate may contain 50 and a half apples, but an audience does not consist of 16 and a half people.
∗ In one of our examples theology students from Group A may have used historical facts to support their belief in the authenticity of the Pauline Epistles twice as often as the theology students from Group B. We may also calculate the average use of historical facts (Chapter 6).

• Numeric: Interval level. A variable with an interval level is intrinsically continuous but is measured discretely. Moreover, there is no clear absolute zero point; for instance, there is no absolute zero point for Beauty. Statements such as 'this painting is twice as beautiful as the next one' have no strict ratio interpretation because the Beauty scale has no absolute zero point. However, the assumption is

that judgments are given in such a way that the size of the difference between two values can be interpreted. Thus, the difference between two paintings with scores of 3 and 5 on a 7-point scale is as large as the difference between scores of 4 and 6. In practice this may or may not be a realistic assumption. Another example is esthetic judgments about the emotions evoked by music, which are measured with semantic differential scales (Chapter 17). The mean of an interval scale represents the score of an average individual. Deviations from the mean have an absolute zero point, i.e. the mean, so that one person's score can be twice as far removed from the mean as someone else's. Thus the variable 'deviation from the mean' has a ratio scale, and we may use these deviations as fully numeric variables.

• Categorical: Ordinal level. The values of an ordinal variable can be ordered, but the intervals between the values are not defined. We know that someone has the lowest value and someone else the highest, but it is not known how far apart they are. The median is the centre value; by definition half the observations are above the median and half below it. We can also specify a mode, the value occurring most often, but the mean or average value is a meaningless concept for ordinal variables. Painters can be ordered on Competence; one is considered the most competent, and one the least, with the others in between. However, the differences between the numbers assigned to competence are not interpretable. Thus, the difference between painter 1 and painter 2 may be hardly visible to an amateur, whereas the difference between painter 4 and painter 5 may be glaringly obvious. Another example is the use of particular rhythmic patterns in sentence endings, where the frequency of occurrence should be treated as an ordinal variable, i.e. only as an indication of whether one type is used more often than another (see Chapter 14).

• Categorical: Nominal level. A nominal variable consists of categories which are labels; they may be numbers, but these have no numeric meaning. Nominal variables are often attributes or descriptive categories; the variable Fruit, for instance, may have apple, pear, and pineapple as categories. They are obviously not ordered. We can specify which category is most common, i.e. the mode, but much more cannot be said. In an orchestra the instrument a musician plays can be seen as an attribute of that musician, but that is all. The variable Instrument is really only used to classify the musicians. Similarly, a grave may contain a particular artefact, or not; the variable Presence of artefact is in this case a binary variable, generally coded 0 (absent) and 1 (present), but no further numeric statement is possible (see Chapter 8).

These distinctions between measurement levels are important because they determine in large part which operations on the values of the variables are meaningful, and hence which techniques are relevant for their analysis. It should be mentioned that a distinction can be made between the intrinsic measurement level and the actual measurement level. For instance, at a fair, small children below a particular height are not allowed in certain attractions because it would be too dangerous for them. All that is necessary for the fair master is to measure

whether they can pass under a horizontal stick at the appropriate height. Although the measurement level of Length is intrinsically numeric and ratio, here the actual measurement level is binary. A third variety is the analysis level: in practice, a variable may be analysed at a different level than its intrinsic, or even actual, level. In certain situations the measurement level may be downgraded by analysing a numeric variable as a categorical one, for instance, when it has a very peculiar distribution, say one with two equally important modes. Such bimodal distributions can make standard interpretations of analysis outcomes unreliable or even inappropriate. For instance, if we want to use Length as a predictor for Weight in two groups, but one group consists only of children and the other group of adults, 'short' and 'tall' will be sufficient.
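To make the measurement-level distinctions concrete in software terms, the toy sketch below (Python with pandas; all values invented for illustration) shows which summaries remain meaningful at each level.

    import pandas as pd

    # Nominal: labels only; the mode is about the only meaningful summary.
    instrument = pd.Series(["violin", "oboe", "violin", "cello"],
                           dtype="category")
    print(instrument.mode()[0])          # most frequent category

    # Ordinal: ordered categories; median and mode are meaningful, a mean is not.
    competence = pd.Series(
        pd.Categorical(["low", "high", "medium", "medium"],
                       categories=["low", "medium", "high"], ordered=True))
    print(competence.mode()[0])

    # Numeric (interval/ratio): means, variances, correlations are interpretable.
    weight = pd.Series([10.0, 5.0, 7.5])
    print(weight.mean(), weight.var())

Asking pandas for the mean of one of the categorical series raises an error in recent versions, which is in effect the software enforcing the measurement-level rules described above.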

1.5.3 Characteristics of data types

It is useful to list some data types. These are determined by the way the data are measured. What follows is not an exhaustive enumeration, but is only meant to give an idea of some of the possibilities which figure in the case studies. The list of types has largely been taken from Nishisato (1994, p. 23).

• Numeric measurement. In this case it is assumed that, in principle, the values of a variable are derived from a continuous ordered scale and, depending on the subtlety of the measurement instrument, can be measured with 'infinite accuracy'. Typically the length of a musical note and the weight of a book fall into this category, but the tensile strength of the canvas of a painting would qualify as well. In practice, all measurement is categorical, because there is a limit to the accuracy with which an instrument can measure. In daily practice, continuity of measurement is regularly assumed.

• Counting. Frequencies are ubiquitous and generally used. They are necessarily discrete, but mostly treated as numeric data.

• Rating scales. There are two common variants. The first is unipolar (say, going from 0 to 7 or 9), with the endpoints corresponding to 'not at all beautiful' and 'very beautiful', respectively. The other is bipolar, going from one marked end, 'very ugly', to another marked end, 'very beautiful'; the end markers are antonyms. The latter type is a common form of bipolar scales called semantic differential scales—first formulated as such by Osgood (1952). These adjective scales generally have seven divisions (mostly coded from 1 to 7), running from one adjective to its antonym, so from 'cold' (1) to 'hot' (7). Such a scale is generally referred to by its high end (here: 'hot'), so that high scores (6 and 7) indicate that something is 'very hot' and low scores (1 and 2) that it is very 'not-hot', i.e. 'cold'; 4 is considered the neutral point—'not warm/not cold'. Semantic differential scales feature in Chapters 9, on Marian art, and 17, on the Chopin Preludes.

A common assumption in semantic differential studies is that the participants are a random sample from a larger population, so that they are representative of that

population. In particular, it is assumed that the population contains no specific groups with fundamentally different judgments, so that the relationships between concepts and scales are essentially the same for everybody in the sample. Another option is to drop the assumption of a random sample, and concentrate on the different ideas individuals may have about the relationships between the concepts and scales. A small coding sketch for such scales follows after this list.

• Rank-order data. The participants are faced with a number of alternatives which have to be ordered with respect to some criterion, for instance, who is the most fit for a job. In this case there is no need to specify the size of the difference between the ranked alternatives.

• Multiple choice. A truly categorical type. The researcher or participant chooses one option from a set of options. Many exams consist of multiple-choice items.

• Paired comparisons. The participant is faced with a choice between two items and has to indicate which one is preferred, how much one is preferred over the other, or how much the items resemble each other.
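As a minimal illustration of how semantic differential scores might be handled in practice (Python with pandas; the scale names and ratings are invented for this sketch), note that 'referring to a scale by its high end' is just a matter of how the column is coded, and that reverse coding flips it:

    import pandas as pd

    # Two 7-point semantic differential scales, coded 1..7 and named by
    # their high ends: 'hot' and 'beautiful'. Values are invented.
    ratings = pd.DataFrame({
        "hot":       [2, 6, 4, 7],   # 1 = very cold ... 7 = very hot
        "beautiful": [5, 3, 4, 1],   # 1 = very ugly ... 7 = very beautiful
    })

    # Reverse coding: 8 - x maps 1<->7, 2<->6, 3<->5, and leaves the
    # neutral point 4 fixed, so the scale is now named by its other end.
    ratings["ugly"] = 8 - ratings["beautiful"]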

1.5.4 From one data format to another

It is not unusual that the observed scores cannot be used in an analysis in the same form as they have been collected. A common situation is, for instance, that categorical observations are labelled with words ('red' or 'blue'), but words do not lend themselves to numeric analysis. In such a case numbers may be assigned to the labels, and, depending on the research questions, one of the above measurement levels is then applied.

For several reasons, both practical and related to the analysis, raw data may be converted into derived data, which are then analysed with various techniques depending

on the research design, research questions, measurement levels, etc. Typical derived data or data schemes are contingency tables for counts, correlation matrices, covariance matrices, and similarity matrices (see Figure 1.3).

Fig. 1.3 Transforming raw data to derived data
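In code the transition from raw to derived data is often a one-liner. The sketch below (Python with pandas; the small dataset is invented for illustration) derives a contingency table and a correlation matrix from the same raw data matrix:

    import pandas as pd

    # Raw data matrix: one row per individual.
    raw = pd.DataFrame({
        "eye":  ["blue", "dark", "blue", "light", "dark"],
        "hair": ["fair", "dark", "red", "fair", "black"],
        "x1":   [1.2, 3.4, 2.2, 0.7, 2.9],
        "x2":   [0.3, 2.8, 1.9, 1.1, 3.0],
    })

    # Derived format 1: contingency table of two categorical variables.
    table = pd.crosstab(raw["eye"], raw["hair"])

    # Derived format 2: correlation matrix of the numeric variables.
    corr = raw[["x1", "x2"]].corr()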

1.6 General structure of the case study chapters

Given that the major aim of the book is to show how research questions can require certain types of data to be collected, and how these data can be analysed, I thought it would be helpful to have a fairly uniform structure for the case study chapters. The general outline of these chapters is given below, but readers must bear in mind that not all chapters will need all elements, and some chapters will need further elaborations. Inevitably, there will be some overlap in the statistical descriptions in the chapters, to allow for (relatively) independent reading.5 The various sections, and their contents, are

• Background. A short introduction to the field from which the data have originated, as well as relevant information about the particular study in which the data were collected.
• Research questions. The specific questions for which answers are sought.
• Data. The dataset for the example, and some of its essential characteristics.
• Analysis methods. A discussion of how the research questions and the data can be modelled so that they are amenable to numeric analysis. In particular, the statistical methods used in the chapter will be explained in fairly general terms, as well as why they are applicable to the specified problem.
• Statistical analysis. Presentation of the appropriate statistical analyses.
• Other approaches. Any relevant information on other methods with which the problem and/or the data can be analysed.
• Content summary. The most important results of the multivariate analyses, with particular focus on what the multivariate techniques have to offer in this specific case. My aim here was to make the results understandable for the non-statistically inclined.

5 I have sometimes used a typical Dutch bracket construction, in that (dis)similarities will mean dissimilarities and/or similarities.

1.7 Author references

The author references in the Index are, strictly speaking, references to publications rather than to the authors themselves. This means that only first authors, with the dates of their publications, appear in this index, which facilitates finding the places where publications are referenced. It also means that second and later authors of a

publication cannot be found by name in the index; they are listed as authors only in the References.

1.8 Wikipedia

A brief note on the references to Wikipedia entries in this book. Even though there is limited academic control over the subject matter and the way it is presented in the online encyclopedia, references to Wikipedia are at times included because it is a readily accessible source of general knowledge on all sorts of topics. Given that this book covers topics from many domains, it will be difficult for readers to have easy and direct access to scholarly works on all these areas. Thus, for a first orientation the online encyclopedia Wikipedia is a good resource, but for serious study one should refer to scholarly works on the topic.

1.9 Web addresses

With regard to the web addresses mentioned in this book, note first of all that web addresses are generally not long-lived, so do not be surprised if you cannot find some of them immediately; you may have to search the web for the particular source under another name. Secondly, printing constraints sometimes lead to addresses being wrapped over several lines, which generally means that in the PDF file of this book they cannot be clicked on for direct access. You will have to copy the complete address and paste it into your browser to get the desired access.

Chapter 2

Data inspection: The data are in. Now what?

Abstract Main topics. The main purpose of this chapter is to show what a researcher has to do before the start of a multivariate data analysis. Data. Once the data for a study have been collected, they have to be carefully examined. The more important ways to do this are presented here. Some dataset questions. Do the variables have distributions which make the intended analyses feasible? Have appropriate decisions been taken with respect to missing observations? Are the data free from coding errors and other rogue data points? Can unusual data points be trusted not to influence the outcomes in an inappropriate way? Statistical techniques. Coding checks, outlier checks, missing data checks, distribution inspection, inspection of relationships between variables, and investigating what adjustments may need to be made to the variables to make analysis possible.

Keywords Bivariate association · Data transformation · Graphical representation · Missing data · Normal distribution · Outliers · Scatterplot · Kurtosis · Skewness · Statistical assumptions

2.1 Background

Once the data are in, several preliminary investigations have to be carried out before the proper analysis can begin. Much can have gone awry during the data collection, and it is the responsibility of the researcher to make sure the outcomes of the data analysis can be trusted. In this chapter a case is made for careful inspection of the collected data before an analysis begins; various problems will be outlined, and approaches to tackling them will be suggested.

Fig. 2.1 The value of graphing: Anscombe's quartet. For each panel: mean of X = 9; mean of Y = 7.5; standard deviation of X = 3.3; standard deviation of Y = 2.3; correlation r(X, Y) = 0.82; regression line: Y = 3.00 + 0.50X

2.1.1 A researcher's nightmare

A group of researchers in art had taken samples in four communities to see whether the people of these communities differed in their appreciation of modern art, and whether this opinion depended on their intelligence. They were a bit lazy and had collected data from only 11 people in each of the 4 communities. To their utter surprise the statistics (= numeric summaries) were all the same: the mean (= average) and standard deviation (= spread) for both intelligence (X) and appreciation of modern art (Y) were identical in each community. Even more surprising was that the correlation and regression (= predictive relation between the two variables) were also the same in each community. They were about ready to forget about the differences between the communities, as the statistics suggested that they were the same anyway, and considered that they would be much better off with one sample of all 44 persons, because they had heard that the bigger the sample, the more powerful the analysis. At the last minute they decided to have a look at the graphs of intelligence versus art appreciation per community, and they nearly had a fit. Look at Figure 2.1, and see what upset them.

The relationship between intelligence (X) and art appreciation (Y) was totally different in each community, although all indications of the statistics suggested the

Table 2.1 Descriptive information about Anscombe's quartet

Variable   Valid N   Min   Max    Mean   Median   Std. dev.   Skewness   Kurtosis
X for 1    11        4.0   14.0   9.0    9.0      3.3          0.00      −1.20
Y for 1    11        4.3   10.8   7.5    7.6      2.3         −0.07      −0.54
X for 2    11        4.0   14.0   9.0    9.0      3.3          0.00      −1.20
Y for 2    11        3.1    9.3   7.5    8.1      2.3         −1.32       0.85
X for 3    11        4.0   14.0   9.0    9.0      3.3          0.00      −1.20
Y for 3    11        5.4   12.7   7.5    7.1      2.3          1.85       4.38
X for 4    11        8.0   19.0   9.0    8.0      3.3          3.32      11.00
Y for 4    11        5.3   12.5   7.5    7.0      2.3          1.51       3.15

Data source: Anscombe (1973)

opposite. In Community 1 higher intelligence seemed to indicate more appreciation in an orderly way. In Community 2 there was a perfect relationship, but at the higher end of intelligence the appreciation started to decline systematically. In Community 3 there was again a perfect relationship between the variables, except for a very intelligent person who was very enamoured with modern art, and in Community 4 there was little to say about the relation because all persons but one were equally intelligent. The one outlier was clearly a modern art aficionado. The precise data are listed in Table 2.1.

Where the data really came from

To underscore the need for data inspection rather than sole reliance on numeric summaries such as means and variances, Anscombe (1973) constructed the quartet of datasets shown in Table 2.1 such that in each of them the X and the Y variables have the same mean and variance. If the variables were all normally distributed, all distributions would be identical. However, as we see in the table, they have different medians, skewness, and kurtosis (fatness of the tails of the distribution). This makes a tremendous difference when we graph the Xs against the Ys, as the panels of Figure 2.1 show. Yet the correlations and regression lines in Figure 2.1 are all the same. This illustrates that inspection of the data purely on the basis of their statistics is insufficient; actually viewing and graphing the data is an absolute must. The four panels show that summary statistics can hide a multitude of sins.1

1 See for further discussion and a listing of the data: https://en.wikipedia.org/wiki/Anscombe's_quartet.
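To connect the story to practice, the following minimal sketch (Python with NumPy) recomputes the summaries for the four communities, using the quartet values as published by Anscombe (1973) and listed at the page referenced in the footnote. It reproduces the nightmare: four (nearly) identical sets of statistics for four very different point clouds.

    import numpy as np

    # Anscombe's quartet as published (Anscombe, 1973).
    x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
    datasets = {
        1: (x, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                         7.24, 4.26, 10.84, 4.82, 5.68])),
        2: (x, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                         6.13, 3.10, 9.13, 7.26, 4.74])),
        3: (x, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                         6.08, 5.39, 8.15, 6.42, 5.73])),
        4: (np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float),
            np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                      5.25, 12.50, 5.56, 7.91, 6.89])),
    }

    for name, (xi, yi) in datasets.items():
        slope, intercept = np.polyfit(xi, yi, 1)   # least-squares line
        r = np.corrcoef(xi, yi)[0, 1]
        print(f"{name}: mean_x={xi.mean():.1f}  mean_y={yi.mean():.2f}  "
              f"r={r:.2f}  fit: y={intercept:.2f}+{slope:.2f}x")

Plotting each (x, y) pair, as in Figure 2.1, is the step that finally reveals the differences.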

Data are a precious commodity, and a conscientious investigator collects them carefully.2 However, errors can and do occur, and they can throw a spanner in the works. Therefore it is necessary to carry out a data inspection before starting seriously with the multivariate data analysis. In this chapter I will discuss a few procedures, but for an exhaustive data inspection standard textbooks should be consulted.

Once the data are in, researchers are dying to know whether the data confirm their theories, present revealing information about the relations between variables, and reveal object groups that were predicted to exist. However, it should be realised at the outset that if the data collection is flawed in whatever way, the research is doomed. There will be no convincing your colleagues of the value of your results. Reviewers will vilify your paper, and there is no defence possible. Moreover, journals will not want to publish the results of analyses based on your data, and rightly so. In other words, flawed and incorrectly sampled data are irreparable, and not even brilliant statisticians can help to set things right. To quote Crowder and Hand (1990, p. 130):3

The first thing to do with data is to look at them ... [This] usually means tabulating and plotting the data in many different ways to 'see what's going on'. With the wide availability of computer packages and graphics nowadays there is no excuse for ducking the labour of this preliminary phase, and it may save some red faces later, for example when an important correlation is found to be just due to a misplaced decimal point in the original data.

2.1.2 Getting the data right

To start with, if at all possible you should have a look at the data directly. Simply looking at the actual numbers will give you an idea of what sort of values there are in the dataset. There is a chance that gross flaws show up directly: a subset of the data is clearly different from the rest; one case has numbers twice as large as any other case; nearly all cases have the same value for a particular variable; one variable has nearly all scores missing; etc.

An important aspect of the data which cannot be assessed by looking at them, and usually not even by careful numeric data inspection, is the representativeness of the cases in the dataset. This is often discipline-related and outside the data themselves. It should have been taken care of during the planning of the study and the actual collection of the data. Many studies rely on (approximately) random samples from populations in order to justify that the results are representative of the specific population of interest. Many hypothesis-testing procedures rely heavily on this random sampling assumption.

2 This chapter is partly based on lectures from a data analysis course given over many years at the Department of Education and Child Studies at Leiden University, The Netherlands. Direct and indirect contributions from Marielle Linting, Ralph Rippe, Anja van der Voort, and Joost van Ginkel, as well as the many PhD candidates who taught working groups, are gratefully acknowledged.
3 Quoted in Cook, Lee, and Majumder (2016).

Only in very rare cases will it be possible to 'adjust' the population to your sample. As an example, say it turns out that there were only four men in your sample of 150 humanities students. This unbalanced situation can be 'solved' by deleting the men, and indicating that your results are only applicable to female students. Thus your initial population is changed from humanities students to female humanities students. It saves your design and your study, but limits the context.

Random sampling is not always possible or desirable. In various fields in the humanities either the population proper is investigated, or the data come from nonrandomly collected samples. This is because often the data at hand are all the data there are. Think, for instance, of archeological excavations, historical archives, or all the works of a particular composer. In such situations the investigation is not necessarily driven by hypotheses, but 'merely' by the curiosity of the researcher who wants to know what the data have to say; a perfectly legitimate aim. It is the exploration of the data which is the prime activity in such a study, rather than the confirmation of theoretically derived hypotheses. This situation will be encountered regularly in the case studies presented in this book,4 but specific hypotheses will be put to the test only occasionally. One of the consequences is that many of the statistical techniques discussed will have an exploratory and/or descriptive character, and that the results of analyses will be primarily supported by a variety of graphs rather than significance tests.

This chapter is mainly concerned with what has to be done before the real analyses can begin. It has to be justified, and sometimes shown, that the data are credible and trustworthy. Without full confidence in the data themselves, analyses can only lead to uncertain and possibly incorrect information and conclusions.

4 The case studies presented in Part II are based on real data, either listed in publications or kindly provided by the owners of the data or their colleagues.

2.2 Data inspection: Overview

Apart from making the data suitable for analyses, there is an additional advantage to data inspection: it will provide insight into the elementary characteristics of the data and so give an initial feeling for what they are like. It will also give an indication of, and background to, what to expect from the multivariate analysis. If there are discrepancies between the information at a basic level and at a more sophisticated level, this forces us to think carefully about the nature of the data. Suppose that three variables are highly correlated among each other, but there is no trace of this in the multivariate analysis. Such diverging outcomes need careful consideration as to their possible causes. Without an initial look at the basic data characteristics such discrepancies may go unnoticed. In other words, walk before you run: be sure of the simple information before you go the whole way with complex multivariate techniques.

Data inspection has a number of aspects. In particular:

∗ Inspection of the distributions of the variables, often via graphical representations.
∗ Checking for any adjustments necessary to make the variables behave better during the analyses. For instance, the data of Communities 2, 3, and 4 of Figure 2.1 can be problematic (see the explanation above).
∗ Assessment and handling of any missing data.
∗ Assessment and handling of any outliers.
∗ Assessment of the statistical assumptions required for proper execution of the analyses and interpretation of the results.
∗ Inspection of residuals to assess how well a model (for example, a regression line or normal distribution) fits the data at hand.

To evaluate the distributions of the observed variables in the dataset, we first have to know their measurement levels. The measurement process has, of course, been designed by the researcher, but once it is defined we have to establish what the characteristics of the empirical distributions are, and what we can do with them statistically. There are four basic measurement levels, two categorical (nominal and ordinal) and two numeric (interval and ratio). Each has its own characteristics. Usually the nominal level is considered the lowest, because its values have the fewest possibilities for mathematical manipulation (see Section 1.5.2). To illustrate several of the possibilities below we will use information from real projects, but not necessarily label the graphs and tables with the original variables, partly to avoid lengthy explanations of the projects and the variables involved.
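In practice, the first of these inspection steps can be started with a handful of one-liners. The sketch below (Python with pandas) is a minimal first-look checklist; the file name and the column name "grade" are hypothetical stand-ins for your own data.

    import pandas as pd

    df = pd.read_csv("mydata.csv")       # hypothetical dataset

    print(df.describe())                 # mean, spread, min/max per numeric variable
    print(df.isna().sum())               # number of missing values per variable
    print(df["grade"].value_counts())    # coding check for one categorical variable
    print(df.corr(numeric_only=True))    # correlations between numeric variables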

2.2.1 The normal distribution

The theoretical normal or Gaussian distribution permeates all of statistics,5 but in itself there is nothing normal about this distribution. Because it has all kinds of nice mathematical properties, it rears its head in many statistical methods. Empirical numeric scores with an approximately normal distribution show up in a graph as a bell curve: symmetric around the mean, with most scores in the middle and tails which taper off on both sides. In Figure 2.2 the solid line is the mathematical normal probability distribution, and the histogram is the observed sample distribution. Obviously an observed distribution is never exactly equal to a theoretical one; the latter is a model for the former (see Section 3.4.2, p. 51, for a discussion of the concept of a model).

Often the normal distribution is used in its standardised form, i.e. with a mean of zero and a standard deviation of one. For the normal distribution some two-thirds of the scores lie between −1 and 1 standard deviations, and all but about five percent between −2 and 2 standard deviations.

5 Gauss was a 19th-century German mathematician who made many significant contributions to mathematics and statistics. He was considered by some the greatest mathematician since antiquity; https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss.

Fig. 2.2 A theoretical normal distribution superimposed on the empirical histogram of an observed variable; both have the same mean and standard deviation.

The frequent use of the normal distribution in statistics and data analysis derives from the fact that many natural phenomena, such as intelligence, have a roughly normal distribution. The normal distribution also occurs because in large samples the means of empirical distributions are approximately normal, and because random errors tend to have such a distribution as well.

∗ Suppose one morning you weigh yourself a hundred times in a row with two-decimal precision. You will see that the distribution of these measurements resembles a normal distribution because of the random errors you make every time you weigh yourself. The mean of the 100 measurements will be in the middle, and the variability around the mean reflects the random errors you made during weighing.

As mentioned above, many statistical techniques make extensive use of the normal distribution because of its convenient mathematical properties and easy handling in advanced statistics. Fortunately, techniques based on the normal distribution are not very sensitive to mild deviations of the sample scores from normality, especially in large samples. It is important to note that quite a few collections of numeric scores have distributions which approximate the normal distribution reasonably well. On the other hand, there are many others that do not, and it is therefore important to inspect your data before analysis to establish whether you may use statistical techniques based on the normal distribution.

In Figure 2.3 we see two single-peaked probability distributions.6 One is more or less symmetric, and the other is asymmetric and skewed to the right. Obviously the latter has a much larger standard deviation (σ) because the scores are more spread out around the mean. The positions of the central tendency measures are also vastly different: the medians of the two distributions are in the same spot, but typically for such distributions their means lie far apart, as do the modes. If we have to choose one characteristic measure for the distributions, the median takes the prize.

6 Source: https://en.wikipedia.org/wiki/Skewness.

Fig. 2.3 A reasonably normal distribution (solid line) compared to a skewed one (dotted line). Relative positions of distribution centres are shown as a function of the skewness of the distributions: mean, median, and mode. The medians coincide.

However, many variables in research projects are not numeric at all, but categorical. This clearly requires a totally different approach towards distributions and summary measures. This will become a recurring topic in the case studies, and will be discussed in connection with the data in question.

2.2.2 Distributions: Individual numeric variables

If we want to look at the distributions of variables one by one, a relatively efficient way is to graph them. A graph such as Figure 2.3 gives visual information about the mean and the spread, as well as the skewness and kurtosis of the distribution. This is especially relevant for numeric data.7 Gifi (1990, Chapter 1, especially Section 1.6) provides a thorough but rather technical account of the state of affairs in this respect. If you want to be informed, this would be a good source to consult; my account here is probably a bit too optimistic.

Graphing distributions

If we have a numeric variable with many possible values, a stem-and-leaf plot can be very informative, provided we have a reasonable but not exorbitant number of persons, objects, etc. available. A special feature is that all observations are visible, as is the case in Figure 2.4.

7 Kurtosis: from the Greek kurtós, κυρτός, bulging.

Fig. 2.4 Stem-and-leaf plot: Distribution of exam grades of 143 students; range: 0–100.

We see the exact values of the lowest score (23) and the highest score (98). The median can be found by searching for the value above and below which 50% of the students score; just count them. Here the median is just under 64. It is clear that the general shape does not deviate much from a normal distribution, i.e. it is reasonably symmetric and has a single peak roughly in the middle.

The histograms in Figure 2.5 (taken from another study) do not show the original individual scores. Histograms can, in principle, handle huge numbers of cases more easily than stem-and-leaf plots. The left-hand panel shows a distribution with two modes (or peaks), and we may wonder whether a variable with such a distribution can be included in standard statistical techniques. It might be worthwhile to investigate whether the sample does not in fact consist of two distinct groups: one larger group of lower scorers and a smaller group of higher scorers (probably from 24 and up). The right-hand panel of Figure 2.5 displays exam scores, and we see that there is a strange gap in the middle of the distribution. It looks as if there has been an attempt to artificially create a clear threshold between the failing and passing scores (around 55); possibly the instructor did not like students arguing about the grading in the hope of getting a few extra points to push their scores above the ‘fail–pass’ threshold. If the data are not inspected first, such anomalies may go unnoticed and could lead to strange or incorrect results and interpretations.

As was done in Figures 2.2 and 2.5, it is a good idea to superimpose the appropriate normal distribution, i.e. the one with the same mean and variance as the observed scores. This is often helpful to visually assess the relative normality of the distribution.
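Stem-and-leaf plots are easy to produce by program; the sketch below is a minimal Python implementation, applied to a handful of fictitious grades rather than the 143 real ones.

from collections import defaultdict

def stem_and_leaf(scores):
    """Print a simple stem-and-leaf plot: tens as stems, units as leaves."""
    leaves = defaultdict(list)
    for s in sorted(scores):
        leaves[s // 10].append(s % 10)
    for stem in sorted(leaves):
        print(f"{stem:2d} | {''.join(str(leaf) for leaf in leaves[stem])}")

stem_and_leaf([23, 35, 38, 41, 44, 44, 47, 52, 55, 58, 61, 63, 64, 66, 71, 74, 78, 82, 85, 98])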

Fig. 2.5 Anomalous distributions. Left-hand panel (a): A bimodal histogram. Right-hand panel (b): A suspicious gap in the distribution of exam results.

Regularising distributions

Not infrequently, distributions of numeric variables do not conform to a normal distribution but are seriously skewed. This may be statistically undesirable, but in the real world we would sometimes prefer it that way: a normal distribution for, for instance, depression would mean that there are quite a few people with a medium level of depression, and very few with no depression at all. In data analysis, one of the problems with nonnormal distributions is that many common statistical procedures have been developed under the assumption of normality, and therefore work best with normal distributions. One statistical solution is to transform the nonnormal distributions of such variables so that they no longer violate the assumptions required for the proper use of the statistical techniques, and to deal with outliers in an adequate way. This does, however, influence the interpretation. As an example, let us investigate the relationship between Family income and School readiness in the US; this example will be used later in this chapter and in other chapters as well.8

∗ School readiness and Income. In Figure 2.6 we see on the left the relationship between the original scores for Family income and School readiness. It shows that children from poor families are less ready for school. Many of them will not have had the ideal surroundings to start developing their intellectual abilities at a very young age. With increasing income School readiness increases as well, but not in a linear way. One could say that for the rich, each additional unit of school readiness for their children requires an ever larger investment, as is shown by the red curved line, which fits better than the black straight line. A statistician would say that the curved line looks like a logarithmic curve. After a logarithmic transformation of Family income the black straight line fits adequately, which shows that the transformation succeeds in linearising the relationship. The red line, constructed in a different way, fits reasonably well too. We will come back to this line in Section 3.8.2.

8 Many thanks to the National Institute of Child Health and Human Development—nichd—for making the data available.

Fig. 2.6 Effect of a log transformation on the relationship between Family income and School readiness. Left: Observed scores; Right: Logarithmically transformed Family income.

Fig. 2.7 Effect of a log transformation on the distribution of Family income.

∗ Figure 2.7 shows the distributions of Family income before and after the logarithmic transformation. The left-hand distribution is skewed to the right: there is a long tail at the right-hand side and hardly one at the left. The right-hand distribution is more or less symmetric and resembles a normal distribution, apart from a couple of spikes on the left. In other words, the properties of Family income have been much improved by the transformation, and the variable can now be used in multivariate analyses which require normal distributions. Later (see, for instance, Chapter 7) we will discuss procedures which make even more intensive use of transformations to improve the quality of the analyses.

However, we should realise that transforming variables also comes at a cost. Without transformation the variable Family income could easily be interpreted in terms of dollars; after transformation we have to talk about income in terms of log dollars. This is possible, but it should be taken into consideration during the interpretation.
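The effect of such a transformation on skewness can be illustrated with simulated, income-like scores; the lognormal parameters below are made up and do not refer to the nichd data.

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(7)
income = rng.lognormal(mean=10.5, sigma=0.8, size=1000)  # skewed to the right, like incomes

print(f"skewness before log transform: {skew(income):.2f}")          # strongly positive
print(f"skewness after log transform:  {skew(np.log(income)):.2f}")  # close to zero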

Table 2.2 Descriptive information for several variables with common and unusual characteristics; based on the fictitious scores of 100 persons.

Variable name  Valid N  Min  Max   Mean  Standard   Standard  Standard  Measurement
                                         deviation  skewness  kurtosis  level
Gender           100     0     1    .60    0.24      −1.00      1.00    binary
Religion          70     1     4   2.73    0.92       0.83      1.95    nominal
Age               98     4    99  56.10   10.48       5.95      3.88    numeric
Depression        69     1     9   2.35    1.74       3.89      4.33    interval
Happiness         99     1     7   6.00    0.30       0.09      7.54    ordinal

Only valid observations: 56

2.2.3 Inspecting several univariate distributions

In this subsection we will indicate the most important topics involved in inspecting several variables at the same time: where to look and what kinds of phenomena to look for. Giving adequate advice on what to do when anomalies have been observed is difficult, because the contents of the variables and the field of investigation will play an important role as well. In the case study chapters we will offer a few suggestions on how to handle such problems when they arise.

Basic descriptives

To get a quick look at the variables together it is advantageous to make a table of plain descriptives (Table 2.2). We may do this irrespective of the measurement level, as long as we focus on inspecting the numbers themselves rather than trying to interpret them substantively. Such a table should at least contain the number of valid observations, minimum, maximum, mean, standard deviation, standardised skewness, standardised kurtosis, and a reference to the measurement levels. When unexpected and/or unusual values come to the fore during the inspection of the descriptives, we would generally want to graph the complete distributions to see what is going on.

Presence of missing data

The first thing to check is whether there are missing data. Most statistics programs will show not only how many valid data there are per variable, but also how many cases have only valid values, i.e. have no missing data. From Table 2.2 we see that Depression has the lowest number of valid observations (69); nearly a third of the respondents did not answer this question. Only just over half (56) of the respondents had valid scores on all variables. In any case everybody was prepared to declare their gender. We will turn to the problem of missing data in Section 2.3.
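A table like Table 2.2 is quickly produced by program; the sketch below assumes the data sit in a pandas DataFrame called df with numerically coded variables. Note that pandas reports plain rather than standardised skewness and kurtosis, so that extra step would still have to be added.

import pandas as pd

def descriptives(df):
    """One row of descriptive statistics per variable, in the spirit of Table 2.2."""
    return pd.DataFrame({
        "valid N": df.count(),
        "min": df.min(),
        "max": df.max(),
        "mean": df.mean(),
        "sd": df.std(),
        "skewness": df.skew(),      # plain, not standardised
        "kurtosis": df.kurtosis(),  # plain, not standardised
    })

# Number of cases without any missing value (cf. 'Only valid observations')
# complete_cases = len(df.dropna())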

Values of data points and range

When interpreting variable descriptives, we should always take the measurement levels of the variables into account. The mean of the categorical variable Religion—where each observation only has a label, such as Muslim, Buddhist, Christian, or Agnostic—is obviously meaningless. There is no mean religion, precisely because assigning numbers to those labels is completely arbitrary. To get a first impression of the values of a variable, it is probably best to inspect its distribution by making a histogram.

For any type of variable it is useful to check its range, to verify that the values are as they should be. If a specific value has been coded by the researcher as an indicator for missing data, but this coding is not known to the statistical computer program, it will become blatantly obvious from inspecting the range (and the other statistics of the variable) that something is seriously wrong. In Table 2.2 an outsider faces the dilemma whether the maximum Age of 99 is a coding error, an indication of a missing data value, or a genuine observation of old age.

At times we may observe for a binary (0 and 1) variable that the mean is something like 0.02 or 0.93; this indicates that nearly every case has a 0 or a 1 score, respectively. In other words, there is not much variability in that variable, which might be problematic in further analyses. A mean of 0.5 indicates that there are equal numbers of zeroes and ones. In Table 2.2 this problem does not occur for the variable Gender; we can only observe that there are more ones than zeroes, i.e. more men than women or vice versa, depending on whether men or women were coded 0 or 1 (mean = .60).

Skewness and kurtosis

The next thing to inspect for numeric (ratio or interval) variables is the standardised skewness. A more or less normal distribution has a standardised skewness of roughly zero; values between −2 and +2 are generally acceptable, but this should not be taken as a strict law. A negative/positive skewness indicates that the distribution has a long tail on the left-hand/right-hand side, respectively. If this is the case it is advisable to make a histogram with a normal distribution superimposed, to evaluate the observed distribution and to decide whether anything needs to be done about it. Tails of skewed distributions generally consist of only a few observations, which means that in the tail region there is relatively little information about the phenomenon measured. This may affect interpretability, especially when we want to assess the relationships with other variables.

For the standardised kurtosis a similar strategy is advisable and similar decisions need to be taken. The kurtosis measures the fatness of the tails compared with the normal distribution; fatter tails generally go together with sharper peaks, and thinner tails with flatter peaks. A not uncommon situation is a concentration of scores around the lowest possible value, i.e. zero or one. For example, in general there are a lot of people with no or only a very light depression and very few with a heavy depression. Thus, one expects the distribution of depression to be strongly peaked at the lower end (i.e. to have a large kurtosis), with a long tail tapering off at the higher values, i.e. the distribution will also have a high positive skewness. The Family income variable mentioned earlier has such a shape: many people with low incomes and only a few millionaires (see Figure 2.7, left). Such things have to be noted and decisions have to be taken, such as whether to ignore the anomaly, apply some kind of transformation, perhaps even set a few outlying persons aside, or restrict the sample to only those values of the variable for which there is sufficient information. Alternatively, the scores may be grouped into a few ordinal categories (binning), for instance, from very low, low, neutral, high to very high; see Chapter 4. The two numeric variables in the table, Age and Depression, should be carefully checked, considering their high standardised skewness and kurtosis. A full evaluation cannot be given here because the data are fictitious. It is, however, no problem to conclude that our fictitious persons were a happy lot.

Fig. 2.8 Checking normality: histograms and Q-Q normal probability plots. Top: Histograms—Age entering university (left); Exam results (right). Bottom: Q-Q plots—Age entering university (left); Exam results (right).

Another generally available option to check distributions is to make a quantile-quantile plot, also called a Q-Q probability plot; when the normal distribution is involved it is also called a normal or Gaussian probability plot. Figure 2.8 shows a totally nonnormal distribution and a nearly normal one, the latter except for a few outlying low scores. Perhaps these scores were from students who only came to the exam to see what to expect next time. Further information can, for instance, be found in Hoaglin, Mosteller, and Tukey (1983, pp. 226–227), and in many general statistics books. Hoaglin et al. also contains a discussion of residuals, which often provide further information on distributions and deviations from fitted models.
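Most statistical software produces Q-Q plots directly; below is a minimal Python sketch, with simulated exam scores standing in for real data.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
scores = rng.normal(65, 10, size=200)  # hypothetical, roughly normal exam scores

# Points close to the straight line indicate approximate normality
stats.probplot(scores, dist="norm", plot=plt)
plt.show()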

2.2.4 Bivariate inspection From the point of view of data inspection it is also necessary to look at the bivariate relationships between the variables to check whether everything is as expected. We will again have to pay attention to different measurement levels, with the emphasis on what to look for and how to interpret the outcomes of the inspection. One of the important reasons for paying special attention to variables in pairs is that many analysis techniques are based on coefficients indicating the strength of association between such pairs: correlations, covariances, similarities, etc.

Pairs of numeric variables

Figure 2.6 contains scatterplots of the relationship between Family income and School readiness. The plots show how, with increasing income, school readiness increases up to a certain level of income. The relationship between the two variables is clearly nonlinear, and we saw that logarithmically transforming Family income increases the linearity between the two variables. Numerically it became slightly easier to predict School readiness from the logarithmically transformed Family income: the correlation increased from 0.35 to 0.40, i.e. the variance explained from 12% to 16%. The reason for the rather modest correlation is that for a given X value there is considerable variability in the Y variable, as can be seen in the figure.

When there are more than two variables, we still need to examine the pairwise relationships to assess whether they are linear (and thus suitable for most standard analyses based on this linearity), in particular, whether we can work confidently with correlations and/or covariances. A convenient display for a restricted number of variables is the scatterplot matrix, such as shown in Figure 2.9, where we see the mutual relationships between three parts of an exam as well as their distributions. The row variables (Part A, Part B, and Part C) are the response variables, and the column variables the predictor variables. The distributions seem sufficiently symmetric and single-peaked, so that we do not expect problems with analyses based on the normal distribution. However, it would be wise to look somewhat closer at the lowest scores in the distributions. Student 90 and possibly Student 40 may be the most troublesome, as they have low scores on both Part B and Part C of the exam. The relationships between the parts of the exam do not show a strong deviation from linearity, so that we could have used them safely in the remainder of the study, except for the probable outliers. The correlation between Part A and Part B is the lowest (0.23); the other two correlations are 0.34 and 0.54.

Fig. 2.9 A 3 × 3 scatterplot matrix for exam scores. Univariate distributions and relationships between the scores on three parts of the exam. The scatterplots below the diagonal are the mirror images of those above the diagonal. The numbers in red are the correlations of the variable pairs. The boxed numbers identify students.
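A scatterplot matrix like Figure 2.9 is a one-liner in most packages; for instance in Python, assuming the three exam parts sit in a DataFrame called exam:

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# exam is assumed to have the columns 'Part A', 'Part B', and 'Part C'
scatter_matrix(exam, diagonal="hist")  # pairwise scatterplots, histograms on the diagonal
print(exam.corr())                     # the corresponding correlation matrix
plt.show()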

Pairs of categorical variables

The relationship between two categorical variables, for instance ‘eye colour’ and ‘hair colour’, can be shown via a contingency table, which lists how many children of a school class have both blue eyes and blond hair, how many have red hair and green eyes, etc. Given the simple measurement of categorical variables (nominal or ordinal) there are few statistics that can be inspected. Probably the most important aspect to ponder with respect to an analysis is the role of rare categories with only a few observations. If some categories are exceedingly rare in comparison to the others, we might consider merging some of them, for instance, merging the numbers of Protestants and Catholics into a category Christians in a predominantly Muslim country. We should at least check whether the strength of the relationship between the two variables is affected by categories with very low frequencies.
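Constructing such a table, and spotting sparse categories, takes only a few lines; a sketch assuming eye and hair are two categorical columns of a DataFrame df:

import pandas as pd

table = pd.crosstab(df["eye"], df["hair"])  # counts for every eye-by-hair combination
print(table)
print(table.sum(axis=1))  # row totals: very small totals flag rare eye-colour categories
print(table.sum(axis=0))  # column totals: very small totals flag rare hair-colour categories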

Numeric and categorical variables

If we have both numeric and categorical variables in a study and we want to see whether there are problems with the variables, especially their relationships, we should make a distinction between a research design where the numeric variables are the response variables and a categorical variable is the predictor, and the design where the roles of the two types are reversed.

Numeric response variable; categorical predictor

A standard option here is to compare the means or medians on the response variable between categories. These can be evaluated by a t test or its nonparametric equivalent, respectively. Some kind of effect size can be assessed as well. If necessary, we can inspect the response distributions for the categories against each other using boxplots (see Figure 2.10).

Categorical response variable; numeric predictor

An option in this case is to categorise the numeric variable and make a contingency table to look at their relationship. Another way is to ‘forget’ which variable is the response and, as above, compare the means and distributions of the numeric variable for each of the categories of the categorical response variable.

Boxplots: numeric and categorical variables

A boxplot (Figure 2.10) is a figure that shows the main ordinal characteristics of a distribution in a compact way.

∗ The interquartile range is the central box. It indicates the middle 50% of the distribution.
∗ The median is indicated by a bar in the box. It is located at the point which splits the distribution into two equal halves.
∗ Whiskers indicate a range of 1.5 times the interquartile range below and above the box.
∗ Outliers are objects with values up to 3 times the interquartile range outside the central box.
∗ Extreme values are even further away.

Fig. 2.10 Boxplots. Characteristics of three exam parts. Range = 0–10; 6 is the passing grade.

Figure 2.10 shows the boxplot characteristics mentioned above for another three-part exam. There were some really bad scores, but on the whole the students performed well, assuming 6 is a passing grade. The fact that the whiskers touch the perfect score line (= 10) indicates that there were in fact students with perfect scores. We also notice that student 176 was really struggling with both Parts B and C, which were the statistical parts of the exam. The instructor would be wise to contact this student about these results. In addition, we should check whether this student affects the sizes of the correlations with other variables by calculating them both with and without this student.
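Side-by-side boxplots of a numeric response per category (here, per exam part) require little code; a sketch assuming parts is a list of three arrays with the scores on Parts A, B, and C:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.boxplot(parts)                                  # one box-and-whiskers per exam part
ax.set_xticklabels(["Part A", "Part B", "Part C"])
ax.axhline(6, linestyle="--")                      # the passing grade
plt.show()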

2.3 Missing data

The first maxim of data analysis is that there will (nearly) always be missing data, accidental or intentional. The second maxim should be: you always have to take a decision about the missing data before the analysis proper. Ignoring them is a decision as well.

It is good to realise that there is a difference between missing data and respondents who have missing data. It makes a lot of difference whether 10% of the data points are missing, or whether 10% of the persons have missing data. The total number of missing data points may be drastically reduced by leaving out persons with many missing data. However, the information in their valid data will then be lost as well, which may affect the representativeness of the sample. There may even be no respondents left if everybody with a missing data point on one of the variables is deleted.

The seriousness of the effect of missing values on the outcome and interpretability of the results depends partly on which underlying mechanism caused the missing data. Generally we distinguish between three such mechanisms: missing completely at random (mcar), missing at random (mar), and missing not at random (mnar); they are discussed below. How to handle missing data also depends on the type of missing data. For more detailed information and sophisticated treatments see Van Ginkel, Linting, Rippe, and Van der Voort (2019), McKnight, McKnight, Sidani, and Figueredo (2007), and Van Buuren (2018), of which there is an online version.9 In a fascinating article David Hand discusses missing data under the name dark data: ‘Dark data are data that you do not have. They might be data you thought you had, or hoped to have, or wished you had. You might be aware that you do not have them, or you might be unaware’. (Hand, 2020, p. 44)

2.3.1 Unintentionally missing

When the researcher assumes that the missing data are entirely due to some random process, the situation is referred to as missing completely at random (mcar). For instance, somebody got distracted and missed a question; the process responsible for the ‘missingness’ then has nothing to do with the topic under investigation, and the missing data do not depend on the observed data or any other kind of data.

When the data are missing at random (mar) they do depend on the observed data, but not on unobserved data. A notorious case is the survey in which voters from one particular political party were inclined to skip the question about their income, while everybody from the other parties meekly supplied it. As there was also a question about the favoured party, the missingness was dependent on the observed data.

2.3.2 Systematically missing

In the case of missing not at random (mnar) the missingness depends on nonobserved variables, details of which are unknown to the researcher. For instance, some specific archaeological objects should have been available for investigation but are not, perhaps because they were made of perishable materials; this can, however, no longer be checked. Alternatively, particular archaeological objects of a very specific type are not available for study, but the reason for this is unknown.

Yet another systematic cause of missing data may be the design of the study itself. The researcher may also create a missing data point because an observed score is so unlikely that it is considered better to declare the score missing, say a homeless person who has just won a million pounds in a lottery. Such a score may be better dealt with as if it were really missing, to prevent undue influence on the outcomes of the analyses. Obviously further investigation and adequate reporting of such observations are necessary.

9 https://stefvanbuuren.name/fimd.

2.3.3 Handling missing data

There are many ways to handle missing data, but unfortunately no method fits all circumstances. However, the better you know your own data, the easier it is to make the right choice. None of the methods is perfect; if information is missing there is no sure method to resuscitate it. Nevertheless, something will have to be done and the available options have to be considered. Ignoring the problem is a decision as well. But be aware that ‘[a]nalyzing data that you do not have is so obviously impossible that it offers endless scope for expert advice on how to do it’ (Macdonald, 2002).

There are several options for dealing with missing data, for instance, removing cases from your sample (listwise deletion); removing variables from your study; deleting observations from the calculation of a statistic such as the correlation (pairwise deletion); and estimating the missing data from other information (single and multiple imputation). One of the main arguments in deciding what to do is the assumed cause of the missingness, and the effect of the proposed handling procedure. There are questions of representativeness of the sample and outcome bias as well.

Removing cases with missing data; listwise deletion

If the missing data are scattered randomly over a large dataset, it may be possible to remove cases with missing data from your analyses without biasing the outcomes and without affecting the representativeness of the sample. Such innocent listwise deletions are rare, especially when there is a fair percentage of cases with missing values. If nearly all cases have missing data, you might, in the extreme, have virtually no cases left to analyse. A comparable danger is that you upset or destroy a carefully balanced design that you needed in order to have a strong experiment.
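Counting and removing missing data is straightforward in most packages; a Python sketch, again assuming the data are in a DataFrame df:

import pandas as pd

print(df.isna().sum())              # number of missing data points per variable
print(df.isna().any(axis=1).sum())  # number of cases with at least one missing value

complete = df.dropna()              # listwise deletion: keep only complete cases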

Removing variables

Removing variables might be a very effective way to reduce the amount of missing data, but it will be disastrous if this involves important variables in your research design. The probability of this is of course large, because why measure variables if they are irrelevant for the study?

An intermediate solution is pairwise deletion. Correlations and covariances between two variables can only be calculated for objects with valid scores on both variables. Objects with a score on only one of the two variables cannot be considered for the calculation of that correlation. Rather than deleting them from the study, they may still be included in the calculation of any other correlation, provided they have valid scores for the pair of variables in question. Thus, sometimes a case is included, sometimes it is not; that is why it is referred to as pairwise deletion. The problem is then that all correlations are based on different subsets of the overall sample, which might lead to odd results. Moreover, when there are many such cases, technical difficulties in the multivariate analyses may arise.

Imputation

Imputation is the procedure of first estimating reasonable values for the missing data, based on other information from the case itself and on information from other cases in the sample. Once such estimates have been made and inserted in the dataset, the intended analysis is performed as if the sample had been complete to start with. One of the dangers of this approach is that the imputed values are treated as if they are real and correct, which is not the case. Serious biases may occur in the results. This is especially likely when only one imputation is made for each missing data point: single imputation. Because the estimates are based on the other information in the dataset, there can be a decrease in variability within the dataset (there will be no random error or outside influences on the estimated values), which can lead to smaller overall random errors, and easier rejection of null hypotheses.

A more sophisticated procedure is multiple imputation, which comes down to completing the dataset more than once with different estimates for the missing data. The imputed values differ slightly each time because a random error is added to the estimates. The desired analyses are carried out for each completed dataset, and the different outcomes are combined into one overall analysis. One of the advantages is that we can then see how sensitive the results of the analysis are to minor differences or disturbances in the datasets. A useful article by Van Ginkel et al. (2019) provides an introduction to the procedure, and refutes misconceptions some researchers have about the technique. The relevant software is available in most large statistical packages as well as through special packages in R. A full-scale presentation can be found in Van Buuren (2018).
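Pairwise deletion is in fact the default behaviour of some software; in pandas, for instance, each correlation is computed from all cases with valid scores on that particular pair of variables. A sketch, with df as before:

r_pairwise = df.corr()           # pairwise deletion: every correlation uses its own subset
r_listwise = df.dropna().corr()  # listwise deletion, for comparison

# Large differences between the two matrices signal that the missing data matter.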

2.4 Outliers Above we have already seen a few outliers in the tails of skewed distributions (Figure 2.9) and in the graphs with boxplots (Figure 2.10). Here we will go slightly more deeply into the various kinds of outliers and how they can be handled.

2.4.1 Characteristics of outliers

A reasonable definition of an outlier is: an observation with a unique pattern which makes it recognisably different from the other observations. Outliers are not necessarily problematic for an analysis; this depends on how far they deviate from the other observations and, of course, on the other observations in general. It is a fact of life that most people have low-to-moderate incomes, but only a few are millionaires. A multivariate outlier need not be extreme on any one variable; it may only be in its combination of values on several variables that an observation is an outlier. Many people play an instrument badly and quite a few play in renowned orchestras, but a bad player in a renowned orchestra is certainly an outlier. On the whole, detecting outliers in multivariate data is hardly ever straightforward and easy.

The point is that outliers may influence the outcomes of an analysis by affecting the mean and the standard deviation; they may change the correlations with other variables; and they may lead to significant or nonsignificant results that would not be there without them. Anscombe’s quartet (see Section 2.1) shows that this is not always the case, i.e. that outliers may exist without affecting the principal information in a dataset. Graphing will help to discover them. On the other hand, in particular cases outliers have led to new and surprising discoveries; they might be exactly what is needed to make progress. A statistician working for a pharmaceutical toxicology company seems to have remarked: ‘All this discussion of deleting outliers is completely backwards. In my work, I usually throw away all the good data, and just analyze the outliers’.10

10 Quoted by Westfall and Hilbe (2007) in their review of Nassim Nicholas Taleb’s The Black Swan.

2.4.2 Types of outliers

We may distinguish several types or causes of outliers, such as

∗ Procedural outliers. Errors due to incorrect transcription, coding, or responding, etc.
∗ Exceptional deviations. A rare event has occurred that influences one or more persons; say, a Dutch artist painting in a purely Italian style because he has been trained in Italy.
∗ Unexplained exceptional scores. Unknown circumstances have led to a real but exceptional score, for instance, giving birth to six girls in a row. Random variability could be a cause, just like winning the lottery several times in a row. Only having children of the same gender may of course also have a genetic basis.
∗ Naturally skewed distributions. Outliers are a regular phenomenon in musical ability: many people have little or no proficiency in playing an instrument and only very few are very good at it.
∗ Bivariate and multivariate outliers. An unusual combination of scores on different variables at the same time: the scores on the separate variables may not be exceptional, but in combination they may be; for instance, winning the awards for best student in both chemistry and Greek is an unlikely occurrence.

2.4.3 Detection of outliers

As we have seen above, outliers in the distribution of a single variable can be found via histograms and boxplots, but there are also more involved statistical procedures for detecting them. These will not be discussed here; see Aggarwal (2017). The visual detection of outliers in numeric variables can be supported by calculating their standard scores and comparing these values with those from the standard normal distribution, as in a Q-Q plot (see Figure 2.8). Standard scores of −3 and +3 and more extreme values will certainly be candidates for further inspection. Standardised residuals in contingency tables can be used to discover unusual frequencies (see Chapter 4, p. 108).

Bivariate outliers can often be detected in scatterplots. At the same time other characteristics can be investigated as well, such as nonlinearity of the relationship and the occurrence of heteroscedasticity (unequal spread), especially in plots of residuals. Heteroscedasticity manifests itself, for instance, as a funnel shape of points in the graph: the higher the scores on the X variable, the more spread there is on the Y variable (this is, for instance, the case in Figure 4.10(b), p. 120). If it occurs we may consider searching for an appropriate transformation for one or both variables, as we did in the example with school readiness and family income.

∗ Figure 2.11 shows the relationship between the scores on two parts of an exam. There seem to be univariate outliers on the variables, such as students 133, 31, 83, and 74 for the Y variable (Part B), and students 27, 31, and 74 for the X variable (Part A). Whether these students are also bivariate outliers is more difficult to say, but most of them probably are. There are also formal statistical tests to decide whether data points are outliers or not.

∗ In Figure 2.11 we have marked five students as possible bivariate outliers, as they are far away from the centre of the cloud. As explained in Chapter 4 (see p. 113) the degree to which these students are outliers may be quantified via a large Mahalanobis distance (M). For the five students mentioned these distances are: M74 = 8.9, M31 = 10.7, M83 = 11.4, M27 = 18.0, and M133 = 19.1.

∗ These exams are certainly worth further investigation. For a start, the correlation between the two parts is .45 if all 206 students are considered; without the five students the correlation is .38. Whether this difference is large enough to worry about depends on many practical things outside our present discussion.

Fig. 2.11 Bivariate inspection of exam parts. Assessment of linearity and detection of outliers with respect to the joint mean.
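Mahalanobis distances to the joint mean can be computed in a few lines; a sketch, assuming X is an n × 2 array with the scores on the two exam parts (whether one reports the distance or its square is a matter of convention):

import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X to the joint mean."""
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

d2 = mahalanobis_sq(X)
print(np.argsort(d2)[-5:])  # row indices of the five most outlying students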

2.4.4 Handling outliers

Clearly it is the researcher’s task to make an effort to detect outliers and to decide what to do with them; in that sense the problem is similar to handling missing data. Some specific decision has to be made; the options are correction (in case of a coding error), deletion (creating a missing data point), adjustment (changing the value after careful and reasoned consideration), and transformation of the whole variable. The general preference is to keep the outliers in the analysis, especially if they are actually occurring values, because they may appear again in similar studies. It is advisable to carry out the analyses both with and without the outliers, so as to assess their influence on the results.

When outliers are interesting instead of an annoyance or pest, collect more of them and analyse them separately. For instance, a psychology researcher might be interested in the occurrence of depression symptoms. In a random sample from the population we would expect only a few people suffering from depression (they are thus the outliers). If our interest is primarily directed towards the study of depression, we have to design a sampling procedure that aims to collect only information from persons with a depression. In such a case it is the outliers which are of specific interest.

2.5 Testing assumptions of statistical techniques

It has been mentioned several times that statistical techniques have to be chosen first of all on the basis of the research questions. The characteristics of the data then further determine the choice of techniques, in particular the measurement levels and distributions of the variables. Information about these is necessary to make reliable assertions about the population on the basis of a sample drawn from that population. Such assertions belong to the formal domain of hypothesis testing.

2.5.1 Null hypothesis testing

Suppose we have carried out an experiment on paintings by applying two different kinds of varnish, and want to make a statement about their hardness after, say, one year. The null hypothesis could be that the hardness of the varnishes is the same. However, in any situation there is an error component, because even two varnishes from the same batch are never 100% equal. To make a probability statement about the comparative hardness we have to have the ‘hardness’ distributions for each of the varnishes, and need to know how much samples of the same varnish can differ due to chance alone. Thus, to evaluate the outcomes of the experiment, we have to know the shape, the variability, and the mean of the hardness distributions.

Many techniques for hypothesis testing are based on the normal distribution, and therefore special care is required when evaluating statistical results if clearly nonnormal distributions are involved. It is especially in such situations that nonparametric hypothesis tests may be of assistance. Furthermore, much robustness research has been undertaken to handle nonnormal distributions (see Chapter 4, p. 123 for a brief discussion of robustness). In many cases hypothesis testing involves testing the values of parameters of the population distributions using information from the observed sample; this procedure is called a parameter test.

Readers of this book are assumed to be familiar with the basics of null hypothesis testing, and such procedures are not explained here in any detail. They should consult a more elementary statistics book for that purpose. Some of the more common terms are included in the Glossary.
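The comparison just described is the classical two-sample t test; below is a sketch with simulated hardness scores (all numbers made up).

import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
varnish_a = rng.normal(50, 4, size=20)  # hypothetical hardness scores after one year
varnish_b = rng.normal(52, 4, size=20)

res = stats.ttest_ind(varnish_a, varnish_b)
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.3f}")  # small p: reject equal mean hardness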

2.5.2 Model testing There is another situation in which we often want support from a hypothesis test, in particular, when we compare two different models, or compare the outcomes of a model with the observations. Such tests are referred to as model tests, and are required when we want to know whether or not a model fits the data.

Suppose that we have a covariance (or alternatively a correlation) matrix computed from the observed data, as well as a theoretical model for the process which generated the data. Data computed on the basis of the theoretical model are called estimated data, implied data, or model-based data. When such estimated data are used to compute a covariance matrix, the question is whether this theory-based covariance matrix resembles the observed covariance matrix sufficiently for us to conclude that the model is a feasible one. A model test can supply information on whether the fit of the theoretical model is adequate, given that the assumptions for the test are tenable.

Note that the researcher hopes that a model test shows that the model cannot be rejected; in that case the model is a feasible representation of the data. Whether the model is indeed the ‘true’ model will only be known after much further research. If we adopt the adage that Data = Model + Residuals, we want to show that the nonmodelled part of the data consists only of residuals, which in most circumstances are mostly random errors.

2.6 Content summary To summarise the major procedures available for data inspection I have constructed a data inspection chart (Figure 2.12) listing the primary techniques used in this book. It is advisable to move from simple techniques to more complex ones, so that it is easier to understand the nature of the data on the basis of knowledge gained in the previous step.

Fig. 2.12 Summary of data inspection procedures. The procedures are arranged in three categories of increasing complexity of the data.

Chapter 3

Statistical framework

Abstract Research questions. This chapter concerns the characteristics of research questions, data designs, and the more common multivariate statistical methods. Data. Details are provided on different data formats and types that feature in the case studies. Properties of variables, including measurement levels, are also discussed. Designs. Two general data designs are distinguished: Dependence designs related to the General Linear Model (glm), and Internal structure designs centring around the structure of sets of variables. Statistical techniques. Analysis methods for the two data designs, general linear model, variants of principal component analysis, factor analysis, multidimensional scaling, correspondence analysis, cluster analysis, loglinear analysis, and three-mode analysis. Additional issues: model selection, model evaluation, and presentation of analysis results. Keywords Contingency tables · Data format · Data type · Data design · Dependency designs · General linear model · Internal structure designs · Research design · Statistical techniques · Three-mode analysis.

3.1 Overview In this chapter the main statistical procedures used in this book are reviewed. The presentations require some statistical background, but for easier understanding I have provided concrete examples for most of the procedures. In Chapter 4 Statistical framework extended, several methods and concepts are treated more technically for a deeper understanding. In that chapter the topics are arranged alphabetically to facilitate consultation. The case study chapters also contain some statistical explanations, similar to the ones in this chapter, to make them more self-contained, and to facilitate their independent reading.

3.2 Data formats The two-way data formats considered here are the data matrix, contingency table, correlation matrix, covariance matrix, and similarity matrix. In addition, several three-way formats are discussed: the three-way data matrix of objects by variables by conditions or time points, sets of correlation matrices, sets of similarity matrices. Three-way contingency tables are also mentioned in passing but not treated in great detail. Multiway formats exist as well but are not really discussed.

3.2.1 Matrices: The basic data format

As discussed in Section 1.5.1, in many fields the basic data structure is a two-way data matrix of objects × variables, where objects can be anything from cultural objects, persons, paintings, and music pieces to graves, etc. Variables can have all kinds of contents, depending on the particular field, such as musical rating scales, function words, amount of livestock, artefacts in graves, reading materials, types of cracks in paintings, melodic character of compositions, etc. Variables can also have various measurement levels (Section 1.5.2, p. 11).

The standard notation for the elements of a data matrix is x_ij, where i is the index for the I rows and j for the J columns. The I × J matrix itself is indicated by X. In general we will avoid mathematical notation as much as possible, but in some cases mathematical notation may be seen as an illustrative explanation used in the same way as figures.

∗ Example of a two-way data matrix: A collection of paintings (rows) are each evaluated on a number of variables (columns), for example, age, quality of execution, use of colour, topic, number of persons depicted, stage of deterioration, etc.
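In code, such a data matrix X with elements x_ij is simply a table with one row per object and one column per variable; a tiny sketch with invented values:

import pandas as pd

X = pd.DataFrame(
    {"age": [152, 310, 87], "persons depicted": [3, 1, 5], "deterioration": [2, 4, 1]},
    index=["painting 1", "painting 2", "painting 3"],
)
print(X.loc["painting 2", "persons depicted"])  # the element x_ij for row i and column j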

3.2.2 Contingency tables

A different common data format is the two-way contingency table: a table of two categorical variables—a row variable with R row categories and a column variable with C column categories. It is clearly different from a data matrix, which consists of objects × variables. Extensions of the contingency table are three-way and multiway contingency tables involving three or more categorical variables (see also Chapter 4, p. 108). Two examples of contingency tables are:

∗ Eye and hair colour of 5387 children, Caithness, Scotland (Goodman, 1981). Even though the contingency table (Figure 3.1) contains the precise information on the counts, it is hard to understand the relationships between eye and hair colour from the table. On the other hand the graph (Figure 3.2) says it all. Blue and lightly coloured eyes go together with fair and red hair, and medium eye and hair colour go together as well. Dark eye colour especially goes together with dark hair, but also with black hair; however, black hair is rare in Caithness. Noticeable is the (concave) curve from fair hair to black hair; such curves are generally referred to as horseshoes and should be read from one end to the other along the curve. For more horseshoes see Figures 8.5 (p. 184) and 15.2(a) (p. 308).

Fig. 3.1 Two-way contingency table with frequencies in the cells. The frequency in a cell indicates how many children had a particular combination of eye and hair colour; thus there were 326 children with ‘blue eyes’ and ‘fair hair’.

Fig. 3.2 Plot of the eye and hair colour data of 5387 children in Caithness, Scotland. Both eye colour and hair colour lie on a horseshoe curve.

∗ Paintings from four different countries (row variable: Country) made by painters from four different age groups (column variable: Age group). A two-way contingency table of these two categorical variables shows how many painters there were in each country in each age group. If we also score another categorical variable, the topics of the paintings (variable: Topic), we have a three-way contingency table of Country × Age group × Topic. The cells inside the table indicate the number of painters from a particular country and a particular age group who painted a particular topic.

3.2.3 Correlations, covariances, similarities

Yet another data format is a variable × variable or object × object matrix, which may be filled with correlations, covariances, similarities, or dissimilarities (see Figure 1.2). Correlation and covariance matrices are nearly always variable × variable matrices, whose content is the strength of the relationship between the row and column variables, such as the correlation between weight and length of individuals. In a sense the rows of the data matrix, say objects, have disappeared into the correlations and covariances. Thus, a correlation and a covariance are not directly observed, but are statistics constructed or derived from the objects; such statistics are also called derived statistics. Covariance matrices differ from correlation matrices in that they contain the variances of the variables as well as the covariances; a correlation is a standardised covariance. When it is important to model the variances in an analysis, and not only the strength of the relationships, as is, for instance, the case in structural equation modelling, covariance matrices are preferred; see Section 3.8.4 (p. 75).

There is also a fundamental difference between a correlation and a similarity matrix. Similarity matrices are generally filled directly by individual judgments without further calculation, for instance, respondents’ answers to questions about the extent of the similarities between row and column items; such data are mostly called paired-comparison data. Take, for example, the answer of an individual to the question: ‘How similar are jam and honey?’. On a ten-point scale the answer may be that jam and honey have a similarity of 8, but kippers and toast a similarity of 2. These answers are entered directly into the similarity matrix. In common usage, similarity matrices are symmetric, with the same items in the rows and columns: jam is as similar to honey as honey is to jam.
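That a correlation is a standardised covariance can be checked directly; a sketch with two simulated variables:

import numpy as np

rng = np.random.default_rng(5)
weight = rng.normal(70, 10, size=100)
length = 100 + 0.9 * weight + rng.normal(0, 8, size=100)  # related to weight

cov = np.cov(weight, length)[0, 1]
r = cov / (weight.std(ddof=1) * length.std(ddof=1))       # covariance standardised by the sds
print(np.isclose(r, np.corrcoef(weight, length)[0, 1]))   # True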

3.2.4 Three-way arrays: Several matrices

Extensions of two-way data are three-way data in which, for instance, the third way contains conditions, time points, etc. Thus, for each condition or time point we may have a matrix of objects by variables. The data can then be arranged in a block, in statistics referred to as a data array. It is assumed that three-way data are fully crossed: each object has a value for each variable under each condition or at each time point. For an expository presentation of other three-way data formats, see Kroonenberg (2008, pp. 38–40). The statistical notation for a data point then becomes x_ijk, with index k for the ‘conditions’ dimension. The size of a data array is I × J × K. The K matrices can be seen as slices of the three-way array, like the slices of a cake (see Figure 3.3).

Fig. 3.3 Data formats. A single data matrix (left) and a three-way data matrix as a sliced data cake (right).

∗ Example of a three-way array: a three-way array consisting of paintings (first way) evaluated on several scales (second way) by a number of judges (third way). In this context a distinction is generally made between the terms way and mode. The term three-way only indicates that we have a box which exists in the three-dimensional space we live in; a four-way array is a four-dimensional box. The word mode refers to the content of a way: here paintings, scales, and judges are modes; it tells us what the data are all about. For the entities which make up a mode, for instance the individual paintings, we will use the word level. Thus each scale is a level of the scales mode, and each judge is a level of the judges mode. In Chapters 10 and 17 we will come across this data format.

∗ Example of a set of correlation or covariance matrices: a set of K correlation or covariance matrices of size J × J, showing the appreciation of J different types of music judged by K different age groups, is a J × J × K three-way data array with types of music and age groups as modes. Moreover, the original data are not necessarily fully crossed, because each correlation or covariance matrix may be constructed from the answers of different people in each age group.
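In code, a three-way array is simply a box of numbers with three indices; a sketch with fictitious dimensions for the paintings × scales × judges example:

import numpy as np

I, J, K = 12, 10, 5       # paintings x scales x judges
X = np.zeros((I, J, K))   # a three-way data array with elements x_ijk

slice_k = X[:, :, 2]      # the I x J matrix ('slice') for the third judge
print(slice_k.shape)      # (12, 10)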

3.2.5 Meaning of numbers in a matrix A good start of any data analysis is to make sure you understand what the numbers in a matrix or table represent and how they were acquired. Even though we may think that this should be obvious, it occurs occasionally that a careful inspection leads to surprising results.

3.3 Chapter example Throughout this chapter a simplified and slightly adapted version of the Marian Art dataset from Chapter 9 will be used to clarify various methods. The data were collected by Polzella et al. (1998) who were interested in students’ appreciation of pre- and post-Renaissance paintings of the Virgin Mary.

∗ The researchers asked individuals to indicate on ten bipolar rating scales where they considered the paintings to lie between two opposite adjectives, for instance, between ‘very ugly’ (1) and ‘very beautiful’ (7) with a neutral evaluation in the middle (4). This is an example of a bipolar semantic differential scale, in which the scale ends are antonyms; see also Section 17.3 (p. 329). For a further discussion of rating scales, see Section 1.5.3 (p. 14). In Table 9.1 (p. 189) all scales used in the study are listed.

∗ For this chapter I have added a fictitious variable indicating how much an individual liked the painting overall, on a scale running from 0 to 10 (like school grades), and we assume that this variable has an interval measurement level (see Section 1.5.2, p. 11, for types of measurement levels).

3.4 Designs, statistical models, and techniques In the literature the terms design, model and technique are not always used consistently. It is in fact difficult to be completely consistent in this respect.

3.4.1 Data design

The word ‘design’ mostly refers to research designs, i.e. the way the research project is designed and will be carried out, including the way in which the data will be collected, for instance via an experiment or a survey, and whether the data are longitudinal or cross-sectional. On the other hand, we can also use the term in a more limited way, and define the design of a study by specifying the research questions, how these questions are related, etc. In this book we will primarily consider the latter approach and look primarily at the data resulting from a study. Wherever relevant, however, we will be more specific, i.e. we will look at data formats, data types, and the measurement levels of the variables. Thus we specify the characteristics of the data to have a basis for a discussion of appropriate analysis techniques for the dataset at hand.

∗ With respect to the Marian example, the dataset is a data matrix of the shape individuals × variables, with ten bipolar interval rating scales and one numeric variable. To be able to treat the interval scales as numeric variables we first subtracted the mean for each scale; see Interval level in Section 1.5.2, p. 11.

To properly appreciate any dataset it is, of course, necessary to know how the data were collected, what the research questions were, what kinds of assumptions may be made about the data, etc. However, the methodology of designing a study so that the research questions can be best answered, and the question of how the data can be collected, are not part of the content of this book. A look at the initial chapters of Van Peer et al. (2012) will be instructive.

Fig. 3.4 Marian Art: Normal distribution as a model. The observed frequency distribution of the sum of four evaluative bipolar scales (histogram) and the best-fitting normal distribution with the same mean and standard deviation as the observed distribution (solid line).

3.4.2 Model Statistical models in research come in all shapes and sizes. Characteristics of models are mostly expressed through mathematical formulas, and their parameters determine the detailed form of the models. For instance, the mean and the standard deviation are the parameters of a normal distribution determining its exact shape.

Normal distribution as a model

The normal, or Gaussian, distribution is often used as a hypothetical statistical model for the population from which the observed scores are drawn. Figure 3.4 shows both the distribution of the observed scores from a sample and the assumed normal distribution of such scores in the population from which they were drawn. It should be obvious that the sample distribution is an approximation of the population distribution. The overall shape of the normal distribution is described by a mathematical formula: one peak, values symmetric around the mean, and the mean equal to the median and the mode. The precise shape depends on the parameter values of the distribution. In the Marian art example the mean is 15.6 and the standard deviation is 3.1, both calculated from the sample.
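For reference, the formula in question is the normal density with mean μ and standard deviation σ:

\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,
       \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)
\]

so that in the Marian art example the curve is drawn with μ = 15.6 and σ = 3.1.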

Here the observed distribution, the histogram, follows the theoretical normal distribution quite reasonably, although a statistical test might find it a bit too peaked. Problematic fits of observed distributions to the normal distribution are, for instance, shown in Figure 18.7 (p. 360).

The linear multiple regression model
Suppose we want to predict a variable (the response variable) from one or more other variables (the predictor variables or predictors); for instance, we may want to predict appreciation of modern art from intelligence and the number of visits to the Van Gogh Museum in Amsterdam. If there is a linear relation between the variables we use the linear regression model, which is specified by the values of its parameters, the regression coefficients; for details see Chapter 4, p. 118. The values of the regression coefficients, and of parameters in general, are estimated from the observed scores. Using the regression weights and the scores on the predictor variables, we can estimate an individual’s score on the response variable. Much of statistical analysis is aimed at providing the best parameter estimates from the sample information. The estimates for the values of the response variable are called the implied values, i.e. these estimates are implied by the regression model (see Chapter 4, p. 118).
An important aspect of a model, which we will see throughout this chapter, is the model fit between the observed response variable and the implied values based on the model. In regression analysis this fit is measured by the squared correlation between the observed scores on the response variable and the estimated ones based on the regression model: the squared multiple correlation (see also Section 2.5.2, p. 43). A perfect fit would indicate that the predicted outcomes correspond exactly to the observed values of the response variable, but this is rarely, if ever, the case. Of course the fit may not be very good, but a statistical technique will do its best to reach the optimum given the data and the model; with an incorrect model we will not get anywhere near a good fit. Thus, the aim of any research involving models is to construct a model which produces outcomes as close as possible to the observed data. For interpretation it should also be a meaningful model, consisting of variables relevant to the problem at hand. If it is not, we are in trouble; it means that either the substantive theory is wrong or inappropriate, or the data are wrong, or something so far unexplained has taken place.
∗ In the Marian art example we may want to predict to what extent a respondent likes a specific picture. This requires a regression model with ‘liking’ as the response variable, and the descriptive scales as predictors. The prediction is very good if the scores predicted on the basis of the scales are close to the observed scores.
In sum, we try to find or estimate values for the parameters of a relevant model in such a way that the fit of the implied values to the observed data is as good as possible. This idea of model fitting is fundamental to nearly all techniques used in multivariate data analysis. There are some techniques which are not based on a specific model, such as some clustering techniques; this will be discussed whenever necessary.
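As an illustration of estimating coefficients, computing implied values, and assessing the fit, here is a minimal sketch along the lines of the modern-art example; the sample size, data, and coefficients are all fabricated for illustration only.

```python
# A minimal sketch of a linear regression: estimate the coefficients, compute
# the implied values, and the squared multiple correlation (R^2).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100                                     # hypothetical sample size
intelligence = rng.normal(100, 15, n)       # predictor 1 (fabricated)
museum_visits = rng.poisson(3, n)           # predictor 2 (fabricated)
appreciation = 0.05 * intelligence + 0.8 * museum_visits + rng.normal(0, 1, n)

X = np.column_stack([intelligence, museum_visits])
model = LinearRegression().fit(X, appreciation)   # estimates the coefficients
implied = model.predict(X)                        # the implied values
r_squared = model.score(X, appreciation)          # squared multiple correlation
print(model.coef_, round(r_squared, 2))
```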



Fig. 3.5 From research question to statistical analysis. Van den Berg (1992, p. 17).


3.5 From questions to statistical techniques

Van den Berg (1992, p. 17) discusses in great detail how statisticians advise researchers in their choice of analysis techniques. A summary of this process is displayed in Figure 3.5,[1] where we have drawn a red box around the part that forms the main subject of this book. The researcher has the questions and the data (referred to as Research problems) and is looking to solve the Analysis problems, i.e. to find out which techniques are appropriate given the questions and the data. When the researcher comes to the statistician, together they walk through a process as outlined in Figure 3.6. The actual process will, of course, depend on the two partners’ knowledge of statistics and of each other’s disciplines. One should be aware that most statisticians have their hobby horses for particular data. Their choice is, of course, restricted by the problem at hand; screws need a screwdriver rather than a hammer. The next sections will provide a general overview of the statistician’s toolkit, the hammers and screwdrivers it contains, and how these may be employed to the benefit of researchers in the humanities.

[1] Figures 3.5 and 3.6 are reproduced, with permission, from Van den Berg (1992, p. 17 and p. 107).



Fig. 3.6 The statistical advice process. Dependent variable = response variable; independent variable = predictor variable. Source: Van den Berg (1992, p. 107).

3.5.1 Dependence designs versus internal structure designs

One may roughly divide statistical techniques into two categories: dependence designs, which have one or more response variables and one or more predictors, and internal structure designs (also called interdependence designs or internal structure models), which generally concern a single set of variables where it is especially the relationships among them that are of prime interest. Mixtures of the techniques from both types of designs exist as well. In addition, the main emphasis in an analysis can be on the variables, on the cases-objects-subjects, etc., or on both. The word model should be taken as defining a statistical model whose parameters can be estimated by one or more particular statistical techniques.
In the first group of models, the dependence designs, there are two kinds of manifest (or observed) variables: the response variables, also called dependent variables (Y), and the predictor variables or independent variables (X). The italicised terms are preferred because they are less ambiguous.[2] The aim of dependence designs is to predict Y from one or more Xs, and virtually all such models can be collected under the flag of the General Linear Model (glm) (Figure 3.7—left).
In the second group, the internal structure designs, we have only one set of manifest X variables, but there may also be a second set of latent variables (the Fs), which are hypothesised and thus unobserved (Figure 3.7—right). The term ‘latent variables’ indicates that they are variables hypothesised to underlie (or determine) the observed ones. It is assumed that such variables are responsible for the relationships (the structure) between the observed variables. They are often considered to be theoretical constructs or concepts specified by the researcher on the basis of a theory or previous research.

[2] The abbreviations IV for predictor variables and DV for response variables should not be used. Abbreviations are not really helpful for comprehension, and in addition independent variables (IV) are mostly correlated and thus dependent among themselves, confusing comprehension.




In principle their existence is postulated, and empirical data are used to make their existence and relationships plausible, as well as the way they are related to the observed (or manifest) variables.
∗ Latent variables are, for instance, intelligence and its components, such as arithmetic and language ability. They are theoretical constructs assumed to be responsible for the correlations between the items of IQ tests. Specific models have been designed to examine or verify how and to what extent intelligence is responsible for the observed correlations among the items in an intelligence test. The model may typically specify how the latent variables affect the observed items and explain the correlations between items. If two items measure language ability, they are correlated exactly because both are influenced by language ability. Thus, the models specify how arithmetic ability, language ability, and general intelligence influence the observed items. The models may also specify how these latent variables are correlated among themselves (Figure 3.7—right).
∗ Regarding our leading example about the characterisation of paintings of Mary, we may postulate that there is a latent variable structure for the ten bipolar scales, one of which could be an evaluative latent variable. A detailed analysis is given in Chapter 9.

3.5.2 Analysing variables, objects, or both

Most standard models, especially dependence designs, are directed towards the analysis of variables. The objects-individuals-cases, etc. are treated as a random sample from a specific population. An alternative to ‘randomly drawn’ objects is exchangeability, i.e. an object in the sample is as good as any other object in the population. Thus the emphasis is not on the process by which the objects have come into the sample, but on the characteristics they share with the objects in the population. Clearly, randomly drawn objects are exchangeable, but the reverse is not necessarily true.

Fig. 3.7 Simple examples of two model types. General Linear Model (left), Internal structure design (right)



Data-analytic models are sometimes focussed on the objects and their individual differences rather than on the structure of the variables. When the interest centres on individual differences, it is not a question of a random draw from a population, because the individuals themselves and their properties are the subject of scrutiny. Often it is their membership of distinct groups which is the focus of the analysis.
∗ For example, we may want to know whether the evaluations of the Marian paintings depend on the gender of the judges of the paintings. In addition, the researcher may want to know whether the gender differences are specific to particular scales, say the evaluative ones, but not to others. This is typically a question for which both the rows (persons) and the columns (bipolar scales) are of analytic interest. Correlations computed over all judges would be inappropriate, because individual differences between persons disappear in the calculation of the correlations.

3.6 Dependence designs: General linear model—GLM

In this section a number of common variants of the general linear model (glm) will be discussed in some detail. The main purpose is to show how methods with all kinds of different names are basically variants of the general linear model. This will make it easier to select an appropriate statistical analysis for a specific research problem, and to see in the literature whether an appropriate statistical model has been chosen for the problem at hand.

Schematic representation of a GLM
To make the nature of the different models for the dependence designs clear, schematic drawings will be presented that show the relationship between the general dependence design and a specific statistical model. Figure 3.8 represents a template for the more common models of the glm family. There are more complex models in the family which will not be discussed here, especially the more general and complex ‘grandparents’ from which all models discussed here are descendants. The template consists of three parts, each of which expresses more or less the same thing in a different manner. The top part indicates in words which and how many variables are involved, the bottom left-hand part does so in tabular form, and the bottom right-hand figure shows how the variables influence each other in the form of symbols and arrows. Finally, the name of the multivariate variant of the model is listed below. The family members discussed have both response and predictor variables; these variables may be numeric or categorical. There may be one or more predictors, and there may be one or more response variables. For historical reasons members of this family often have different names, and on the surface it is not always clear that they are in fact related.



Fig. 3.8 Template for the General Linear Model-glm in three representations. Bottom right: manner in which the predictor variable(s) (X ) influence the response variables (Y ). Last line in red mentions the multivariate extension of the technique displayed.


3.6.1 The t test

The major purpose of a t test is to investigate whether two independent groups, or a single group measured twice, have equal or different means. In Figure 3.9 we see that a t test is a linear model with one numeric response variable and one categorical predictor variable.
∗ Assessing the difference between mean judgments by men and by women in their overall appreciation of Marian paintings can be done with a t test. Alternatively, we could check whether the respondents have different mean judgments with respect to pre-Renaissance and post-Renaissance paintings.
Application ⇒ Chapter 18: Characterisation of music
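A minimal computational sketch of such a t test, with fabricated scores standing in for the appreciation judgments of men and women, might look as follows.

```python
# Two-independent-samples t test on fabricated appreciation scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
men = rng.normal(15.0, 3.0, 40)     # hypothetical scores for 40 men
women = rng.normal(16.2, 3.0, 49)   # hypothetical scores for 49 women

t, p = stats.ttest_ind(men, women)  # tests equality of the two group means
print(f"t = {t:.2f}, p = {p:.3f}")
```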

3.6.2 Analysis of variance—ANOVA

In the simplest form of analysis of variance (anova), there is one numeric response variable and one multicategory predictor variable (Figure 3.10).



For instance, in which of four countries is music by Ligeti most appreciated? Analysis of variance tests the difference between the mean appreciation in the countries.[3] Multivariate anova has the same setup as basic anova but with more response variables. More complex anovas can have more multicategorical predictor variables, and so may cater for interactions between these predictor variables: multiway anova.[4] Some further complex analysis of variance models are mentioned in Section 3.6.6.
∗ Differences between people from two or more countries can be tested with one-way anova; differences between the means for men and women (way 1) from different countries (way 2), as well as the interaction between gender and country with respect to mean judgments, can be tested with two-way anova. In such a case we speak of a main effect for Gender, a main effect for Country, and an interaction effect for Gender × Country. These three effects are tested separately.
Applications ⇒ Chapter 9: Evaluating Marian art ⇒ Chapter 18: Characterisation of music
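A sketch of the simplest case, a one-way anova comparing mean appreciation of Ligeti's music in four countries, could look as follows; all scores are fabricated.

```python
# One-way anova: do the mean appreciation scores differ between four countries?
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
countries = [rng.normal(mu, 2.0, 30) for mu in (10.0, 11.0, 10.5, 13.0)]

f, p = stats.f_oneway(*countries)   # tests equality of the four country means
print(f"F = {f:.2f}, p = {p:.4f}")
```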

3.6.3 Multiple regression analysis—MRA

A simple multiple regression model (Figure 3.11) has one numeric response variable and one or more numeric predictor variables.[5] Not uncommonly a few binary variables, mostly coded as 0 and 1, are used as numeric predictors as well. As with analysis of variance, the model can be more complicated, for instance in the case of a multivariate multiple regression problem, when there are several response variables and several predictors (see also Section 3.4.2 and Chapter 4, p. 118). The fit of a regression model is indicated by the multiple correlation coefficient.

Fig. 3.9 t test: 1 numeric response, 1 binary categorical predictor

[3] The t test is in fact an analysis of variance technique as well, but with a binary predictor variable.
[4] Within our definitions this would be a multiple anova, but this term is not used in the literature.
[5] This design is sometimes called a convergent causal research design, but this designation is not used in the following.



Fig. 3.10 Analysis of variance (anova). 1 numeric response, 1 or more (here 2) multicategorical predictors. Arrows are main effects; the forked arrow is the interaction.


Fig. 3.11 Multiple Regression: 1 numeric response, more numeric predictors.

∗ Our Marian researcher wants to know whether descriptive judgments of paintings portraying Mary with child, such as beauty, sereneness, shows devotion, shows emotion, and quality of the painting, can catch the essence of a person’s appreciation of Marian paintings.



This can be investigated by a regression analysis with Appreciation as the response variable and the descriptive judgments as predictors. A good prediction of Appreciation will support the assumption that the descriptive judgments together can explain why people appreciate Marian paintings. The analysis may even point to the most influential aspects of these judgments.
Application ⇒ Chapter 16: Reading preferences

3.6.4 Discriminant analysis

A linear discriminant model (Figure 3.12) consists of a binary response variable (values 0 and 1) and one or more numeric predictors. The idea is to find out which predictor variables contribute to the distinction between the two pre-existing groups, and to what extent. As there are only two groups, one discriminant function can be specified, comparable to a regression function. If there are more response categories, more functions can be derived. One of the statistics to assess how well the functions discriminate between the response groups is the canonical correlation, which fulfills the same function as the multiple correlation coefficient in regression. Its square provides the proportion of variation accounted for by each discriminant function; in other words, it is a measure of fit. The other technique commonly used when the response variable is binary is logistic regression, discussed in the next section. Discriminant analysis is used for hypothesis testing as well as for exploratory and descriptive analyses.

Fig. 3.12 Discriminant analysis: 1 categorical response, more numeric predictors



∗ Country differences in Marian paintings. Is it possible to discriminate between mediæval Flemish and Italian Marian paintings on the basis of the descriptive judgments mentioned in the previous section? If yes, which of them are important in this discrimination? In other words, are Marian paintings from the two countries judged to have different characteristics? To investigate this we combine the predictors in an equation (called the discriminant function). The task of this function is to combine the predictors in such a way that the scores of the two countries’ paintings on the discriminant function are as different as possible. When we come across a new painting, we can allocate it to a country on the basis of its score on the discriminant function.
Discriminant analysis becomes more complex if the response variable has more than two categories, because then there are more discriminant functions; in fact, one function less than the number of categories. In the literature various names are used for this model: multiple discriminant analysis, multigroup discriminant analysis, and multivariate discriminant analysis. Within our definition the last name would be most appropriate, as we use the term ‘multivariate’ if there are several response variables, which is the case here.[6]
Applications ⇒ Chapter 10: Craquelure and pictorial stylometry ⇒ Chapter 11: Rock art images across the world ⇒ Chapter 18: Characterisation of music

[6] Unfortunately, the statistical community does not sanction my preference. If we go by Google Scholar: multiple discriminant analysis [19,000 hits], multigroup discriminant analysis [400 hits], and multivariate discriminant analysis [9,000 hits] (checked: 12/8/2019).
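The following sketch mimics the Flemish–Italian example with fabricated descriptive judgments; the group means and the number of scales are invented for illustration.

```python
# Two-group linear discriminant analysis on fabricated descriptive judgments.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
flemish = rng.normal(0.0, 1.0, (40, 5))     # 40 paintings, 5 hypothetical scales
italian = rng.normal(0.6, 1.0, (45, 5))     # 45 paintings, slightly shifted
X = np.vstack([flemish, italian])
y = np.array([0] * 40 + [1] * 45)           # 0 = Flemish, 1 = Italian

lda = LinearDiscriminantAnalysis().fit(X, y)
scores = X @ lda.scalings_                  # paintings' scores on the discriminant function
new_painting = rng.normal(0.3, 1.0, (1, 5))
print(lda.predict(new_painting))            # allocate a new painting to a country
```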

3.6.5 Logistic regression

Logistic regression is similar to an ordinary regression, but the response variable is binary, i.e. it has only two values (0 and 1). We may think of paintings being Genuine (1) versus Fake (0), or books having been Read (1) versus Gathering dust in the bookcase (0).
∗ Melodic music. Suppose we have pieces composed by Fauré (1) and by Stockhausen (0). Our hypothesis is that the most distinguishing feature is their melodic content. To investigate whether this is true, we score all pieces on several aspects of their melodic content. Using these aspects we want to find a regression function which, on the basis of the melodic aspects, predicts whether a piece was composed by Fauré or Stockhausen.
Rather than predicting the composer directly, the logistic regression uses the probabilities of each having composed a particular piece. More precisely, what is predicted is the odds, i.e. the probability that Fauré (1) composed the piece divided by the probability that Stockhausen (0) did.




Even more precisely, it is the log odds or logit (= the logarithm of the odds) which is predicted; the logarithm is primarily chosen for statistical reasons, especially because the regression is nonlinear. In the fictitious example in Figure 3.13 we have scored a collection of musical pieces by Fauré and Stockhausen on their melodic character, on a seven-point scale between the values 0 and 6. We see in the figure that about two-thirds of the pieces with a low melodic character are correctly predicted as composed by Stockhausen, and those with a very high melodic content are correctly predicted as composed by Fauré, but that a piece with a melodic score of around 3 could have been composed by either composer.

Fig. 3.13 Logistic regression—musical example. Response variable: Composer; Predictor variable: Melodic character of a musical piece; Cases: Musical pieces (black dots). The line represents the predicted odds. The higher the odds, the greater the probability that a piece has been composed by Fauré.

The advantage of logistic regression over discriminant analysis with a binary response variable is that the statistical assumptions for the predictors are not as strict, so that more types of variables can be used as predictors. To quote Walsh (1987): ‘No other technique will allow the researcher to analyze the effects of a set of independent variables [i.e., predictors] on a dichotomous dependent variable [i.e., response variable] (or a qualitative polytomous one) with such minimal statistical bias and loss of information’. Peng et al. (2002) review 52 applied papers in higher education for their use of logistic regression. S. J. Press and Wilson (1978, pp. 705–706) conclude that it is unlikely that the two methods will give markedly different results, unless many of the predictors are nonnumeric. An advantage of discriminant analysis is that it is often easier to interpret than logistic regression for those people who are less used to thinking in odds (i.e. who do not bet at horse or greyhound races). A thing to remember is that the odds are 1 (the odds are even) if the probabilities are equal. If the numerator probability is larger than the denominator probability, the odds are greater than 1, and vice versa.



There is also multicategory or multinomial logistic regression, where the response variable has several categories. However, in these cases I prefer discriminant analysis because it is easier to interpret and to graph its results in descriptive settings.[7]
∗ The research situation is similar to that in the previous section for discriminant analysis. We are interested in the differences between men and women with respect to their appreciation of the pictures shown. An example of a logit curve with the odds on the Y-axis can be found in Figure 3.13.
Application ⇒ Chapter 18: Characterisation of music

[7] https://statistics.laerd.com/spss-tutorials/multinomial-logistic-regression-using-spss-statistics.php; accessed 15 June 2020.
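A sketch of the Fauré–Stockhausen example with fabricated melodic scores; exponentiating the predicted logit gives the odds.

```python
# Logistic regression: predict the composer (Fauré = 1, Stockhausen = 0) from a
# fabricated melodic-character score on a 0-6 scale.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
melodic = rng.uniform(0, 6, 60).reshape(-1, 1)
p_faure = 1 / (1 + np.exp(-(melodic.ravel() - 3.0)))   # fabricated 'true' model
composer = rng.binomial(1, p_faure)

model = LogisticRegression().fit(melodic, composer)
log_odds = model.intercept_[0] + model.coef_[0, 0] * 3.0   # predicted logit at score 3
print(f"odds that a piece with melodic score 3 is by Fauré: {np.exp(log_odds):.2f}")
```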

3.6.6 Advanced analysis of variance models

A common aspect of all analysis of variance models is that, notwithstanding the name ‘analysis of variance’, they are really models to compare means. The term is based on the fact that the differences between means are tested via variances. Analysis of variance is actually a cover name for a host of variants of the general linear model with categorical predictors (see Chapter 4, p. 104 for further details):
• Multivariate analysis of variance: several response variables
• Longitudinal analysis of variance or repeated-measures analysis of variance: several measurements on the same response variable(s) over time
• (Multivariate) analysis of covariance: one or more response variables and a mix of categorical and numeric predictor variables
• Random-effects analysis of variance: the categories of the predictor variable are not fixed, but a random sample from possible categories
• Multilevel analysis of variance: measurements where the predictor variables have different levels, which are nested. For example, children are grouped in a classroom (level 1), nested within teacher (level 2), nested within school (level 3).
Our leading example in this chapter is derived from the design used in Chapter 9. The design consists of four, rather than one, different types of paintings, and the same bipolar scales are measured several times. Differences in means are evaluated, and the fact that a bipolar scale is measured four times makes the analysis of the means a repeated-measures design.




Because there are ten scales we have a multivariate repeated-measures design. In this book we will come across only a few of these anova variants in the case study chapters. For a serious introduction to their details we refer the reader to general books on applied multivariate analysis often used in the social and behavioural sciences. Field (2017) contains useful chapters on several variants of analysis of variance, but his examples are somewhat too flippant for my taste. Other comparable books are Tabachnick and Fidell (2013) and Hair et al. (2018, Chapters 6, 7, and 8). These books have gone through several editions and provide assistance with computer software such as spss, sas, and Stata; Field also has an R version.
Applications ⇒ Chapter 9: Evaluating Marian art ⇒ Chapter 18: Characterisation of music

3.6.7 Nonlinear multivariate analysis

As its name suggests, the general linear model only pertains to linear relationships between variables. However, these models can also be used when relationships are nonlinear, by employing optimal scaling of the variables (see Chapter 4, p. 121: Quantification and optimal scaling). Typical examples are principal component analysis and regression analysis, in their optimally scaled forms referred to as categorical principal component analysis (catpca) and categorical regression (CatReg), respectively. The core of this addition to linear modelling is that the variables in the models are transformed in such a way that their relationships become linear. Thus, suppose two variables have a quadratic relationship, i.e. the response variable is predicted by the square of the predictor variable; then one or both may be transformed in such a way that their relationship becomes linear and hence more amenable to linear model analysis, and hopefully easier to interpret. The optimal scaling technique is even more powerful than transforming numeric variables, in that it can assign sensible and relevant numeric values to (i.e. quantify) the labels of truly nominal variables, such as religion, in such a way that these variables also acquire numeric relationships with other variables. In what way this is particularly useful will be shown in Chapter 16. Full technical details can be found in Gifi (1990), but simpler introductions can be found in Meulman and Heiser (2012) and in Linting et al. (2007).
∗ Suppose that our 89 judges of Marian paintings come from four different countries (Country—categorical predictor), and we want to compare Liking of Marian paintings for the four countries (Liking—numeric response). The standard way is to do an analysis of variance on the country means for this variable. But suppose we have another numeric variable for each of our judges, say Religiosity; we could then compute a correlation between Liking and Religiosity, but not with Country, because it is a truly categorical variable.



Optimal scaling can help to sort out which numeric values to give to each of the Country categories, such that the correlation with each of the other two variables is as high as possible. When these correlations have been sorted out we can create a plot that contains the relationships between all three variables. Optimal scaling is also especially useful for internal structure designs such as principal component analysis (pca), as will be discussed in several chapters.
Applications ⇒ Chapter 7: Agricultural development on Java; nonlinear pca ⇒ Chapter 7: Agricultural development on Java; categorical pca ⇒ Chapter 11: Rock art across the world; binary pca ⇒ Chapter 14: Accentual prose rhythm in mediæval Latin; ordinal pca ⇒ Chapter 16: Reading preferences; binary pca
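For the one-predictor case the optimal nominal quantification has a simple closed form: replacing each Country label by the mean Liking within that category maximises the correlation between the quantified Country and Liking. The sketch below shows this special case with fabricated data; a full catpca or CatReg optimises all quantifications jointly.

```python
# Nominal quantification in the one-predictor special case: category means.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
country = rng.choice(list("ABCD"), 89)                 # fabricated judges
effect = pd.Series({"A": -2.0, "B": 0.0, "C": 1.0, "D": 3.0})
liking = effect[country].to_numpy() + rng.normal(15, 2, 89)

df = pd.DataFrame({"country": country, "liking": liking})
quantified = df.groupby("country")["liking"].transform("mean")
r = np.corrcoef(quantified, df["liking"])[0, 1]        # equals the correlation ratio eta
print(df.groupby("country")["liking"].mean())          # the category quantifications
print(f"correlation of quantified Country with Liking: {r:.2f}")
```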

3.7 Internal structure designs: General description

Internal structure designs are analysed by means of models in which the structure among the entities is of prime interest. These entities can be (1) the individuals, objects, or cases, often making up the rows of a data matrix; (2) the variables or scales, often making up the columns of a data matrix; or (3) the relationships between rows and columns treated together. Apart from standard data matrices, correlation and covariance matrices are used to group variables without paying specific attention to individual respondents. Research designs that generate similarity and dissimilarity matrices also belong to the category of internal structure designs. The values in such matrices indicate the similarity, dissimilarity, or distance between the row objects and column objects.

3.8 Internal structure designs: Variables

If the variables are the focus of the analysis, the researcher is mostly looking for insight into the conceptual domains underlying the variables. For instance, the items of an intelligence test are related because they all measure some aspect of the domain of intelligence. The general concept of intelligence structures the relations (correlations) between the items, and so the models to be used should be designed to describe such conceptual domains. In some types of models the domains from which the variables are supposed to come are known beforehand (confirmatory models, e.g. known aspects of intelligence); in others they are not (descriptive, data-analytic, or exploratory models, e.g. which aspects of compositions distinguish composers; see Chapter 18). For these two designs different types of models are used, although they are sometimes used together. For instance, a succession of exploratory models can be investigated to find a proper confirmatory one, or vice versa; for an example see Kroonenberg and Lewis (1982).



Internal structure designs focussing on variables almost invariably require a theoretical concept, a latent variable, for example Intelligence. Latent variables are assumed to underlie the manifest or observed variables (for instance, the items of a test). The (cor)relations of the observed variables are assumed to be due to these latent variables (see the right-hand panel of Figure 3.7). Models with latent variables are used for both confirmatory and exploratory analyses. For the former, theoretical latent variables are postulated (factors); for the latter, latent variables are searched for during the analysis (components). For instance, the items of an intelligence test are assumed to be correlated because they all measure some aspect of the latent variable Intelligence, be it verbal or mathematical intelligence or intelligence itself. The terms factor and latent variable both refer to some underlying theoretical concept on which the subjects differ, but which cannot be measured directly.[8]
Several latent variable models can be seen as an extension of the general linear model. In particular, we may introduce the latent variables as (conceptual) predictors and the manifest variables as responses. Such models are sometimes referred to as Latent Linear Models (llm). In fact this idea can be taken even further by specifying correlations between some or all latent variables (see the right-hand panel of Figure 3.7, p. 55).

[8] In statistics the word factor is used in two different ways: in dependence designs as a categorical predictor in analysis of variance, and in internal structure designs as a latent variable. From the context it will be clear which meaning applies.

3.8.1 Principal component analysis—PCA

As an internal structure method, principal component analysis[9] is one of the most versatile techniques. Its main purpose is to make it easier to understand the patterns and relationships among a set of variables without postulating these relationships beforehand, as would be the case for (confirmatory) factor analysis (see below). In fact, principal component analysis is not one method but a family of methods, whose family membership depends primarily on measurement level. Table 3.1 shows its genealogy. Categorical principal component analysis (catpca) is clearly a Jack of all trades. In those cases when categorical principal component analysis is used instead of the standard variant, proper measurement levels have to be specified. Some techniques in the table have several different names in the literature; multiple correspondence analysis (mca), for instance, is also known as ‘homogeneity analysis’; see Gifi (1990).

[9] Note the difference between principle and principal; a Google search reported 750,000 occurrences of the incorrect ‘principle component analysis’! There are two slightly different correct ways of writing the technique: principal component analysis and principal components analysis. According to Google the former is about 5 × as common as the latter. Search carried out: 15/1/2019.




Table 3.1 Principal component analysis—the family

Measurement level:  Metric          Metric          Ordinal       Nominal       Binary              Mixed
                    ratio/interval  ratio/interval                              (incl. profiles,
                                                                                e.g. (101100))
Association type:   Linear          Nonlinear       Nonlinear     Nonlinear     Nonlinear           Nonlinear
Standard:           pca
Nonlinear methods:                  catpca          Ordinal pca   mca           mca                 catpca
                                                    catpca        catpca        catpca

mca = multiple correspondence analysis

Basics of PCA
As explained above, a basic data format in data analysis is a matrix of objects by variables. It is easiest to gain insight into the relationships among the variables if we can compact the total variance of the variables into a set of new variables, called components. It then becomes possible to construct graphs that show the relationships, because most of the variance in the dataset will be concentrated in a limited number of components; usually two or three, rarely more than four, components will suffice to show these relationships. Principal component analysis is the ideal technique for this compacting (see Figure 3.14).

Fig. 3.14 An internal structure design. Principal component analysis. The Cs are principal components and the Xs are observed correlated variables

Jolliffe (2002, p. 1) formulated this as follows in the opening paragraph of his monograph on principal component analysis: The central idea of principal component analysis (pca) is to reduce the dimensionality of a dataset consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the dataset. This is achieved by transforming to a new set of variables the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables.



The word dimension here refers to the size of the (Euclidean) space, with straight-line distances, in which the variables live. A two-dimensional space is a plane, a three-dimensional space is what we ourselves live in, etc. We cannot readily imagine what an even higher-dimensional space looks like or how to represent it on paper. This, of course, hampers our understanding of relationships between variables in such a space. Because components are uncorrelated, their number indicates the dimensionality of the space they ‘live’ in; technically one says that the components span the space.
In principle, all principal components are needed to account for the variance of all variables. However, the first component is the one that collects most of the variance in the data, and the subsequent components sequentially do the same with what is left of the variance. Thus, most of the variance is concentrated in the first few components and there is very little left for the later components. This explains why we seldom see studies with four or more components; most studies need far fewer components to represent the variability in a dataset. In addition, the components are always the same and account for the same amount of variation, irrespective of whether two, three, or more components are extracted from a specific dataset. This property is called the nesting of components. Thus, the first two components from a five-component solution are the same as those from a two-component solution, and they account for the same amount of variance in both solutions.
In Figure 3.15 the process from two original variables (X1 and X2) to principal components (pc1 and pc2) is illustrated by graphs A to D. The original variables are first depicted in a scatterplot, to show how they are related (A). Next the variables are centred, i.e. the means are subtracted from the scores; the centred variables are Z1 and Z2 (B). The principal components lie along the two main axes of the ellipse (C). The components are rotated such that they become the coordinate axes (D).
Statistical programs generally first centre the variables. The centred scores of a variable tell us how far above or below the mean the scores of a person are. By dividing the centred variables by their standard deviations as well, they are converted to standard scores. As components are sums of (weighted) observed variables, it must make sense to add them. For instance, weight and length cannot be added as one is in kilogrammes and the other in centimetres, but we can add their standard scores. In practice this means that correlations rather than covariances are analysed.
For a standard principal component analysis the assumption is, just as for factor analysis (see below), that linear correlations adequately represent the linear association between each pair of variables. In other words, there should not be serious nonlinear relationships between the variables. If there are, we have to do something before the analysis proper to satisfy the linear relationships between the variables (see Section 3.6.7, p. 64 and Section 3.8.2, p. 71). If in doubt, we may inspect the relationships via scatterplots; for a limited number of variables a scatterplot matrix can fulfill this function effectively (see Section 2.2.4, p. 33).



Fig. 3.15 Principal component analysis of two variables. (A) Scatterplot of two variables (X1, X2), original measurements; (B) Original variables are centred (Z1, Z2)—this centres the data at the origin; (C) Principal components (pc1, pc2) added along the main directions of variability of the data; (D) Principal components as coordinate axes via an orthogonal rotation.

Principal components are also used if a number of predictors are closely correlated: then there is multicollinearity. Because regression does not work very well in such situations, we first perform a principal component analysis on the predictor variables, and then use the resulting components for prediction instead of the predictors themselves; for an example, see Section 18.5.3, p. 360.
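A minimal sketch of these steps, standardising, extracting components, and obtaining scores and loadings, on fabricated rating-scale data:

```python
# Principal component analysis on standardised, fabricated rating-scale data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
latent = rng.normal(size=(89, 2))                    # two hypothetical latent traits
X = latent @ rng.normal(size=(2, 10)) + rng.normal(0, 0.5, (89, 10))

Z = StandardScaler().fit_transform(X)                # centre and standardise
pca = PCA(n_components=2).fit(Z)
scores = pca.transform(Z)                            # component scores
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)  # approximate variable-component correlations
print(pca.explained_variance_ratio_)                 # proportion of variance per component
```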

Plots in principal component analysis
Components are plotted against each other in a scatterplot to assess the structure of the dataset. Because, as indicated above, the first components collect as much of the total variance of the dataset as possible, the plot of the first two components usually gives a good impression of the structure of the relationships between the respondents and the variables. There are three types of plots in common use (see Figure 3.16, p. 70). In these plots the ‘respondents’ are the paintings with cracks presented in Chapter 10, and the variables are rating scales defining the types of cracks.

Fig. 3.16 Plots to display the results of a principal component analysis. Component scores—left; variable-component correlations or component loadings—middle; biplot of scores and loadings—right.




◦ Component score plot—left: The plot of the respondents’ scores (here, painting scores) on the newly constructed variables, i.e. the components. The plot is analogous to a scatterplot showing the respondents’ values on the observed variables (here, rating scales), but the orthogonal coordinate axes are the components. It is useful in such plots to indicate the percentage of the total variance accounted for by both components, as it shows how much of the total variance is represented by the two components. Component scores are standardised: a component has zero mean and standard deviation 1, and the components are orthogonal, i.e. perpendicular.
◦ Component loading plot—middle: The plot of the correlations between the observed variables (here, rating scales) and each of the components—r1 and r2. Again the components are the coordinate axes. The size of the correlation of a variable with each component is its coordinate. The variable-component correlations are commonly referred to as component loadings. The variable points in this plot are usually displayed by an arrow from the origin (0,0) to the variable point (r1, r2). The angle between two arrows, i.e. two variables, represents an approximation to the correlation between the variables. Component loadings are usually not standardised.
◦ Biplot—right: This plot, a biplot, is constructed from the two plots above by aligning the origins and the coordinate axes. The coordinate axes are, of course, the components, as they are the same in the two previous plots; here they explain 41% and 19% of the variance, respectively. How to evaluate the paintings with respect to the rating scales in the biplot is discussed in Chapter 4, p. 106.
In nearly all disciplines we can find examples of the use of principal component analysis. The technique is especially useful if researchers want to gain insight into how the variables in a study hang together and to what extent the objects differ from each other. Two more examples:
∗ Suppose we would like to have a measure for overall Size, on the basis of measures for weight and height of a sample of vases found during an archeological excavation. The idea is that because Weight and Height are correlated, we can combine these two variables via a component analysis, and use the first component as an indication of Size. This component should be capable of explaining a large part of the physical differences between vases, but of course not completely; there are, after all, small heavy vases and tall light ones. Note that we have to standardise weight and height first, because combining kilos and centimetres is not a viable proposition.



∗ As an example of an internal structure design, let us consider an intelligence test for children consisting of 20 items. These items were not randomly chosen, but carefully constructed to measure two concepts: verbal comprehension and numeric aptitude. However, there is doubt whether the items are pure measures of these concepts, and whether power of concentration also plays a role when children are scoring the items. Furthermore, there is some doubt whether the items really are pure measures of a single concept. A numeric item could have been presented within a little story, in which case the child might need to call on both verbal and numeric abilities to solve the problem. What is required here is to analyse the 20-item test in its entirety. With a principal component analysis we can see the size of the correlation between each item and the components: a high correlation with only one component indicates a pure item; sizeable correlations with both components show that the children needed both components or latent variables to get to the right answer.
Applications ⇒ Chapter 9: Evaluating Marian art ⇒ Chapter 12: Public views on deaccessioning ⇒ Chapter 13: The Royal Book of Oz: Baum or Thompson? ⇒ Chapter 18: Characterisation of music

3.8.2 Categorical principal component analysis—CatPCA

Because in humanities research many variables can have various measurement levels, and/or have nonlinear relationships with other variables in the study, categorical principal component analysis (catpca), a variant of standard principal component analysis, can be an extremely useful tool. A numeric example can be found in Chapter 7, with a mini example in Section 7.5.1.
We have indicated that in standard principal component analysis the first component is the linear combination of the numeric variables which collects as much variance as possible from the original variables. In categorical component analysis the principle of collecting the most variance is still true for the first component, but the set of components is not necessarily nested; see Section 3.8.1 for an explanation of nesting and of the order of components. The first component also has the highest average correlation with all the original variables. However, standard pca requires the numeric variables to have linear relations with one another, as the technique is based on ordinary (Pearson) correlations.



But who says that the world consists only of linear relations? Frequently, associations between variables rise quickly at first but slow down later, such as the Income—School readiness relationship; see the left-hand graph of Figure 3.17(a). On the other hand, in many situations a linear relationship serves as a good first approximation to the actual, more complex one.
Although many datasets consist of variables with different measurement levels, we still want to investigate the relationships between all variables simultaneously; in addition, we want to deal with nonlinear relationships between the numeric variables. Thus we would like to deal with both these problems in a single analysis. This means that we have to use an adapted version of the standard principal component technique, called categorical principal component analysis (catpca). Because we want to treat all variables on an equal footing, and to use all the useful properties of standard principal component analysis, we have to transform all variables by assigning numeric values to the categorical variables. All variables can then be treated numerically by the pca, which seems like magic. However, the original measurement levels of the variables restrict which numeric values can be assigned. At the same time nonlinear relationships can be accommodated as well. The whole process takes place via optimal scaling (see also Chapter 4, p. 121) and is called quantification. An example of the linearisation process is illustrated in Figure 3.17. As mentioned earlier, the transformations of the variables should be taken into account when interpreting the outcomes. There is no free lunch. In the final paragraph of this section I will come back to the arguments for the acceptability of this process; it is easier to do this after further details have been explained.
To give an impression of what linearisation of a relationship entails, let us return to the example of the relationship between Family income and School readiness. The left-hand panel of Figure 3.17 shows what their relationship really looks like; money buys education. The richer the family, the more ready their young children are for school. However, it is not a straightforward relationship: initially more income gives a quick improvement in School readiness, but gradually more and more money has to be invested to increase children’s school readiness further. The blue logarithmic curve in Figure 3.17 summarises this nonlinear relationship. Clearly, a linear correlation will not do the best job of describing the relationship. However, if we first transform Family income via a categorical principal component analysis (catpca), the relation becomes reasonably linear, as in the right-hand panel of Figure 3.17.
Applications ⇒ Chapter 7: Agricultural development on Java; nonlinear pca ⇒ Chapter 7: Agricultural development on Java; categorical pca ⇒ Chapter 14: Accentual prose rhythm in mediæval Latin; ordinal pca ⇒ Chapter 11: Rock art images across the world; binary pca ⇒ Chapter 16: Reading preferences; binary pca
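The linearising effect of a monotone transformation can be mimicked with a plain log transform (optimal scaling finds such transformations from the data rather than assuming them); the income figures below are fabricated.

```python
# A monotone (log) transform turns an approximately logarithmic relation into
# an approximately linear one, raising the Pearson correlation.
import numpy as np

rng = np.random.default_rng(7)
income = rng.uniform(1, 100, 200)                     # fabricated, in $1000s
readiness = np.log(income) + rng.normal(0, 0.3, 200)  # logarithmic relation

r_raw = np.corrcoef(income, readiness)[0, 1]
r_log = np.corrcoef(np.log(income), readiness)[0, 1]
print(f"r before: {r_raw:.2f}, after log transform: {r_log:.2f}")
```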



Binning numeric variables
Categorical principal component analysis works best if each category of a variable contains a reasonable number of cases; the analysis is rather sensitive to categories with only a few observations. Therefore, the values of numeric variables are often grouped into a limited number of categories in order to avoid categories with only a few cases. Making the appropriate groups is called binning (see Chapter 4, p. 105). Practice has shown that five cases per bin is a preferred lower bound, but unfortunately this is not always feasible. What binning also does is limit the number of outliers, because they will generally be grouped within a bin with other high or low scores, as appropriate.
In sum, during a categorical principal component analysis the values of existing variables are quantified in such a way that they are transformed into amenable categories. In this way the analysis gives a better insight into the relationships between variables with all kinds of different measurement levels and/or nonlinear relationships.
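A sketch of binning with pandas; qcut produces bins with roughly equal numbers of cases, which keeps every category reasonably filled and absorbs outliers into the extreme bins. The data are fabricated.

```python
# Binning a skewed numeric variable into five roughly equal-sized categories.
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
income = pd.Series(rng.lognormal(3, 0.8, 200))   # skewed, with high outliers

bins = pd.qcut(income, q=5, labels=["very low", "low", "middle", "high", "very high"])
print(bins.value_counts())                       # about 40 cases per bin
```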


Fig. 3.17 Effect of an optimal scaling transformation: Income versus School readiness. (a) Original nonlinear relationship between School readiness and Income; approximately logarithmic (blue line). (b) Linear relationship between School readiness and Quantified income after optimal scaling. Technical note: The quantification process also centres the variables. Quantified income only has an ordinal relation to dollars; this has to be taken into consideration for the interpretation.



3.8.3 Factor analysis—FA

In this section we will discuss factor analysis, and explain the differences between factor analysis and principal component analysis. These differences are not always acknowledged and appreciated, but it is useful to keep the two apart, because they correspond to different models with different underlying assumptions, and are sometimes even used for different types of data.

Fig. 3.18 Factor analysis

Factor analysis is sometimes used as a collective name for techniques using latent (i.e. unobserved) variables, or constructed variables such as the components in component analysis. These are supposed in one way or another to account for the relationships between the observed variables. A program suite such as spss indeed collects various models under the name ‘Factor Analysis’, but the default workhorse within this program is principal component analysis. Within the context of this book we will use factor analysis in a more specific sense, i.e. as shown in Figure 3.18. This model is commonly referred to as confirmatory factor analysis. The term ‘confirmatory’ indicates that an a priori, clearly defined model is specified based on theoretical knowledge or earlier empirical studies. Characteristically there are one or more, possibly correlated, latent variables (the factors) which function as predictors for the manifest variables, but typically not all latent variables predict all manifest variables. Moreover, none or only some of the errors of prediction are correlated among themselves (e3 and e4 in Figure 3.18). These concepts do not play a role in principal component analysis (see Figure 3.14).
Confirmatory factor analysis is used extensively in psychology, especially because of the necessary validation of constructed questionnaires. This makes the technique also relevant for the humanities, because such topics as art appreciation (Chapter 9), attitudes towards the deaccessioning of museum collections (Chapter 12), and reading preferences (Chapter 16) are typically assessed via questionnaires. Another common use in the behavioural sciences is model testing in connection with theory development about human behaviour. In other words, specific models are available, and these have to be tested and evaluated against both developed theory and against each other.



Baxter (1994, pp. 85–90) gives a fascinating, primarily mathematical and statistical overview of the controversy about the use of principal component analysis and factor analysis in archeology. He quotes several papers on the differences between, and misconceptions about, the techniques. His central point is that factor analysis requires an explicit detailed model, and the statistical techniques to carry out the analyses require strong assumptions. If we can fulfill these assumptions and have a solid model, factor analysis is a suitable technique, but without them principal component analysis seems more appropriate, especially for exploration.
∗ An intelligence test generally consists of a fair number of items. These items are not randomly chosen, but have been carefully constructed to measure specific concepts, often verbal intelligence and numeric ability. Although several numeric items are formulated in words, we do not want the answers to these items to be influenced by verbal intelligence, as that would defeat the purpose of measuring just numeric ability. A confirmatory factor analysis such as that in Figure 3.18 could be used to see whether a numeric item is ‘contaminated’, by including a path from verbal intelligence to the specific numeric item and comparing the fit of such a model with a model that does not include such a path.
Application ⇒ Chapter 12: Public views on deaccessioning
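For a flavour of the estimation, here is an exploratory factor analysis sketch on fabricated item scores. Note that sklearn only offers the exploratory variant; the confirmatory analysis described above, with a prespecified loading pattern, requires dedicated structural equation modelling software.

```python
# Exploratory factor analysis with two factors on fabricated test items.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(9)
verbal = rng.normal(size=100)                    # hypothetical latent abilities
numeric = 0.4 * verbal + rng.normal(size=100)    # the latent variables correlate
items = np.column_stack(
    [verbal + rng.normal(0, 0.6, 100) for _ in range(4)] +
    [numeric + rng.normal(0, 0.6, 100) for _ in range(4)]
)

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(items)
print(fa.components_.T)   # loadings of the 8 items on the two factors
```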

3.8.4 Structural equation modelling—SEM

Confirmatory factor analysis as described above relates a concept or latent variable to a set of observed variables. Given the idea that the correlations of the observed variables are due to latent variables, one could also say that the observed variables together measure the latent variables. Therefore, the confirmatory factor analysis model is also called the measurement model of a structural equation model (as will be explained below). If there are several latent variables, some of which are hypothesised to correlate with or predict other latent variables, the description of these relationships among the latent variables is called the structural model. The measurement model and the structural model together form a structural equation model, often abbreviated as sem. In practice the parameter values of a confirmatory factor analysis model are estimated via specialised computer programs for structural equation modelling, and only rarely within general-purpose statistical packages.
Figure 3.19 shows a comparatively simple structural equation model; the labels are hardly legible, but it is the general structure of the model that is relevant here. There are three sets of observed variables, each of which is predicted by a single latent variable. Not uncommonly a latent variable also predicts the values of one or more variables in another set.



Fig. 3.19 An example of a structural equation model (sem) of unspecified content. The square yellow boxes are the observed variables; the yellow circles are the latent variables; the white circle is a second-order latent variable. The large blue circles are the measurement models, the red box the structural model. The red arrows are not further explained here.

As discussed above, in an intelligence test a ‘text sum’, i.e. a sum embedded in a story, clearly requires both verbal intelligence and mathematical ability, and is thus predicted by two latent variables, which in turn are predicted by a third, i.e. general intelligence. Seeing the figure you could complain that the model is anything but simple, but on the other hand just envisage the task of interpreting the 231 correlations between the 22 items. The four latent variables in the figure together form a structural model in the sense that three of the latent variables are predicted by the fourth.
Note that the error terms or residuals belonging to the latent variables are uncorrelated, as are those of the observed variables. However, at times it is necessary to allow the residuals of the observed variables to be correlated in order to obtain a well-fitting model. Two residuals may have a stronger correlation than can be explained by the latent variables alone. Such local effects can, for example, occur when two questions have a specific parallel phrasing, unlike other questions in the questionnaire.
The aim of the structural equation model is thus to account for all correlations between the observed variables. The quality of the model is assessed by computing the covariances between the variables on the basis of the model equations, and comparing these implied or model-based covariances with the observed ones (see also Section 2.5.2). It is not a perfect fit that is aimed for, but an adequate one. In most cases researchers use a wide variety of quality measures for the evaluation of their models (see Chapter 4, p. 125). In fact, rather than correlations it is common to analyse covariances, but this does not affect the conceptual aspects of the model. Extensive practical information on carrying out analyses with structural equation models can be found in several books, such as one of the versions of Byrne (2013).


∗ In our intelligence test example, we mentioned verbal intelligence and numeric ability as explanatory factors. However, a model in which both these concepts are influenced by a general factor IQ or general intelligence seems an entirely reasonable proposition. It would imply that these factors themselves are correlated and that this correlation is caused by general intelligence. To find support for such a model we can formulate it in terms of a structural equation model, and test whether the correlations based on the observed scores of the items can be satisfactorily approximated, i.e. confirmed, by the correlations based on the hypothesised model.

Application ⇒ Chapter 12: Public views on deaccessioning

3.8.5 Loglinear models

If we want to investigate the association between two categorical variables, the (two-way) loglinear model can be a good choice for analysis. This is a model for the frequencies in a contingency table. The aim is to explain the frequencies in the cells from the frequencies of the row and column totals, as well as from the interactions between the row and column variables. The row and column totals are generally called the marginal row totals and marginal column totals, respectively.10 They are displayed along the right and bottom of a contingency table. There are also loglinear models for three and more categorical variables, called three-way or multiway loglinear models, or just loglinear models without further specification (Table 3.2).

Table 3.2 Example of a contingency table

                                 Categories of column variable
Categories of row variable      1     2     3     4    Marginal row totals
  1                             4    14     8     8          34   (f1•)
  2                            37    48    21    20         126   (f2•)
  3                            20    12     4     8          44   (f3•)
Marginal column totals         61    74    33    36         204   (Grand total N)
                             (f•1) (f•2) (f•3) (f•4)

The last row is a margin of the table; it contains the total frequencies of the columns (the marginal column totals). The last column is the other margin of the table; it contains the total frequencies of the rows (the marginal row totals).

10 Within the present context the word marginal always refers to the column or row at the margin or border of a table. It has nothing to do with ‘minimal’, ‘barely enough’, or ‘hardly important’.


With respect to its structure the multiway loglinear model is comparable to a multiway analysis of variance. There are separate terms (or elements) in the model for the main effect of each categorical variable (also referred to as marginal effects), as well as a term for each interaction. What we test is whether the interactions between the variables in the model contribute significantly to the fit of the model to the data. ‘Data’ should here be understood as the frequencies located in the cells of the multiway contingency table.

To find the appropriate terms in a loglinear model for a four-way contingency table we may use backward elimination. This means that the highest order interaction is first removed from the saturated model containing all interactions, i.e. the six two-way interactions, the four three-way interactions, and the single four-way interaction. The most complex interaction terms of this total model are successively deleted, until deleting the next candidate term deteriorates the fit of the model too much.11

For example, in a study we may have four categorical variables, from which a four-way contingency table can be constructed. To begin with, the saturated model for this table will consist of all possible interactions between the four variables. As a first step to simplify such a complex model the highest order interaction, i.e. the four-way interaction, is removed from the model, provided it does not affect the fit of the model to the data. If this does not lead to a serious misfit of the simplified model compared to the saturated model, it is the turn of the three-way interactions to be deleted. We can remove all of these interactions at the same time or eliminate them one by one. The simplification of the model stops when removing one further interaction leads to a serious misfit. Interpreting the remaining interactions is not necessarily easy, but always easier than interpreting the saturated model, i.e. the model with all terms included.

∗ Suppose we have a sample of 100 children and have measured three categorical variables: Hair colour (blond, red, brown, black), Eye colour (blue, brown, black, green), and Gender (girl, boy). Here Eye colour will be abbreviated as Eye and Hair colour as Hair. A loglinear model for this problem has
– three main effects—one for each variable: Eye, Hair, Gender;
– three two-way interactions—Hair × Eye, Hair × Gender, Eye × Gender; and
– one three-way interaction—Hair × Eye × Gender.
The model containing all possible terms is the saturated model: all interactions are included and together they produce a perfect fit to the data—not interesting to us. If we assume that the relationship between Hair and Eye is the same for boys and girls, the three-way interaction will not contribute to the fit of the model to the data, and we might as well drop the three-way interaction from the model.

11 Backward elimination has its problems and there is quite a bit of opposition against it, especially if it is used to automatically select significant regression and discriminant coefficients. However, as we use it in principle for descriptive and exploratory purposes and model testing, and do this by hand, we can live with it here. https://towardsdatascience.com/stopping-stepwise-why-stepwise-selection-is-bad-and-what-you-should-use-instead-90818b3f52df.


The Gender × Eye interaction and the Gender × Hair interaction might not contribute either, so that the only interaction left is the Hair × Eye interaction. This could, for instance, show that in the sample blond hair goes together with blue eyes, red hair with green eyes, etc. A nonsignificant main effect could indicate that all hair colour categories are equally likely, but that is improbable in most samples.

Application ⇒ Chapter 16: Reading preferences
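The following sketch shows one backward elimination step for a loglinear model, assuming the statsmodels package is available; the cell counts are invented, and the hair/eye/gender example is reduced to two categories per variable to keep the table small.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

# Invented counts for a reduced 2 x 2 x 2 hair x eye x gender table.
cells = pd.DataFrame({
    'hair':   ['blond'] * 4 + ['brown'] * 4,
    'eye':    ['blue', 'blue', 'brown', 'brown'] * 2,
    'gender': ['girl', 'boy'] * 4,
    'count':  [14, 12, 5, 6, 7, 6, 18, 16],
})

# A loglinear model is a Poisson regression for the cell frequencies.
saturated = smf.glm('count ~ hair * eye * gender', data=cells,
                    family=sm.families.Poisson()).fit()
no_threeway = smf.glm('count ~ (hair + eye + gender) ** 2', data=cells,
                      family=sm.families.Poisson()).fit()

# Does dropping the three-way interaction deteriorate the fit? Compare the
# deviances with a chi-squared test (the saturated deviance is ~0).
dev_diff = no_threeway.deviance - saturated.deviance
df_diff = saturated.df_model - no_threeway.df_model
print(f'deviance change {dev_diff:.2f} on {df_diff} df, '
      f'p = {stats.chi2.sf(dev_diff, df_diff):.3f}')
```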

3.9 Internal structure designs: Objects, individuals, cases, etc.

So far we have only looked at variables and tried to analyse their relationships, but it is obvious that the objects, individuals, etc. in a study may not be a random sample from a single population. It is even likely that the sampling has been done from different explicitly targeted groups: boys and girls in the previous section. If cases from different groups are the focus of the analysis, the models should include group parameters, as in discriminant analysis. Alternatively, we may have good reasons to believe that groups exist, but we have no clear idea of how many there are and which individuals belong to which group. In that case we need statistical techniques to find the groups and sort out who is most likely to belong to which group. This is typically done by cluster analysis. Furthermore, we may want to get some information on which variables are the discriminating ones. However, it can be that we cannot use discriminant analysis because the grouping of the objects is not known beforehand. Only after groups/clusters have been found will such an analysis be possible.

3.9.1 Similarities and dissimilarities

There are several situations in which researchers want to evaluate the similarities between objects in a numeric way. There are generally two ways to do this. One is to score an object on a number of features, so that a data matrix is created of objects (rows) and features (columns). This matrix can be converted to an objects × objects similarity matrix (see Figure 1.3, p. 15) via one of many similarity measures. For an overview of measures for such derived data, see Borg et al. (2013). For example, we may want to create a similarity matrix of paintings × paintings to see the similarities between Marian paintings from different countries on the basis of their features scored by students of art.

The other way to create similarity matrices is via direct similarity or preference judgments. The research design in that case is to present the students sequentially with all possible pairs of paintings.


The students have to indicate how similar the two paintings are, say on a scale from 0 to 10, where 0 = no similarity at all and 10 = maximum similarity. Alternatively, they are asked to indicate which painting of the pair they would prefer on the wall of their student room. This is the method of paired comparisons.

Because of the technical characteristics of many analysis techniques, similarity matrices are often converted to dissimilarity matrices in order to have properties comparable to distance matrices, i.e. high values indicate that in a two- or three-dimensional plot objects are far apart from each other, and small values indicate that they are close together. In similarity matrices the values on the diagonal from the top-left to the bottom-right have the maximal value (an object is maximally similar to itself), and in dissimilarity matrices the values on the diagonal are zero.
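As a small illustration of the first route (derived similarities), the sketch below converts an invented paintings × features matrix into similarity and dissimilarity matrices, assuming numpy and scipy; the simple matching measure used is only one of the many options mentioned above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Invented data: five paintings scored on four binary features.
features = np.array([[1, 0, 1, 1],
                     [1, 0, 1, 0],
                     [0, 1, 0, 1],
                     [0, 1, 1, 1],
                     [1, 1, 1, 0]])

# 'hamming' gives the proportion of features on which two paintings
# disagree: a dissimilarity matrix with zeros on the diagonal.
dissim = squareform(pdist(features, metric='hamming'))

# The corresponding simple-matching similarity matrix: ones on the
# diagonal, because every painting is maximally similar to itself.
sim = 1.0 - dissim
print(np.round(sim, 2))
```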

3.9.2 Multidimensional scaling—MDS

Multidimensional scaling (mds) is a popular method to examine similarity and dissimilarity matrices (objects × objects) by means of a configuration of points in a low-dimensional space. Thus, mds provides a spatial representation of similarities, akin to the spatial representation of a correlation matrix by principal component analysis. Dissimilarities seem the more natural input, because mds attempts to fit distances to dissimilarities or a function of them. However, it is the design of the study and the elements that need to be compared that mostly determine whether similarities or dissimilarities are collected. The result of the analysis is generally a two- or three-dimensional space in which the objects are displayed in such a way that their Euclidean (straight-line) distances correspond as closely as possible to the dissimilarities (see Chapter 4, p. 113).

∗ mds on painting similarities. An example of the result of a multidimensional scaling analysis on a similarity matrix is shown in Figure 3.20. The connections between points may be disregarded for the moment. The similarity between paintings is based on the types of cracks that have appeared over time (see Chapter 10), but the technique is not used in that chapter. The figure shows the fitted straight-line distances between paintings from four different countries. Paintings from the same country are close together, i.e. very similar, but there are also a number of outlying paintings not located with their compatriots, for example, f497 (French) and i170 (Italian). For a detailed discussion of the data and their analysis see Chapter 10.

Application ⇒ Chapter 5: Bible translations
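A minimal sketch of such an analysis, assuming scikit-learn is available; the dissimilarities here are generated from random points purely for illustration, where a real application would use judged or derived dissimilarities.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Toy dissimilarity matrix: distances between ten random 'objects'.
rng = np.random.default_rng(0)
dissim = squareform(pdist(rng.normal(size=(10, 2))))

# Find a two-dimensional configuration whose Euclidean distances match
# the given dissimilarities as closely as possible.
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(dissim)
print(coords.shape)   # (10, 2): one point per object, ready for plotting
print(mds.stress_)    # badness of fit: lower means a better representation
```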


Fig. 3.20 Multidimensional scaling and cluster analysis. Applied to a similarity matrix of paintings. A centroid is the central point of a group.

3.9.3 Cluster analysis

In multidimensional scaling the internal structure of the similarity matrix is given a spatial representation, but an alternative way to show similarities between objects is by grouping or clustering them in such a way that objects which resemble each other are placed together in a group, and the clusters of objects are as separate as possible. A largely nontechnical discussion of cluster analysis and its relation to multidimensional scaling is given by Kruskal (1977). There is a large collection of clustering techniques available, and different techniques are sensitive to different aspects of the data. In particular, there are techniques that start from a data matrix of objects × variables or features and define their own similarity measure, while other techniques start from existing similarity matrices irrespective of how they were constructed. Furthermore, some techniques require numeric data while others can handle categorical data. For an overview of the plethora of clustering techniques available, see Everitt et al. (2011) and Bezdek (2017).

Kruskal (1977, p. 32) suggests that ‘It has long been an item of folklore among some of us that scaling gives the information contained in the large dissimilarities while clustering gives the information contained in the small dissimilarities’.


He goes on to give reasons why this is often so. A fruitful way to use the different sensitivities to the size of the similarities is to combine the two techniques. In a two-dimensional graph the positions of the points can be obtained by multidimensional scaling. Which points belong to the same cluster can be shown by indicating the area where the points form a group. An example is Figure 3.20, where the main cluster boundaries have been drawn around paintings of a similar style. The clusters look largely, but not completely, country-based.

An alternative way to represent clusters is by hierarchical trees, if of course such trees can be constructed and make sense; see Figure 3.21. The basic process is that the most similar objects are joined together first and are then treated as a single object. The same procedure is repeated: the next closest objects or clusters are joined in the next step, etc. This process continues until all objects are collected into one single cluster. An overview of the amalgamation creates a hierarchical tree. There are several measures to express the concept ‘closest’ and each defines a different clustering procedure. Due to the different measures the clustering can be different in each case. One of the commonly suggested hierarchical procedures is centroid linkage clustering, see Everitt et al. (2011) and Šulc and Řezanková (2019), which is employed several times in this volume (see, for example, Chapter 11, p. 227). It should, however, be noted that according to Everitt et al. (2011, p. 83) ‘[..] no [hierarchical clustering] method can be recommended above all others’.

∗ Clustering paintings. Continuing the painting example, hierarchical clustering was performed on similarities based on the scale scores of the paintings (see again Chapter 10 for details). In the hierarchical tree we see clear country groups, but often some foreign elements have cracks very similar to those in paintings from other countries. There is even a little group of three paintings from different countries which are joined very early on. Italian paintings seem to fall into two groups, and the French stand out most, although they are joined by four Dutch paintings. Incidentally, in Chapter 10 the analysis is more subtle than presented here, but without expert input it is not possible to explain why certain clusters are joined. Country of origin is clearly an insufficient explanation.

Following on from a remark above, one should note that whereas multidimensional scaling and principal component analysis are techniques intended to put as much variability as possible into a low-dimensional space, cluster analysis operates differently. It does not compact the variability, but looks for patterns of closeness in the similarity profiles. It can therefore happen that the clustering takes place in another part of the multidimensional space than with the other two techniques. The results of the clustering may then not be clearly visible in the low-dimensional graphs from the multidimensional scaling and principal component analysis, but this is not often the case (see Kruskal, 1977).

‘It has to be recognized that hierarchical clustering methods may give very different results on the same data and empirical studies are rarely conclusive. What is most clear is that not one method can be recommended above all others [...]’ (Everitt et al., 2011, p. 67). This observation is occasionally also true for other multivariate techniques.


Fig. 3.21 Cluster analysis of cracked paintings. A sideways hierarchical tree.

It depends, among other things, on the assumptions one is prepared to make about the data, and on the match between data and analysis techniques. By the way, it is not a good idea to try as many varieties of a technique as possible until you find the one closest to your preconceived ideas. Such ‘cherry picking’ is not a proper scientific procedure (see also the remarks in Section 11.4.1).

Applications
⇒ Chapter 5: Bible translations
⇒ Chapter 11: Rock art images across the world
⇒ Chapter 13: The Royal Book of Oz: Baum or Thompson?
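A sketch of centroid linkage clustering with scipy; the two invented ‘country’ groups stand in for the painting data, and note that linkage with the centroid method requires Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two invented groups of five objects each, scored on four numeric features.
rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0.0, 1.0, (5, 4)),
                  rng.normal(3.0, 1.0, (5, 4))])

# Centroid linkage: repeatedly merge the two clusters whose centroids
# (group means) are closest in Euclidean distance.
Z = linkage(data, method='centroid')

# Cut the hierarchical tree into two clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw a tree like Fig. 3.21.
```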


3.10 Internal structure designs: Objects and variables

If the data have an internal structure design the emphasis is often not only on the variables or objects, but on both of them at the same time. For numeric data, principal component analysis (Section 3.8.1, p. 66) is a useful tool, also very suited for data reduction. In that situation the component scores and component loadings are treated together and represented in a biplot (Chapter 4, p. 106). For categorical data, an important analysis tool is correspondence analysis. For more than two categorical variables, each with relatively few categories, loglinear modelling (see Section 3.8.5) can be an appropriate alternative choice. In addition, multiple correspondence analysis (mca) is an extremely useful technique for several nominal variables (see Section 3.10.2). For variables with a mix of measurement levels one may use categorical principal component analysis—catpca (see Section 3.8.2). This technique is included in several mainstream program suites, such as spss and sas; for spss details see Meulman and Heiser (2012).

3.10.1 Correspondence analysis: Analysis of tables

Correspondence analysis is a technique for analysing contingency tables that is used widely in a large variety of disciplines, including the humanities. It is primarily aimed at analysing frequency data from two or three categorical variables at a time, but it can be applied to any table with nonnegative numbers. Results are nearly always represented in a biplot portraying the rows and columns of the table. One of its uses is to visualise the relationships between the categories of the variables simultaneously. For an example see Figure 3.22, p. 87. Correspondence analysis is a technique from the data reduction family. An excellent book to get acquainted with the statistical method is Greenacre (2017); it also contains many references for further study. Beh and Lombardo (2014) present overviews of all kinds of variants of correspondence analysis, and also provide all sorts of practical advice and information on handling the technique, especially in R. If there is a dependence structure between rows and columns, i.e. the row categories are predicted from the column categories (or the other way around), NonSymmetric Correspondence Analysis—NSCA is an important variant. It is not treated in this book, but a relatively straightforward explanation can be found in Kroonenberg and Lombardo (1999). A full treatment is given by Beh and Lombardo (2014).

Characteristics of correspondence analysis

Profiles
A two-way contingency table contains three major kinds of information:


1. Row profiles, containing the information on how the total score for each row category is spread out over the column categories;
2. Column profiles, containing the information on how the total score for each column category is spread out over the row categories;
3. Row/column interaction, containing information on how and to what extent row and column categories are associated.

To understand how these profiles are related and can be interpreted, we have to evaluate the row profiles with respect to each other, do the same for the column profiles, and assess their mutual relationships, i.e. the interaction.

Graphing in correspondence analysis
The main use of correspondence analysis is to portray profiles in several plots: row and column profiles separately, and both together in a biplot (see Chapter 4, p. 106). Typically, a biplot contains one type of profile as points and the other as biplot axes (indicated by arrows or lines through the origin and the point for the column or row profile) in a two-dimensional, sometimes three-dimensional space. Which of the two types of profiles are points and which are axes depends on the research question. In the example in Figure 3.22 we want to know which authors write about which topics, so that the most natural arrangement is to have the authors as axes and the topics as points, because it is the authors that are of prime interest.

To construct such plots we need to compute coordinates for the row and column profiles, which are derived via what is called a decomposition of the contingency table. The heart of the computational procedure for correspondence analysis is a mathematical technique called the singular value decomposition, which produces the required coordinates to construct the biplot (for details see Chapter 4, p. 125). Depending on the complexity of the relationships in the table, two or three dimensions for both rows and columns are derived from the singular value decomposition. There is, however, no need to explicitly interpret such dimensions; they function more as the ‘support’ of the space, and do not necessarily represent an intrinsic theoretical construct.

Graphing profiles and interactions requires careful thought. In a plot people are almost automatically inclined to use ordinary straight-line (or Euclidean) distances to assess how close points are in a graph. However, in biplots the evaluation of the associations between the row and column profiles takes place via inner products (see Chapter 4, p. 113). In order to realise this, we have to adjust or scale the coordinates of the points such that the interpretation is valid. Scaling can be realised by representing either the rows or columns as principal coordinates, and the columns or rows as standard coordinates. For details see again Chapter 4, p. 113, and also p. 106. Figure 3.22 illustrates the way the inner products are used for interpretation.
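The following sketch spells out these computations in a bare-bones form, assuming only numpy: standardised residuals from the independence model are decomposed by the singular value decomposition, after which principal and standard coordinates follow by rescaling with the row and column masses. Dedicated packages offer far more options.

```python
import numpy as np

def ca_coordinates(N):
    """Basic correspondence analysis of a two-way table N of nonnegative counts."""
    P = N / N.sum()                     # correspondence matrix
    r = P.sum(axis=1)                   # row masses
    c = P.sum(axis=0)                   # column masses
    # Standardised residuals from the independence model r c'.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    F = (U * sv) / np.sqrt(r)[:, None]  # row principal coordinates
    G = Vt.T / np.sqrt(c)[:, None]      # column standard coordinates
    return F, G, sv

# The small contingency table of Table 3.2.
N = np.array([[4, 14, 8, 8],
              [37, 48, 21, 20],
              [20, 12, 4, 8]])
F, G, sv = ca_coordinates(N)
print(sv**2 / (sv**2).sum())    # proportion of inertia per dimension
print((sv**2).sum() * N.sum())  # total inertia times n equals Pearson's X2
```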

Example
∗ Polemical theological brochures. Figure 3.22 is an example of a biplot from a correspondence analysis.


It shows the graph derived from a contingency table indicating the topics treated in mid-19th-century theological brochures in the Netherlands (authors are red squares and topics are blue points). Biplot axes are drawn through the origin and an author point. For interpretation, the topics are projected onto an author’s axis to indicate the frequency with which the author refers to a topic. Topics with an average frequency project onto the origin, topics frequently mentioned project onto the author’s side of the axis, and topics less frequently mentioned project onto the opposite side of the axis. As examples of projections two thin blue solid lines have been drawn. Thus Lipman (a Jew converted to Catholicism) regularly and with almost equal frequency refers to authority, freedom, research, doubt, and ratio. These topics project almost on the same place on the positive side of Lipman’s axis. He hardly mentions Luther and truth, as can be concluded from their projections on the opposite side of his axis. On the other hand, the ‘modern’ Protestant Zaalberg writes about Luther, and more or less equally frequently about truth and ratio, but not about research and authority. The Jesuit Frentrop is a prolific writer, but does not always write about the same topic, as the diverging positions of his four contributions show. In a biplot it is not the direct distance between an author and a topic, but the position of the projection of a topic on an author axis that determines the interpretation; for technical details see Chapter 4, p. 113. The graph is derived from Smit (2019).

Inertia
In principle all data reduction techniques need to produce information on how much variability is accounted for by the chosen number of dimensions. In correspondence analysis the measure for variability is called inertia, and it is the total Pearson’s X² of a contingency table (also, somewhat inaccurately, called the χ² statistic, see Chapter 4, p. 128) commonly calculated to test for independence between rows and columns. Each dimension takes care of part of the inertia, and the sum of their contributions is equal to the total inertia. The dimensions are calculated in such a way that they account for successively less and less inertia, in the same way as is done in principal component analysis. If the first two dimensions account for a substantial part of the total inertia, and they give a good representation of the profiles of the table, their two-dimensional graph can be interpreted with impunity, without fear of serious misrepresentation. For a further example see Section 6.5.2, p. 152.

Applications
⇒ Chapter 6: Authorship of the Pauline Epistles
⇒ Chapter 8: Graves in the Münsingen-Rain burial site
⇒ Chapter 15: Chronology of Plato’s works


Fig. 3.22 Example of correspondence analysis. Authors and topics in polemical theological brochures.

3.10.2 Multiple correspondence analysis

Multiple correspondence analysis (mca) is an appropriate technique for nominal categorical data, including binary ones (Meulman and Heiser, 2012, p. 55). It is in fact a purely nominal form of categorical principal component analysis, and reduces to correspondence analysis if there are only two nominal variables. As in categorical principal component analysis, multiple correspondence analysis quantifies nominal data by assigning numeric values to objects and categories such that objects within the same category are close together, and objects in different categories are far apart. In a biplot each object is as close as possible to the categories that apply to it. In this way, the categories divide the objects into homogeneous subgroups.

Applications
⇒ Chapter 11: Rock art images across the world
⇒ Chapter 16: Reading preferences
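Because mca is equivalent to a correspondence analysis of the indicator (dummy) matrix, the ca_coordinates sketch given above can be reused; the data below are invented, and pandas is assumed for the dummy coding. Note that different programs scale the resulting coordinates differently.

```python
import pandas as pd

# Invented nominal data: six respondents, two categorical variables.
df = pd.DataFrame({'hair': ['blond', 'brown', 'brown', 'black', 'blond', 'black'],
                   'eye':  ['blue', 'brown', 'blue', 'brown', 'green', 'brown']})

# Indicator matrix: one 0/1 column per category of every variable.
Z = pd.get_dummies(df).to_numpy(dtype=float)

# Reuse ca_coordinates from the correspondence analysis sketch above:
# F holds the object (respondent) scores, G the category points, so that
# objects lie close to the categories that apply to them.
F, G, sv = ca_coordinates(Z)
```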


3.10.3 Principal component analysis for binary variables

Binary variables constitute a special case for principal component analysis, which is why they are treated here separately. In general they can be analysed as if they are numeric, although they have different distributions because they have only two values. Moreover, for binary variables standard principal component analysis, categorical pca, and multiple correspondence analysis are numerically identical, and thus yield the same results. However, the emphasis in the procedures is not always the same and not always geared towards optimal representation of the binary variables. This is due to the different underlying conceptualisations of the techniques; for a discussion, see, for example, Linting and Van der Kooij (2012, pp. 13–14).

As mentioned above, in standard numeric principal component analysis, the components are ordered so that the first few retain most of the variance present in all of the original variables. In addition, the components are always the same and account for the same amount of variance, irrespective of how many components are extracted from a dataset, i.e. the components are nested. It can be shown that for binary variables the solutions with varying numbers of components are also nested, i.e. the first two dimensions of, say, a five-dimensional solution are the same as the first two dimensions of a three-dimensional solution, which is not always the case for other variants of categorical component analyses. Thus, in those methods the content of the earlier dimensions depends on which other dimensions are in the analysis, a somewhat unstable situation.

With respect to the output, for binary variables the discrimination measures, which indicate which variables contribute to the differences between the two groups, are listed directly in the program for multiple correspondence analysis, but in categorical pca we need to square the loadings to obtain them. Summing these squared loadings over the components produces the standard-pca communalities for the binary variables. Communalities are equal to the squared multiple correlation coefficient if the attributes are regressed on the components, and thus they are the same quantities as the discrimination measures. In principle for binary variables there is no objection to equating the three techniques. However, as soon as any of the variables has a different measurement or analysis level, the numeric equality disappears.

Applications
⇒ Chapter 11: Rock art images across the world
⇒ Chapter 16: Reading preferences

3.11 Internal structure designs: Three-way models

In Figure 1.2, p. 12 we presented three three-way datasets as data formats: one for objects × variables × conditions/time-point data, one for stimuli × stimuli × samples data (often referred to as individual differences similarity data), and sets of correlation matrices. Such three-way datasets are the simplest forms of multiway datasets, in which not just three but more types of entities are fully crossed. Such data require sophisticated techniques to handle them properly, and these go under various names, such as the Tucker models, individual differences scaling, parallel factor analysis, and many more. In this book we only present two examples of analysing three-way datasets using one particular generalisation of standard principal component analysis: three-mode principal component analysis (tmpca), developed by Ledyard R Tucker (1964, 1966). This model is now mostly referred to as the Tucker3 model (Kroonenberg, 1983, 2008). Examples of analyses of three-way data presented in this book can be found in Chapter 10 (Craquelure of paintings) and Chapter 17 (The Chopin Preludes). These chapters also demonstrate the power of tmpca for data reduction. A full insight into three-way models and their analysis can be found in Kroonenberg (1983, 2008) and in Smilde, Bro, and Geladi (2004). The latter book is especially useful for chemical analyses in archeology, for instance, carbon-14 dating of artefacts.12

3.11.1 Three-mode principal component analysis—TMPCA

The three-way data format is not always recognised as such, but it is more common than is often realised (see Figure 3.23 for an example). Different measurement levels of the ‘ways’ require different models, and three-way analysis of variance and three-way loglinear models are also analysis methods for three-way data. In Figure 1.2 (p. 12) a few variants, primarily for numeric data, are shown: a fully crossed dataset of subjects × variables × conditions, sets of correlation matrices, sets of covariance matrices, and sets of similarity matrices, all consisting of variables × variables × groups, or objects × objects × groups. The correlation matrices from multivariate longitudinal data have a three-way form as well: variables × variables × time points.

The essence of three-mode principal component analysis is that a separate component matrix is defined for each of the modes. So, in the craquelure study (Chapter 10), there is one component matrix each for paintings (A), types of cracks (B), and judges (C) (see Figure 3.24). To link all components together there is the core array (G), whose values indicate how strongly the components of the three modes are linked; they can be considered as generalised singular values (Figure 3.24).

12 Within the field of multiway analysis the word way is generally used to indicate the dimension of the data array: three-way data form a three-way block, multiway data a hyperblock with multiple sides. The word mode refers to the content of the ways, so that way is the more general term. A set of correlation matrices is a two-mode three-way block. Because this distinction was not initially made by its godfather Ledyard R Tucker, the term ‘three-mode factor analysis’ is often used even if it is applied to sets of correlation matrices.


Fig. 3.23 Three-way data: Experts judging craquelure using rating scales.

Fig. 3.24 Three-mode principal component analysis. The model is applied to a three-way array X = (X1, …, XK). Separate sets of components for Paintings (A), Characteristics of cracks (B), and Judges (C), plus the core array (G) linking the components of the three ways.

Because it is hard to satisfactorily explain three-mode principal component analysis without going into a lot of detail, we will leave this explanation to the chapters in which the analysis is described (Chapters 10 and 17).

∗ In the craquelure study (Chapter 10) the three-way dataset consists of 40 painting categories × 7 crack-type characteristics × 27 individuals. The individuals were requested to judge to what extent the cracks in each of the painting categories have particular forms: circular, perpendicular, smooth, jagged, etc.

Applications
⇒ Chapter 10: Craquelure and pictorial stylometry
⇒ Chapter 17: The Chopin Preludes
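For readers who want to experiment, a Tucker3 decomposition can be sketched with the tensorly package (an assumption on my part; it is not the software used in this book), here applied to a random stand-in for the 40 × 7 × 27 craquelure array.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

# Random stand-in for the 40 x 7 x 27 array of
# painting categories x crack characteristics x judges.
rng = np.random.default_rng(3)
X = tl.tensor(rng.normal(size=(40, 7, 27)))

# Tucker3: component matrices A (40 x 3), B (7 x 2), and C (27 x 2),
# linked by a 3 x 2 x 2 core array G of generalised singular values.
G, (A, B, C) = tucker(X, rank=[3, 2, 2])
print(G.shape, A.shape, B.shape, C.shape)
```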


3.12 Hypothesis testing versus descriptive analysis

In this chapter an overview has been given of several important multivariate models that may be used in the humanities. The emphasis is on descriptive and exploratory models rather than on formal hypothesis testing. Many models such as general linear models have well-developed testing procedures, but in this book they are not emphasised. For researchers to be able to trust the results from an analysis, there is always the question to what extent the assumptions required by selected models can be fulfilled, and how they can be validated. This is true in the physical and social sciences as well as in many areas of the humanities. In Chapter 2 we have discussed and illustrated this aspect of data analysis. Such problems can be tackled, but not always solved, via data inspection and the testing of assumptions. It may also be necessary to employ robust versions of the intended techniques (see Chapter 4, p. 123).

3.13 Model selection

In a research project, model selection refers to two aspects: (a) selecting a class of models suitable for the research question, and (b) choosing the appropriate variant within a class of models.

∗ pca example. After you have decided that a particular variant of principal component analysis is the appropriate method for a research question (decision a), the choice is how many components are necessary or adequate for answering the question in practice (decision b).

Of course, it may be that for answering a research question techniques from two different classes are necessary, for instance, multidimensional scaling and cluster analysis. An important task for researchers is to select or apply appropriate data collection methods for their substantive research questions. This topic lies outside the scope of this book; the initial chapters of Van Peer et al. (2012) will be instructive in this respect. Once the variables have been measured, i.e. the data have been collected, the problem is to find the appropriate statistical models and methods to analyse the data. In this book we will present some assistance with this process in each of the case studies. We have to take into consideration the nature and function of the variables, their measurement levels, and their data format. Furthermore, the research design is important: are we dealing with a dependence design or an internal structure design? Given the design we must sort out which multivariate technique will fit our research question and our data best. To aid model selection given the data and the research questions, I have constructed two charts (Figure 3.25) listing the primary techniques discussed in this book. Not surprisingly, some of the statistical techniques were used in their multivariate rather than their univariate form.


Fig. 3.25 Model selection charts



3.14 Model evaluation

Apart from the selection of an appropriate model class and a technique for estimating the model parameters, there is the question of which model or technique within a class is most appropriate for the interpretation of the data at hand. We may, for instance, decide that principal component analysis is probably the way to go, but within the technique other choices need to be made. Should we use the variant that can handle ordinal categorical data, or does the standard option for numeric data suffice? How many components will be adequate for the data and can still be interpreted? In cluster analysis we will have to choose which of the plethora of clustering techniques is best for the data, and how many clusters will give the best description. In principle such questions and decisions fall outside the scope of this book, although such choices will of course be discussed in the case studies. Several of the questions mentioned need to be answered by examining model fit, i.e. how well do the implied values estimated with the chosen model concur with the observed ones? In nearly all cases our advice about such decisions is to consult a professional statistician who also knows the substantive field. It goes without saying that in such cases the statistician should become part of the project or at least figure as one of the authors of the resulting paper.

3.15 Designing tables and graphs

Exploratory data analysis cannot succeed without meaningful displays, i.e. tables and figures. A common saying is that one picture is worth a thousand words.13 Unfortunately, this is only partially true, as it is always necessary to explain how to read specific pictures and tables. Constructing them is not an easy task either, and making them really well undoubtedly requires content knowledge. Moreover, they must be made in such a way that the message immediately strikes the eye.

3.15.1 How to improve a table

Ehrenberg (1981, p. 67), one of the champions of table presentation, states ‘[The following] few simple rules or guidelines can work wonders in communicating a table of numbers. [..] Using these rules generally produces tables that are easier to read’ (see also Chapter 4, p. 108).

13 For the history of this saying see: https://en.wikipedia.org/wiki/A_picture_is_worth_a_thousand_words.


1. Give marginal averages to provide a visual focus [i.e. add column and row averages at the bottom and to the right of the table, respectively];
2. Order the rows and columns of the table by the marginal averages or some other measure of size (keeping the same order if there are many similar tables);
3. Put figures to be compared into columns rather than rows (with larger numbers on top if possible);
4. Round to two effective digits [effective digits are those digits that make a difference in interpretation if they are not shown]. For example, if in a study a correlation of .3 or .4 makes a difference, but .34 and .36 do not, two digits are necessary, but three or more, as in .34567 and .36123, are distracting;
5. Use layout to guide the eye and facilitate comparisons;
6. Give brief verbal summaries to lead the reader to the main patterns and exceptions.

Of a published table used as an example in his paper, he remarks ‘The numbers are not easy to take in. What are their main features? How can they be summarized? How can we tell someone over the phone?’ This suggests showing your table to other people and asking them to tell you what the table tells them. A sketch of the first, second, and fourth rules is given below.
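The following sketch applies rules 1, 2, and 4 with pandas; the regions × years table is invented for illustration.

```python
import pandas as pd

# An invented regions x years table with too many digits and no ordering.
df = pd.DataFrame({'2018': [23.456, 8.123, 61.981],
                   '2019': [25.111, 7.882, 64.204],
                   '2020': [21.873, 9.046, 59.310]},
                  index=['North', 'South', 'West'])

# Rule 2: order the rows by their marginal averages (largest on top).
df = df.loc[df.mean(axis=1).sort_values(ascending=False).index]

# Rule 1: add marginal averages as a visual focus.
df['Mean'] = df.mean(axis=1)
df.loc['Mean'] = df.mean(axis=0)

# Rule 4: round to two effective digits.
print(df.round(1))
```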

3.15.2 Example of table rearrangement: a binary dataset

Table 3.3, containing binary variables (1 = present; blank = absent), is reproduced here from Chapter 8. The left-hand matrix is an unordered collection of ones and blanks. The rearrangement on the right was based on correspondence analysis (Section 3.10.1), and its effect is dramatic: now there is a clear pattern to be interpreted. Even some deviating rows and columns with unusual patterns are now visible. Note that this effect would be completely lost if zeroes had been used rather than blanks. The construction and interpretation are discussed in Chapter 8 on the content of Celtic graves in Switzerland.

3.15.3 Examples of table rearrangement: contingency tables

The data for the improved presentation of a contingency table are taken from Chapter 14. The table consists of counts of different prose rhythms by Pope Urbanus II (11th century). Table 3.4 is a direct output table of a computer program and is highly problematic for interpretation. Two types of numbers are on alternating rows: frequencies and standardised residuals. In addition, the latter have more digits than is necessary for interpretation. The order of the columns and rows is based on rather weak content arguments, rather than on the numeric values.


Table 3.3 Münsingen-Rain data: 59 graves (rows) and 70 artefacts (columns)

Left: Raw matrix. 1 = artefact is present; blank = it is absent. Right: Ordered matrix based on the first dimension of a correspondence analysis.

The reformatted Table 3.5 is split into two sections, one for the frequencies and one for the standardised residuals, see Chapter 4, p. 108.

Table 3.4 Raw spss cross-table: Urbanus II


Table 3.5 Formatted cross-table: Urbanus II

Counts                         Ultimate word
Penultimate     4p   4pp    3p   5pp    5p   3pp     2   Total
  pp           156    11     3    11     6     6     4     197
  p             21    86    53     7     8     7     2     184
  1              7     6     8     0     4     3     1      29
  Total        184   103    64    18    18    16     7     410

Adjusted standardised residuals
Penultimate     4p   4pp    3p   5pp    5p   3pp     2
  pp             7    −5    −5     1    −1    −1     0
  p             −7     6     5     0     0     0    −1
  1             −2     0     2    −1     2     2     1

Ultimate = last; penultimate = one but last.

The relative importance of the sizes of the frequencies is influenced by the marginal totals, and moreover it is the patterns in the cells that we are interested in. We want to know to what extent the observed frequencies deviate from the situation in which there would be no interaction between the rows and columns, i.e. from the model of independence. If that model were true, all rows would be proportional to each other and the same would hold for the columns. The size of the standardised residuals tells us how far the observed frequencies deviate from the values expected if the model of independence were true. We can rearrange the table on the basis of the residuals to show the pattern of the interaction. After rearrangement, all the residuals with the highest absolute values are in the top-left corner.

The standardised residuals are more or less equal to standard scores from the normal distribution. Thus, absolute values of 2 and lower are what we expect if the model of independence were true. After reformatting, the pattern of the table is much easier to interpret. Bolding the larger residuals now makes quite clear that the (pp,4p), (p,4pp), and (p,3p) combinations occur much more often than expected (positive residuals); the (p,4p), (pp,4pp), and (pp,3p) combinations much less often (negative residuals); and all the other cells are roughly as expected if the model of independence were true, as they have small residuals.

Similar reformatting of correlation and similarity matrices can be so effective in showing the pattern among the correlations or similarities that performing a further multivariate analysis may not even be necessary. Alternatively, it allows us to check whether the analysis is performed as expected. This view echoes a remark by Baxter (2008, p. 973):

[..] you don’t always need complex statistical methods to analyse the data generated. Some analysts when confronted with multivariate data reach for their favourite statistical technique without looking to see if simple methods (for example bivariate graphs) will answer the research question.
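The residual half of Table 3.5 can be reproduced with a few lines of code, assuming scipy: the residuals computed here are the standardised (Pearson) residuals (O − E)/√E, which appear to correspond to the printed values; fully adjusted residuals would divide further by a factor involving the marginal proportions.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from Table 3.5 (rows pp, p, 1; columns 4p, 4pp, 3p, 5pp, 5p, 3pp, 2).
O = np.array([[156, 11,  3, 11, 6, 6, 4],
              [ 21, 86, 53,  7, 8, 7, 2],
              [  7,  6,  8,  0, 4, 3, 1]])

chi2, p, dof, E = chi2_contingency(O)  # E: expected counts under independence
resid = (O - E) / np.sqrt(E)           # standardised (Pearson) residuals
print(np.round(resid))                 # reproduces the residual pattern above
print(round(chi2, 1))                  # Pearson's X2 for the whole table
```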

∗ Table 3.6 is a reorganised similarity table of fourteen Bible translations. Due to the reorganisation it is evident that there are two major sets of similar translations: the English translations and the German translations, both from the same Hebrew-Aramaic text (for details see Chapter 5).


Table 3.6 Grouped similarity matrix of Bible translations

Within the English translations we also see a somewhat separate group: all the British-English translations based on the King James Bible. The German translations show much lower correlations among themselves, pointing to lower dependence among the translations. Incidentally, the systematic, but low, correlations between the English and German translations are probably due to proper names appearing in both. Further multivariate analysis is not necessary to show the presence of the groups of translations mentioned. Note that the similarities are given with just two decimals and without the decimal point, so that a similarity of 3 means 0.03; only the similarity of 1.00 retains its decimal point, to distinguish it from 0.01. Table 5.2 shows what difference it makes whether or not the decimals and the decimal points are included. However, if such a table is made available for others to analyse, the form with full decimals is preferable because of increased analysis accuracy.

3.15.4 How to improve graphs

Presenting guidelines for making effective graphs is much harder than for tables, because there are many different kinds of graphs for different purposes. There are several books and articles about making adequate graphs for statistical reporting, with many examples and references on the subject, such as Grant (2019), Keen (2018), Cook et al. (2016), Chambers et al. (1983), and Wainer (2000), to name but a few. One rather basic warning is that the use of colour may be tricky. Not all publishers are willing to use colour, as printing colour is much more expensive than black and white. This could be a reason to have a black-and-white version of a manuscript for printing, and one in colour for a digital version. To make a black-and-white version of a colour graph is a rather daunting task,


especially in applied multivariate texts, which often rely heavily on graphic presentations of the material—as does this book.

3.16 Software

Because the data analyses presented here are intended merely as illustrations of the way research questions can be tackled statistically by multivariate procedures, there is no great emphasis on how the results presented were acquired, i.e. via commercial software such as spss, sas, and Stata, or via high-quality programs developed and made available by individuals. The statistical programming language R, now almost standard, is not discussed or demonstrated either.14 Some introductions especially written for humanities-oriented researchers are Baxter (1994) and Baxter (2010) [archeology], Arnold and Tilton (2015) [networks, images, text, geospatial data], Lebart et al. (1998) [textual data], and Baayen (2008) [linguistics], but new introductions for specific fields are appearing regularly. For a more extensive list of such books, see the Appendix.

The practical aspects of performing multivariate analyses have here been explicitly ignored in order to be able to concentrate on conceptual, selection-related, and interpretational aspects of the analyses. In a sense the lack of practical guidance in handling statistical software can be considered a serious shortcoming of this book, but I felt that it would be a bridge too far to include software procedures as well. It would take the reader into a totally different domain of endeavour and require more training in the details of multivariate statistics. Moreover, it is easy to write a whole book about the topic, but due to ongoing statistical developments it may be largely out of date by the time the text gets to the reader.

There is much controversy over what is the best way to carry out analyses, and which type of programs to use. Baayen, for instance, states in the introduction of his 2008 book:

Statistical packages tend to be used as a kind of oracle, from which you elicit a verdict as to whether you have one or more significant effects in your data. In order to elicit a response from the oracle, you have to click your way through cascades of menus. After a magic button press, voluminous output tends to be produced that hides the p values, the ultimate goal of the statistical pilgrimage, among lots of other numbers that are completely meaningless to the user, as befits a true oracle (Baayen, 2008, p. x).

Personally I think Baayen goes too far in his criticisms of standard (commercial) software, especially since in this book I have used the approach he mentions as well as non-menu-driven software, depending on what was available to carry out my analyses. Truly advanced and dedicated software seems hard to beat, as it has often been created by the experts themselves.

14 For information on the programming language and collections of statistical software written in R any of several introductory books may be consulted, as well as the R website https://www.r-project.org/.


However, on the question of which types of programs are best for standard applications I think the jury is still out, in particular for aspects such as versatility, availability, quality control, ease of application, and reliability, not necessarily in that order. Furthermore, are p values really the ultimate goal of the statistical pilgrimage? Whether this statement by Baayen is a bit tongue-in-cheek or honestly meant is open to interpretation. That noncommercial software is free seems to me a dubious reason to prefer it. It is the quality and correctness, as well as the ease of use, that count. As a final sobering thought, it seems to me that some software programs would seriously benefit from improvements in output and interface. Personally I would love to have ‘cascades of menu options’ to manipulate output. A related view in this respect can be found in Section 1.3 (pp. 4–7) of the highly recommended book by Baxter (1994).

3.17 Overview of statistics in the case studies

In this section I present a compact overview of the statistical techniques used for the case studies in Chapters 5–18. Minor statistical methods are not mentioned, but the listing below gives a fair overview. Most of the techniques are descriptive, exploratory techniques rather than hypothesis-testing ones. In fact most of the tests mentioned are model tests, not hypothesis tests.

Theology/Bible studies

5. Similarity data: Bible translations
∗ data inspection;
∗ presenting similarity matrices;
∗ multidimensional scaling (mds);
∗ cluster analysis (centroid linkage).

6. Stylometry: Authorship of the Pauline Epistles
∗ contingency tables, row, and column profiles;
∗ plotting of profiles, principal coordinates;
∗ correspondence analysis, inertia, dimensional fit;
∗ biplot of row and column profiles, biplot axes.

History & Archeology

7. Economic history: Agriculture development on Java
∗ measurement level discussion;
∗ data inspection: incomplete data matrices, outliers;
∗ scatterplot matrix;
∗ quantification of nominal variables via means;
∗ categorical principal component analysis (catpca);
∗ biplots, biplot axes;
∗ visual clustering.

8. Seriation: Graves in the Münsingen-Rain burial site
∗ correspondence analysis on binary variables;
∗ rearranging an incidence matrix;
∗ validation.

Arts

9. Complex response data: Evaluating Marian art
∗ three-way data;
∗ rearranging three-way data into a wide data format;
∗ data inspection, boxplots, missing data;
∗ univariate analysis of variance (anova);
∗ multivariate repeated measures (anova);
∗ means, interaction plots/line graphs, effect size;
∗ principal component analysis;
∗ scale construction, Cronbach’s alpha.

10. Rating scales: Craquelure and pictorial stylometry
∗ three-way data;
∗ data inspections, means, mean plots, outliers;
∗ three-mode principal component analysis (tmpca), components;
∗ joint biplots;
∗ discriminant analysis.

11. Pictorial similarities: Rock art images across the world
∗ binary data;
∗ methods for analysing proportions, chi-squared tests;
∗ categorical principal component analysis (catpca);
∗ multiple correspondence analysis;
∗ discrimination measures;
∗ cluster analysis (centroid linkage);
∗ biplot, jittering;
∗ discriminant analysis.

12. Questionnaires: Public views on deaccessioning
∗ data inspection, properties of distributions, confidence interval;
∗ Pearson and Spearman correlations;
∗ confirmatory factor analysis (cfa);
∗ structural equation modelling (sem);
∗ deviance plot;
∗ scale construction, Cronbach’s alpha.

Linguistics

13. Stylometry: The Royal Book of Oz: Baum or Thompson?
∗ data inspection, boxplots;
∗ principal component analysis (pca), communalities;
∗ biplots;
∗ cluster analysis (centroid linkage).

14. Linguistics: Accentual prose rhythm in mediæval Latin
∗ contingency tables, standardised residuals, X² test;
∗ binning;
∗ ordinal principal component analysis;
∗ biplots.

15. Linguistics: Chronology of Plato’s works
∗ correspondence analysis;
∗ biplots.

16. Binary judgments: Reading preferences
∗ binary data;
∗ loglinear analysis;
∗ higher order interactions, backward elimination;
∗ multiple correspondence analysis of profiles, graphs;
∗ supplementary variables.

Music

17. Music appreciation: The Chopin Preludes
∗ three-mode principal component analysis (tmpca);
∗ multivariate two-factorial anova;
∗ joint biplots.

18. Musical stylometry: Characterisation of music
∗ data inspection, boxplots;
∗ comparison of distributions, skewness, kurtosis;
∗ analysis of variance (anova);
∗ discriminant analysis, backward elimination;
∗ logistic regression, backward elimination;
∗ principal component analysis (pca).

Chapter 4

Statistical framework extended

Abstract This chapter deals with more advanced statistical topics, and should be approached as a reference work. The topics are ordered alphabetically rather than by content.

4.1 Contents and Keywords

∗ Analysis of variance (anova) designs p. 104
∗ Binning p. 105
∗ Biplots p. 106
∗ Centroids p. 108
∗ Contingency tables p. 108
∗ Convex hulls p. 110
∗ Deviance plots p. 111
∗ Discriminant analysis p. 112
∗ Distances p. 113
∗ Inner products and projection p. 114
∗ Joint biplots p. 115
∗ Line graphs p. 115
∗ Missing rows and columns p. 117
∗ Multiple regression p. 118
∗ Multivariate, multiple, multigroup, multiset, and multiway p. 120
∗ Quantification, optimal scaling, and measurement levels p. 121
∗ Robustness p. 123
∗ Scaling coordinates p. 124
∗ Singular value decomposition p. 125
∗ Structural equation modelling—sem p. 125
∗ Supplementary points and variables p. 127
∗ Three-mode principal component analysis—tmpca p. 127
∗ X² test (χ² test) p. 128

4.2 Introduction

I have chosen to deal with slightly more advanced statistical topics and extended explanations of topics treated elsewhere in a separate chapter, rather than hiding them inside the case studies or the main statistical chapter. They are presented in alphabetical order, and hopefully still elementary enough to make them accessible for most quantitatively inclined researchers. These topics largely complement those in Chapters 2 and 3, and are not necessarily included in the Glossary as they are explained here. Statistical terms within the sections generally are. Section names without page numbers refer to sections in this chapter.

4.3 Analysis of variance designs

Analysis of variance (anova) is a technique to assess differences between two or more group means, or between mean difference scores (repeated-measurement design). It is always important to realise that the spread of the scores plays a crucial role in an analysis. The same difference between two means can be important if there is little spread around the means, but unimportant if the scores are such that most scores could have come from either distribution. In a hypothesis test of the difference between means the assessment takes place by evaluating variances. In other words, the test takes the spread around the means as a starting point (see also Section 3.10, p. 59).

• Between-subjects differences. Assume we have three groups of persons: experts, apprentices, and amateurs. We ask them how skilfully they think a painting has been executed. We want to assess whether the three groups agree, i.e. whether Mean_expert ≈ Mean_apprentice ≈ Mean_amateur. This is a between-subjects assessment. Equality in the sample is tested against the null hypothesis of equality of means in the population.

• Within-subjects differences. Assume the same persons have to assess whether there is a difference in the levels of skill in the execution of two paintings. First we compute each person’s difference score with respect to their judgment of the skill shown in each painting. Then we calculate the mean difference score. This mean of the differences (Mean_painting1−painting2) is evaluated against the null hypothesis that there is no perceived difference between the two paintings. This is a within-subjects assessment, generally referred to as a repeated-measures design. It can be extended to more time points and/or evaluations on more aspects of the paintings.

In both designs, the null hypothesis is tested via the variability associated with the measurements, i.e. via a form of analysis of variance. In this technique the variance based on the systematic differences in the means is compared with the random variance of the individuals with respect to the means. A sketch of both tests is given below.
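Both assessments can be sketched with scipy on invented judgment data: a one-way F test for the between-subjects design, and a paired t test on the difference scores for the simplest within-subjects design.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_rel

rng = np.random.default_rng(4)

# Between-subjects: invented skill judgments (0-10) from three groups.
experts     = rng.normal(7.5, 1.0, 20)
apprentices = rng.normal(7.0, 1.0, 20)
amateurs    = rng.normal(6.0, 1.5, 20)
F, p_between = f_oneway(experts, apprentices, amateurs)

# Within-subjects: the same 30 persons judge two paintings; the paired
# t test evaluates the mean difference score.
painting1 = rng.normal(6.5, 1.0, 30)
painting2 = painting1 + rng.normal(0.4, 0.8, 30)  # judged slightly higher
t, p_within = ttest_rel(painting1, painting2)

print(f'between groups: F = {F:.2f}, p = {p_between:.3f}')
print(f'within subjects: t = {t:.2f}, p = {p_within:.3f}')
```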


Analysis of variance is actually a cover name for a host of variants of the general linear model with categorical predictors, such as:

∗ Multivariate analysis of variance: several response variables.
∗ Longitudinal analysis of variance or repeated-measures analysis of variance: several measurements on the same response variable(s) over time.
∗ (Multivariate) analysis of covariance: one or more response variables and a mix of categorical and numeric predictor variables.
∗ Random-effects analysis of variance: the categories of the predictor variable are not fixed, but a random sample of the possible categories of the variable.
∗ Multilevel analysis of variance: measurements for which the predictor variables have different, nested levels. For example, children are grouped in a class (level 1), nested within teacher (level 2). Children in the same class have the same teacher; thus, at the teacher level the observations are no longer independent. For details see Snijders and Bosker (2012).
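Returning to the two basic designs described at the start of this section, a minimal numerical sketch in Python may help; the ratings are hypothetical and the use of SciPy is my addition, not part of the case studies:

import numpy as np
from scipy import stats

# hypothetical skill ratings of a painting by three groups of judges
experts     = np.array([7.0, 6.5, 7.2, 6.8, 7.1])
apprentices = np.array([6.0, 6.4, 5.8, 6.1, 6.3])
amateurs    = np.array([5.0, 5.5, 4.8, 5.2, 5.1])

# between-subjects assessment: one-way anova on the three group means
F, p = stats.f_oneway(experts, apprentices, amateurs)
print(F, p)

# within-subjects assessment: the same persons judge two paintings;
# a paired t test evaluates the mean difference score
painting1 = np.array([6.1, 5.9, 6.4, 6.0, 6.2])
painting2 = np.array([5.6, 5.8, 5.9, 5.7, 6.0])
t, p = stats.ttest_rel(painting1, painting2)
print(t, p)

The F test compares the variance based on the group means with the random variance of the individuals around their group means, exactly as described above.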

4.4 Binning

In various datasets variables are counted, but the numbers may vary considerably. For instance, for some variables the values of the objects or individuals can run from 0 to 100 or more, say 200. The distribution of such variables can contain many different values which occur only once, whereas some other values occur much more often. Such variables can have very skewed and very discrete distributions, for instance, if the value zero occurs very often and 200 only once. Imagine a population sample of ordinary people; most are not depressed at all, and only very few are severely depressed. For many analyses such distributions are impossible to handle in their original form, and we have to take measures to make them at least reasonably appropriate for analysis. Of course this means maltreating the data somewhat, but often there is no alternative if we want to understand what is going on.

One option in this case is what is called binning. The distribution is compacted so that a limited number of different values remain, each of which represents an interval of scores. Instead of treating extreme values such as 50, 100, 150, and 200 separately, these scores are merged into a single category which is given the highest ordinal number, i.e. the individual values of the distribution are binned. All binned categories should preferably be reasonably full, and they are subsequently treated as an ordinal variable. By doing so the correlation will not be unduly determined by (extreme) outliers (see Figure 2.1, p. 20). Below we give Figure 4.1 as an example.


Fig. 4.1 Binning a very sparse distribution. The distribution represented in the table and graph on the left and in the middle is binned into four categories, shown as the distribution on the right. The binned categories are given the ordinal values 1, 2, 3, and 4.
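A hedged sketch of binning in Python with hypothetical counts; the bin edges are illustrative choices, not a general recommendation:

import numpy as np

# hypothetical, very skewed counts: many small values, a few extreme ones
x = np.array([0, 0, 0, 1, 1, 2, 3, 5, 8, 20, 50, 200])

# bin edges chosen so that each bin is reasonably full; every value
# of 4 or more ends up in the highest ordinal category
edges = [1, 2, 4]
binned = np.digitize(x, edges) + 1   # ordinal values 1, 2, 3, 4
print(binned)                        # [1 1 1 2 2 3 3 4 4 4 4 4]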

4.5 Biplots

In many of the case studies principal component analysis (pca) has been used for numeric data (Section 3.8.1, p. 66) and correspondence analysis (ca) for contingency tables (Section 3.10.1, p. 84), to unravel the relationships between the quantities in the rows and columns. For numeric data these are mostly individuals or objects for the rows, and variables or scales for the columns. For a contingency table we have two categorical variables, one for the rows and one for the columns. It may be useful to read the sections in Chapter 3 first for an introduction to these techniques. More complete, primarily practical, introductions can be found in Greenacre (2010) and Kroonenberg (1995). One step up in technical level is Gower et al. (2011).

Especially in correspondence analysis particular terms are used for the numbers in a row, a column, and their interactions, respectively:

1. Row profiles, containing the information on how the total score for each row category is spread out over the column categories;
2. Column profiles, containing the information on how the total score for each column category is spread out over the row categories;
3. Row/column interaction, containing information about the interaction between row and column categories.

The heart of principal component analysis and correspondence analysis is formed by two sets of coordinates, one for the rows and one for the columns. If X is the data matrix, the coordinates are derived from the decomposition X = UΛV′. U contains the row coordinates (i.e. the M row components), V the column coordinates (i.e. the M column components), and Λ is the diagonal matrix with λ1, ..., λM on the diagonal. The M λs are the standard deviations of the components. This decomposition is called the singular value decomposition.


Depending on the complexity of the data, we derive via the singular value decomposition two or more components (dimensions) for both U and V. To understand how these components are related and can be interpreted, we have to evaluate the rows with respect to each other, do the same for the columns, and assess their interaction. This takes place via the M orthogonal component matrices U and V, optionally scaled with the λs (for instance, UΛ and VΛ). Generally the components of U have standardised coordinates, i.e. they have zero mean and variance one, and the components of V are in principal coordinates, i.e. they do not necessarily have a zero mean; their variance is equal to λj² for each of the M components.

The results can be portrayed in a biplot, of which there are various kinds; see Greenacre (2010). Such plots display both rows and columns in the same graph, so that the relationships between the rows and columns can be assessed, and depending on how the biplots are made, specific statements about rows and columns separately can be made as well. The interpretation depends on the scaling of the components for rows and columns, i.e. whether they are in standard or in principal coordinates. The main purpose of the biplot is to throw light on the relationships between columns and rows. As also discussed in the entry Distances, in a biplot it is possible to represent the coordinates in three different ways.

1. Principal coordinates; principal coordinate biplot. Both rows and columns are scaled as principal coordinates. However, there is no properly defined way to talk about distances between them, which defeats the purpose of the biplot.

2. Asymmetric coordinates; asymmetric biplot. Either the rows or the columns are scaled as principal coordinates and the other as standard coordinates. Distances between the components scaled as principal coordinates can be interpreted, but no proper distance measure exists for the components of the other set, scaled as standard coordinates. A disadvantage of the asymmetric biplot is that often the standardised levels (rows or columns) cluster close to the origin, whereas the principal levels (columns or rows) are located throughout the plot. The solution is to multiply the standard coordinates by an appropriate number and divide the principal coordinates by the same number.

3. Symmetric coordinates; symmetric biplot. Symmetric coordinates can be defined for both rows and columns by multiplying each standardised component j by √λj.

For symmetric and asymmetric coordinates we can properly inspect the relationships between rows and columns using the inner product or scalar product. How this works in detail is explained in the Distances entry below. There it is also explained why we can only assess the Euclidean distances between rows and columns in principal coordinates, and how in symmetric and asymmetric displays the relationships between the two sets of profiles are calculated and interpreted; for an example, see Figure 3.22, p. 87.

It is not always advisable to present a biplot in a single graph—there may be too many rows, or columns, or both. In such a case the rows and columns get their own graphs, and the reader has to superpose the graphs mentally, aligning the origin and the two coordinate axes.


An alternative is to portray only the major rows and columns together, and to produce the separate plots for rows and columns in full. It can also help to calibrate the axes of the rows or columns in principal coordinates. This is not done or explained in this book, but it is fully explained in Greenacre (2017, pp. 107–109). The main idea is that a biplot should make the relationships between columns and rows transparent. Several attempts to make biplots even more insightful are described in Blasius et al. (2009).
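To make the construction concrete, here is a minimal sketch in Python of an asymmetric biplot, with randomly generated data; the scaling follows the asymmetric option described above, but the example itself is mine, not one of the case studies:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
X -= X.mean(axis=0)                    # centre the variables

U, lam, Vt = np.linalg.svd(X, full_matrices=False)
rows = U[:, :2]                        # standard coordinates for the rows
cols = (Vt.T * lam)[:, :2]             # principal coordinates for the columns

fig, ax = plt.subplots()
ax.scatter(rows[:, 0], rows[:, 1], s=10)
for j in range(cols.shape[0]):
    ax.arrow(0, 0, cols[j, 0], cols[j, 1], head_width=0.05)
    ax.annotate(f'var {j + 1}', (cols[j, 0], cols[j, 1]))
plt.show()

With real data the two clouds may have to be rescaled by a common factor, as noted above, to keep both sets readable in one plot.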

4.6 Centroids

In, say, a two-dimensional plot a group centroid is located in the middle of the group members. Its position in the plot can be found by averaging the coordinates of the group members for each dimension; the centroid's coordinates can then be read off from its projections onto the axes. Figure 4.2 shows the administrative residencies on Java and their region centroids; see Chapter 7 for an analysis of agricultural development on Java.

Fig. 4.2 The residence centroids indicate regions on Java.
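A centroid is simply the per-dimension mean of the group members' coordinates; a tiny illustration in Python with made-up coordinates:

import numpy as np

coords = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0],   # group 0
                   [6.0, 7.0], [7.0, 6.0]])              # group 1
groups = np.array([0, 0, 0, 1, 1])

for g in np.unique(groups):
    print(g, coords[groups == g].mean(axis=0))   # centroid per group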

4.7 Contingency tables

A contingency table is a table of two categorical variables. They can be binary, nominal, or ordinal; sometimes the variables are interval or numeric if they have only few values, for instance, rating scales. The bottom row and the right-most column are called the margins of the table. The numbers at the bottom (61, 74, 33, 36) and at the right of the table (34, 126, 44) are the marginal column totals and the marginal row totals, respectively (Table 4.1).

Table 4.1 Example of a contingency table

                          Categories of column variable
Categories of
row variable               1      2      3      4    Marginal row totals
1                          4     14      8      8     34 (f1•)
2                         37     48     21     20    126 (f2•)
3                         20     12      4      8     44 (f3•)
Marginal column totals    61     74     33     36    204 (Grand total N)
                        (f•1)  (f•2)  (f•3)  (f•4)

The row variable has r row categories and the column variable has c column categories. The numbers in the table are usually observed frequencies. The frequency for a cell (i, j), defined by row category i and column category j, is written mathematically as f_ij. The marginal row totals are written as f_i• and the marginal column totals as f_•j. N = Σ f_ij, the sum over all i and j, is the total number of observations.

The row and column variables can be either related or independent. If they are independent, i.e. there is no relationship between them, the expected value for cell (i, j) is e_ij = f_i• × f_•j / N. If the variables are independent, the differences between the expected values under independence, e_ij, and the observed values, f_ij, are small. These differences are usually indicated by X_ij = (f_ij − e_ij)/√e_ij and are called the standardised residuals. The sum of the squared standardised residuals is equal to the X² statistic, which is often evaluated via the χ² distribution. There are slightly more sophisticated versions of the standardised differences (especially adjusted residuals), but they will not be used here (Table 4.2).

Table 4.2 Standardised residuals

        1      2      3      4
1     -1.9     .5    1.1     .8
2      -.1     .3     .1    -.5
3      1.9   -1.0   -1.2     .1
X² = 12.1, p = .06; st. res. < 2.0

'Independence' in a table also means that all rows are proportional to the marginal row totals. Thus, the probability of being in a particular column category is equal for each row category. Similarly, all column categories are proportional to the marginal column totals. 'Dependence' thus means that the probability of being in a row category is different for each column, and vice versa.

Given a significant overall test, standardised residuals above around 3 are taken to indicate that the combination of the row and column category occurs much more often than expected under independence, and residuals below around -3 indicate that the combination occurs much less frequently than expected under independence.


It is worth noting that the standardised residuals are sensitive to the overall frequency of a category, as indicated by the marginal totals. Thus a high cell frequency in a high-frequency row is less special than a high cell frequency in a low-frequency row, and the same holds for the columns. For example, a cell frequency of 100 is nothing special in a ten-category row with a marginal total of 1000, but it is in such a row with a marginal total of 105.
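The quantities of Tables 4.1 and 4.2 can be reproduced directly; a sketch in Python with SciPy, using the data of Table 4.1:

import numpy as np
from scipy.stats import chi2_contingency

f = np.array([[ 4, 14,  8,  8],
              [37, 48, 21, 20],
              [20, 12,  4,  8]])

chi2, p, df, e = chi2_contingency(f)      # e: expected values under independence
residuals = (f - e) / np.sqrt(e)          # standardised residuals
print(round(chi2, 1), round(p, 2), df)    # 12.1 0.06 6
print(residuals.round(1))                 # reproduces Table 4.2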

4.8 Convex hulls

A convex hull around a number of points is a set of connected lines within which the points are contained (see Figure 4.3, which also shows some centroids). The hull is only convex if it has no dents; compare it to a rubber band around some nails fixed on a board, which also cannot have a dent. Formally all points should be within the hull, but in the case of serious outliers I have preferred to leave the outliers outside the convex hull. For an applied example, see Figure 10.9, p. 225. Sometimes a convex hull is not closed, but only a line connecting a series of points. The line is convex as long as it does not have dents; see, for example, Figure 4.4. The green line would not be convex if it went through point E when we connect D with F.

Fig. 4.3 Convex hulls and centroids: Paintings by countries.
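Computing a convex hull does not have to be done by hand; a small sketch with SciPy and random, purely illustrative points:

import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(1)
points = rng.normal(size=(20, 2))      # points in a two-dimensional plot

hull = ConvexHull(points)
print(points[hull.vertices])           # the corner points of the hull, in order

Outliers one wants to leave outside the hull, as discussed above, would simply be excluded from the points before the hull is computed.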

4.9 Deviance plots

Fig. 4.4 Deviance plot: deviance versus degrees of freedom. Models on the left part of the curve have less deviance and thus fit better, but they require many parameters to do so, and leave fewer degrees of freedom. Such a model is more complex because all its parameters require interpretation.

Deviance is another word for lack of fit. It is a statistic that indicates how well or how badly a model fits the data; common statistics used for this purpose are the residual sum of squares, the X² statistic, the χ² statistic, and the likelihood ratio statistic. The deviance is used in a deviance plot, with the deviance on the Y-axis and the degrees of freedom on the X-axis. The degrees of freedom (df) = the number of data points minus the number of parameters in the model. The df are a measure of the simplicity of the model and the ease of its interpretation: the more degrees of freedom, the simpler the model and the easier its interpretation, because there are fewer parameters to interpret. The unfortunate situation is that in general models with many degrees of freedom have a lot of error, i.e. score high on deviance. What we would prefer is a model with little error and many degrees of freedom, but that is rarely possible; we have to settle for a compromise. The reverse formulation of the same situation is that we prefer a well-fitting model with few parameters, because each additional parameter in a model increases its complexity and makes it more difficult to interpret.

To help us select a good compromise (= a reasonably fitting model with a moderate number of parameters) we can make a deviance plot. It is a bit counterintuitive because it shows deviance rather than fit, and degrees of freedom rather than parameters. Thus the deviance plot is a graph in which models can be compared with each other so that an acceptable compromise can be worked out.


The deviance (or lack of fit) is graphed against the degrees of freedom (sparseness of parameters). In the deviance plot of Figure 4.4 we see at the right-hand side the simple models, which have a high deviance, and to the left the complex models, with a low deviance and few degrees of freedom. To make comparisons easier, we have connected the models in the graph with a convex line running from the worst-fitting model on the right to the best-fitting model on the left. The models on the curve are the possible compromises. Generally, models at the bottom right of the curve represent a good compromise: not too complex and with reasonable fit. In Figure 4.4 this would probably be model C or D. The curve connecting the compromise solutions is called convex because it has no dents, which it would have if we had connected model D with F via E. Model E is never preferred because there always is a better compromise; see Kroonenberg (2017). For an interpreted example see Figure 12.3, p. 263.
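A deviance plot is straightforward to draw once the deviances and degrees of freedom of the candidate models are known; a sketch in Python with hypothetical models A–F (the numbers are invented for illustration):

import matplotlib.pyplot as plt

labels   = ['A', 'B', 'C', 'D', 'E', 'F']   # models, complex to simple
df       = [10, 20, 30, 40, 50, 60]         # degrees of freedom
deviance = [5, 8, 14, 25, 60, 70]           # lack of fit

fig, ax = plt.subplots()
ax.plot(df, deviance, 'o-')
for lab, x, y in zip(labels, df, deviance):
    ax.annotate(lab, (x, y))
ax.set_xlabel('degrees of freedom')
ax.set_ylabel('deviance (lack of fit)')
plt.show()

In these invented numbers model E lies above the line from D to F, so, as in Figure 4.4, it would never be preferred.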

4.10 Discriminant analysis

Fig. 4.5 Discriminant analysis with two variables X1 and X2 and two groups with means M1 and M2. The graph shows that the difference between the means is largest for diff_DA, the difference along the discriminant axis, and that the differences between the means on X1 and X2 separately may not even be significant.

One of the measures to assess how well the data discriminate between the response groups via the discriminant functions is the canonical correlation. It fulfils the same function as the multiple correlation coefficient in regression. Its square provides the proportion of variation accounted for by each discriminant function; in other words, it is a measure of fit (Figure 4.5).
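A hedged sketch in Python with scikit-learn; the two-group data are simulated, and the shortcut used for the canonical correlation holds for two groups:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
g1 = rng.normal(loc=[0.0, 0.0], size=(50, 2))    # group 1 on X1, X2
g2 = rng.normal(loc=[1.0, 1.5], size=(50, 2))    # group 2 on X1, X2
X, y = np.vstack([g1, g2]), np.repeat([0, 1], 50)

lda = LinearDiscriminantAnalysis(n_components=1)
scores = lda.fit_transform(X, y).ravel()         # discriminant function scores

# with two groups, the canonical correlation equals the correlation
# between the discriminant scores and group membership
r = np.corrcoef(scores, y)[0, 1]
print(r, r ** 2)    # r**2: proportion of variation accounted for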

4.11 Distances

• Euclidean distance. In the normal course of things we measure the distance on a flat surface between two objects a = (a1, a2) and b = (b1, b2) with a ruler, or if you wish, as the crow flies. The technical term for this type of distance is the Euclidean distance: d(a, b) = √((a1 − b1)² + (a2 − b2)²). It is also the distance measure used in a multidimensional scaling plot (see Section 3.9.2). Distance in a two- or three-dimensional plot is an approximation to the distance in the full-dimensional space. The Euclidean distance is independent of the orientation of the coordinate axes, and the coordinate axes do not necessarily have an intrinsic meaning. An example where there is such a meaning can be found in the music styles Chapter 18, whereas this is not the case in the Chopin preludes Chapter 17.

• Inner product. In biplots (see Biplots) we want to show how high a point scores on a variable compared to another point. Such comparisons are based on inner products (also called scalar products); see Section 4.12 for how the inner product works and is defined. The comparison between points in a biplot works by first drawing a line (the biplot axis) through the origin (the mean of the variable) and the variable itself, extending the line on both sides of the origin. Then we project the points onto that axis. The order of the projections of the points on the biplot axis, with the highest values near the tip of the arrow, approximates the scores of the points (i.e. objects, individuals, etc.) on the variable. This process is illustrated in Figures 4.6 and 4.7. The perpendicular distance (in the graph indicated by a dashed red arrow) from a point to the line is irrelevant for the size of the inner product. For an example in the chapter on the agricultural history of Java see Figure 7.3, p. 170.

• Mahalanobis distance. The Mahalanobis distance is a way to measure how far an observation is from a multivariate cloud of observations. In principle it is like a Euclidean distance, but because the variables do not have the same scale and may be correlated, the distance has to be adjusted for these differences. Therefore this distance is expressed in numbers of standard deviations, and the correlations between the variables are taken into account.

• Chi-squared (χ²) distance. This distance is used in contingency table analysis to express how far a row (or column) profile is from the average profile. A row or column profile in a table is equal to the frequencies in that row or column divided by their respective row or column marginal total. Thus, if a row has frequencies (20, 5, 25), it has a marginal total of 50, and its profile is (.4, .1, .5). The average row profile is equal to the marginal column totals divided by the total frequency. All this is clearly explained in Greenacre (2017, p. 8ff).
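The numeric distances above can all be computed in a few lines; a sketch in Python, with invented data and profiles:

import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b))                  # 5.0: 'as the crow flies'

# Mahalanobis distance from a point to the centre of a correlated cloud
rng = np.random.default_rng(3)
cloud = rng.multivariate_normal([0, 0], [[2.0, 0.8], [0.8, 1.0]], size=200)
VI = np.linalg.inv(np.cov(cloud.T))     # inverse covariance matrix
print(mahalanobis(a, cloud.mean(axis=0), VI))

# chi-squared distance between a row profile and the average profile
row = np.array([20, 5, 25]) / 50        # profile (.4, .1, .5)
avg = np.array([0.3, 0.2, 0.5])         # average (column-margin) profile
print(np.sqrt(np.sum((row - avg) ** 2 / avg)))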

4.12 Inner products and projection

Projection. Suppose we have a vector V starting at the origin (0, 0) (see Figure 4.6). Its length is indicated by |V|. We also have a point p, which is the end point of a vector p; the length of that vector is |p|. To get p_V, the orthogonal projection of the vector p onto V, we drop a perpendicular line from p onto V (the red dotted line). The place where it lands is p⊥. The point p⊥ is the end point of the vector p_V (the red dashed vector), which lies along V. The length of this vector, from the origin to p⊥, is |p_V|. If we call the angle between the vectors p and V θ, then algebra shows that the length of p_V is |p_V| = |p| cos(θ).

Fig. 4.6 Orthogonal projection of p onto V.

Inner products. In this book projection is primarily used in connection with the inner product I(p, V) of the vector of point p and the vector of variable V. It is defined as the length of the projection of p onto the vector V multiplied by the length of V, i.e. I(p, V) = |p_V| × |V| = |p| cos(θ) |V|. The formula for the inner product in terms of coordinates of a two-dimensional plane is

I(p, V) = |p| cos(θ) |V| = p1 × V1 + p2 × V2,    (4.1)

where p1 and p2 are the coordinates of p, and V1 and V2 are the coordinates of V. In a biplot the line through the origin (0, 0) and a variable V is called a biplot axis. The values of the inner products are used to compare the scores of points p, q, r on a biplot axis such as V. Visually this can be done via the positions of the projections of the points (p⊥, q⊥, r⊥) onto the biplot axis. If these values are shown on the biplot axis, the axis is calibrated (Greenacre, 2017).

As r is perpendicular (= orthogonal) to V, i.e. θ = 90° and cos(90°) = 0, the inner product I(r, V) = 0, and thus r projects exactly onto the origin (0, 0), i.e. onto the mean of V. Projections onto V on the other side of the origin from V have negative values, i.e. they indicate scores below the mean of V. Thus, in this case p scores the highest on V, then r, and q has the lowest value. For an example see Figure 10.6, p. 222.

Fig. 4.7 Inner or scalar product.
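Equation (4.1) and the projection can be verified numerically; a small sketch in Python with an invented point and variable vector:

import numpy as np

V = np.array([3.0, 1.0])     # variable vector defining the biplot axis
p = np.array([2.0, 2.0])     # a point in the biplot

inner = p @ V                # inner product: p1*V1 + p2*V2 = 8.0
proj_len = inner / np.linalg.norm(V)         # |p| cos(theta): projection length
print(inner, proj_len * np.linalg.norm(V))   # identical, as in Eq. (4.1)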

4.13 Joint biplots

Standard biplots display two kinds of quantities: the row profiles and the column profiles of a contingency table. For another type of biplot the matrix consists of appropriately scaled coordinates of components from a principal component analysis (see also Scaling coordinates). In three-mode pca a biplot, known as a joint biplot, is constructed from the components of two of the modes (the display modes) for each component of the remaining mode (the reference mode). For example, if the third mode, the reference mode, has two components, c1 and c2, then there are two joint biplots, one for each component. They consist of the components of display modes 1 (a1 and a2) and 2 (b1 and b2). The construction formula is A G_r B′, where r = 1, 2; G_r is a slice from the core array G, which is further decomposed via a singular value decomposition. Examples of the application and interpretation of joint biplots can be found in Sections 10.5.2 and 17.5.5. For further details about joint biplots see Kroonenberg (2008, Section 11.5.3).

4.14 Means plot with error bars, line graph, interaction plot

In various circumstances means of different groups, different time points, etc. need to be compared visually. An example is the line graph in Figure 4.8. In addition, results from analyses of variance, especially the shapes of interactions, can benefit from visual inspection via interaction plots. A collective name for these graphs is means plot. To properly evaluate these plots, the variability around the means needs


to be presented as well. Commonly, 95% confidence intervals around the mean are used for this purpose. These intervals extend 1.96× the standard error on both sides of the mean. However, in a mean plot these confidence intervals are too wide for visual testing of equality. Instead, error bars around the mean are used; these error bars are equal to 1.5 × the standard error of the mean, as, for instance, recommended by Basford and Tukey (1999, p. 266). If the error-bar-based intervals around two means overlap, the means are not significantly different at an α level of approximately .05. Thus, a mean plot with error bars can be used to get an overall view of the reliable differences between means.

Fig. 4.8 Line graph with means and error bars. The groups in the plot are accessions or seed samples of cultivars, i.e. cultivated plants; the lines in the graph represent characteristics of these cultivars.

Figure 4.8 shows a line graph from an agricultural paper with eight accessions or cultivar samples, and seven characteristics of these cultivars indicated by abbreviations. For every characteristic each accession has a mean with corresponding error bars of 1.5× the standard error of the mean on both sides. As indicated above, using this type of interval around the mean allows for pairwise comparisons of significant differences.


The red line in the figure connects the means of variable sw. The error bars for the means of this variable just about overlap for groups 1, 3, and J, and overlap even more for groups 5, 6, and H, but the two sets of means do not overlap at all. The means of the variables df, ns, ph, and y follow a common pattern over the groups, and per group most of their error bars overlap. Note that dm also shows the same pattern, but its error bars do not overlap at all with those of the other four variables. Thus, the plot shows both detailed comparisons of the means and general trends. There is no point in going into the content here. What is important is that the figure gives an overall view of 7 × 8 (= 56) means and their differences at a glance—a feat difficult to replicate in any other way.
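A means plot with 1.5× standard-error bars is easy to produce; a sketch in Python with simulated groups (the data are invented):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(4)
groups = [rng.normal(loc=m, size=25) for m in (5.0, 5.4, 6.5)]

means = [g.mean() for g in groups]
bars = [1.5 * stats.sem(g) for g in groups]   # 1.5 x standard error of the mean

fig, ax = plt.subplots()
ax.errorbar(range(3), means, yerr=bars, fmt='o', capsize=4)
ax.set_xticks(range(3))
ax.set_xticklabels(['group 1', 'group 2', 'group 3'])
ax.set_ylabel('mean')
plt.show()

Non-overlapping bars then indicate a difference at an α level of approximately .05, as described above.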

4.15 Missing rows and columns

This section is a bit of an oddity as it does not really concern statistics. It explains how we may handle complicated datasets consisting of two (or more) matrices in which whole rows or columns are missing, for instance, because measurements are available for two different years. Such situations can easily occur in historical data, where data collectors at different points in time have collected partly different forms of data in the same situation (see, for instance, Chapter 7).

Suppose that for each year there is a matrix of Individuals × Variables, say Publishers × Book types. In each year some book types are missing (Figure 4.9): publishers B and D did not publish types 1–10 in 1880, and publisher C did not publish types 11–15 in 1920. Depending on the research questions, we have to make one or more decisions about the appropriate way to analyse such data formats with completely missing rows and/or columns. Figure 4.9 shows two examples of ways to arrange such data. In particular, decisions have to be made about which book types and which publishers to use when the data matrices are combined before analysis. This of course depends on the research questions at hand. There are at least two options.

Separate analyses. Each of the time points is analysed separately, and the results are compared with each other. It is obvious that the comparison will not be an easy task due to the wholly missing rows and columns for the two years. Imputation (see Chapter 2, p. 38) is not really an option because there is so little overlap. How to proceed will all depend on the specific research question. Maybe just restricting ourselves to these separate analyses could be a reasonable but limited option.

Joint analyses. The time points are analysed together in a single analysis by putting the two matrices next to each other, creating a wide matrix of publishers by (book types × years) (Figure 4.9—left). In our example this would be a 2 × (10 + 15) design without missing data, or a 5 × (10 + 15) design including three publishers with blocks of missing data.


Fig. 4.9 Data arrangements. Five publishers (A, ..., E) by fifteen book types with two classes, 1, ..., 10 and 11, ..., 15, respectively. An X indicates that one or more rows are missing. In 1880 there were no books published in classes 11–15.

Alternatively, we could create a long matrix by stacking the two matrices on top of each other, thus (publishers × years) × book types, if there are not too many book types missing (Figure 4.9—right). In our example this would be a (3 + 4) × 10 design without missing data, or a (3 + 4) × 15 design with three publishers having blocks of missing values in 1880.
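The wide and long arrangements of Figure 4.9 correspond directly to column-wise and row-wise concatenation; a sketch in Python with pandas (the counts are invented, and the publisher and type labels follow the figure):

import numpy as np
import pandas as pd

# hypothetical counts: publishers by book types for two years
y1880 = pd.DataFrame(np.arange(6).reshape(2, 3), index=['A', 'C'],
                     columns=['type1', 'type2', 'type3'])
y1920 = pd.DataFrame(np.arange(8).reshape(4, 2), index=['A', 'B', 'D', 'E'],
                     columns=['type11', 'type12'])

wide = pd.concat([y1880, y1920], axis=1)   # publishers x (book types x years)
long = pd.concat([y1880, y1920], axis=0)   # (publishers x years) x book types
print(wide)   # NaN marks the blocks of missing data
print(long)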

4.16 Multiple regression

Suppose we want to perform a multiple regression with one response variable y and two predictors x1 and x2. The regression equation has the form Response = Model + Residual; the residual is also referred to as the error. The model consists of a linear combination of the constant 1 and the two predictors x1 and x2:

y = b0 × 1 + b1 × x1 + b2 × x2 + e = ŷ + e = y_implied + e.

In the population the regression weights b0, b1, and b2 are the parameters, mostly written with the Greek letter β, and therefore they are often called β weights. The ŷ or y_implied are the y values estimated on the basis of the regression model, and the e are the residuals. The basic assumption is that the residuals cannot predict the scores of y_implied and thus are independent of them.
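A sketch of such a regression in Python with statsmodels; the data are simulated with known weights, so this is an illustration, not one of the case studies:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x1, x2 = rng.normal(size=100), rng.normal(size=100)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(size=100)   # known b0, b1, b2

X = sm.add_constant(np.column_stack([x1, x2]))   # adds the constant 1
model = sm.OLS(y, X).fit()
print(model.params)      # estimates of b0, b1, and b2
print(model.rsquared)    # squared correlation of y_observed and y_implied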


Backward elimination. An often-used procedure for selecting the relevant predictor variables in a multiple regression analysis is backward elimination. It is common to first specify a model containing all interesting predictors, and then successively remove (hopefully irrelevant) ones, provided they do not decrease the fit too much. This process, called stepwise backward elimination, stops when no more variables can be removed without seriously decreasing the fit of the model to the data. Although there is considerable opposition to this procedure from various quarters, especially when the elimination is done automatically, it can be used to deal with multicollinearity; but this should be done with great care, and expert statistical advice is recommended. More technical details about variable selection and arguments against automated stepwise procedures can be found in Harrell Jr (2015). His book also contains other advice on performing regression analyses, and refers to Derksen and Keselman (1992) for studies on automatic stepwise variable selection.

Example of a multiple regression equation with four predictors:

Cost of a painting = constant + b1 × Beauty + b2 × Provenance + b3 × Fame of the artist + b4 × Size of the painting + e.

Mathematically written as y = ŷ + e, or y_observed = y_implied + e. The fit of the model is the squared correlation between y_observed and y_implied.

Residual plots. It is important in regression analysis and many other techniques to inspect the residuals. A residual plot of the (standardised) predicted values against the (standardised) residuals is a powerful way to do so. The residuals and the predicted values are independent in linear multiple regression, and this will show up in a residual plot. When there is a nonlinear association between the variables in the regression analysis, or there are nonnormal variables, residual plots will show evidence of this; see Figures 4.10(a) and 4.10(b) for two simple examples without much further explanation.

Further details on multiple regression and residual plots can be found in most specialised books on applied regression analysis, such as Harrell Jr (2015), and in standard textbooks on applied multivariate analysis, such as Field (2017), Hair et al. (2018), and Tabachnick and Fidell (2013). Regression is frequently mentioned in this book, but it is only sparingly used on its own. In Chapter 18 it is used in the form of logistic regression, and it is also featured to show the use of supplementary variables.


Fig. 4.10 Residual plots. Horizontal axes: standardised predicted values; vertical axes: standardised residuals. (a) Top: residual plot of an acceptable regression analysis; the residuals have (more or less) equal spread for increasing predicted values (ŷ). (b) Bottom: residual plot of a questionable regression analysis; the residuals have a funnel shape: their spread increases with increasing predicted values (ŷ).
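Continuing the regression sketch above, a residual plot takes only a few lines (again an illustrative sketch; model is the fitted OLS object from the earlier snippet):

import matplotlib.pyplot as plt

z = lambda v: (v - v.mean()) / v.std()     # standardise

fig, ax = plt.subplots()
ax.scatter(z(model.fittedvalues), z(model.resid), s=10)
ax.axhline(0, linestyle='--')
ax.set_xlabel('standardised predicted values')
ax.set_ylabel('standardised residuals')
plt.show()

With well-behaved data this shows the horizontal band of Figure 4.10(a); a funnel as in Figure 4.10(b) signals trouble.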

4.17 Multivariate, multiple, multigroup, multiset, and multiway

There are several terms used in connection with the word multi that are often confused; in all cases the word multi means many.

∗ multivariate and multiple, as in multivariate analysis of variance and multivariate multiple regression. 'Multivariate' refers to several response variables and 'multiple' to several predictors.

∗ multigroup refers to objects, individuals, etc. that can be partitioned into several clusters or groups.


∗ multilevel refers to a situation where individuals are grouped according to membership of a higher-level group; for example, students (lower level) form groups, with each group taught by one instructor (higher level); the instructors may themselves be grouped by department (highest level).

∗ multiset refers to a situation where the data consist of several groups or sets of different people. For instance, there are several samples, each measured under different conditions on the same variables. This is different from a multiway dataset, because multiway data are fully crossed: each subject is measured on the same variables in each condition, a repeated-measures design. A multiset design is not a repeated-measures one, because in each condition there are different people.

∗ multiway refers to a data format that consists of a fully crossed design of three variables, e.g. hair colour × eye colour × gender, but also paintings × characteristics × century. Thus, three-way data form a box or block of data.

∗ multinomial refers to the probability distribution of variables with several categories (multicategorical variables).

∗ multidimensional refers to matrices with more than one column, which can geometrically be displayed in more dimensions: a plane for two dimensions, a cube for three dimensions, a hypercube for more dimensions. The term is also used to indicate that a variable is assumed to be determined by more than one theoretical concept; for instance, the score on an item in an intelligence test is assumed to be determined by both mathematical and verbal ability, each of which is considered to be an aspect of general intelligence.

4.18 Quantification, optimal scaling, and measurement levels

There are times when research questions seem to require a regression analysis or principal component analysis, but the dataset consists of variables with different measurement levels, or there are nonlinear relationships among the variables. Standard techniques are inadequate in such situations, because they rely on numeric measurement levels and often also on normality assumptions. Moreover, they perform unsatisfactorily when there are not many observations, as is often the case in the humanities. In such cases it is difficult to satisfy the statistical requirements for standard multivariate techniques. Powerful solutions to such problems are available in the form of categorical analysis methods for both regression and component analysis (and several others); see for example Linting et al. (2007). The core idea is to first treat all variables as categorical, and then assign new numeric values to the original values of the variables so that they can be analysed by standard numeric techniques. The process is called quantification, and the method to realise the quantification is called optimal scaling; the two terms are, however, often used as synonyms.

In standard principal component analysis, linear combinations or components of the original variables are derived so that they successively explain as much variance as possible.


This is realised by weighting the variables such that the average of their correlations with the component is as large as possible. For example, such a linear combination of three variables (a component c) has the form c = w1 × x1 + w2 × x2 + w3 × x3, where the weights w1, w2, and w3 have to be found such that the weighted sum c captures as much of the variance of the three variables as possible. In order to find the best component given categorical variables, we not only seek the weights w1, w2, w3 for the linear combination, but also numeric values for the categories. Not just any kind of quantification is appropriate, because the quantifications have to be allowed by the measurement level of a variable. For example, the quantification of ordinal variables has to respect the ordinality, so that the newly quantified values are also properly ordered. Thus the values 1, 2, 3, 4 may, for instance, be quantified as .05, .11, .12, and .25, but not as .05, .25, .12, .11, because the latter quantification is not in accordance with the original order of the categories. The way the different measurement levels affect quantification is as follows; a small algorithmic sketch follows at the end of this section.

◦ Nominal variables. To start with, we know that there are no relative relationships defined between the categories of a truly nominal variable (for instance, Fruit with labels apple, pear, grape, lemon, and orange). Numbers that are a priori assigned to them can only be mere labels. However, we can attach a numeric value with a particular meaning to each fruit. For instance, the categories of Fruit could be assigned numbers which indicate the sweetness experienced by people, say grape is 10 and lemon is 2, with the other fruits in between. These values can then be used in a study to find out which chemicals are responsible for their sweetness. This would not be possible with the labels of the categories. Any transformation of the categories is allowed, irrespective of the analysis technique. In principal component analysis the quantification may be different per component; in this case we refer to multiple nominal quantification. Multiple quantification is not possible for other measurement levels.

◦ Ordinal variables. A similar process is possible for the categories of an ordinal variable, but in this case the assigned values are restricted by their measurement level, i.e. rank order. As an example, take the sequence in which athletes cross the finish line in the marathon. The 1, 2, 3, 4 rank order may be arbitrarily transformed to 1, 3, 5, 9, because this transformation does not violate the order in which they crossed the finish line; it is a monotone transformation. This does not make any difference to their positions on the podium, as it is only the finishing order which is relevant, and not their exact times: the sequence of arrival of numbers 1 and 2 may have been decided by a photo finish, and number 3 may have crossed the finish line 15 minutes later.

◦ Numeric variables. Values of quantified variables must maintain their numeric character, as happens with the logarithmic transformation. Sometimes it is necessary for an acceptable quantification to first categorise numeric variables. There are three kinds of numeric variables: interval, ratio, and counted variables.

◦ Interval variables. For an interval variable the relative distances between values are numeric. A quantification must honour this, and in an analysis the relative sizes of the numeric differences between the values have to stay the same.


For instance, we want the variable 'bad versus good' to be an interval variable also during analysis. Therefore, the original intervals between 1 2 3 4 5 can only be quantified such that after quantification the relative distances between the values stay the same, for instance, 1 3 5 7 9.

◦ Ratio variables. These are interval variables with a genuine zero point; for instance, one can speak of one person reading twice as fast as another person (ratio 1:2). In this case we do not need quantification, because the variables are already quantified. However, in practice we might want to reduce the number of values to be analysed by binning (p. 105), or modify the existing values in order to remove nonlinear relationships with other variables, as we did with a logarithmic transformation in the example of family income and school readiness (p. 73).

◦ Counts. Variables that are counts or frequencies (such as 1, 2, 10, 100) are a special kind of numeric variable. They are discrete, not continuous, but they have a ratio character. Counting has a ratio scale, and there is nothing wrong with saying that one has one-and-a-half apples, or two times as many as someone else, but the actual counting process concerns complete objects. Calculating means and standard deviations is no problem, but the results do not refer to whole objects.

Techniques to find the proper quantifications consist of two alternating steps: the optimal-scaling step and the model-fitting step. The idea is that those quantifications are sought that provide the best solution for the problem at hand. However, the technicalities of the quantification techniques and their computation will not be explained here. They are clearly a mathematicians' puzzle, which we will happily leave in their capable hands; Gifi (1990) is the standard reference. One may also consult the paper by Linting et al. (2007) as a more limited, more accessible option.
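The promised sketch: the alternation of a model-fitting step and an optimal-scaling step can be illustrated for a single component with ordinal variables, using isotonic (monotone) regression for the scaling step. The data are random ordinal scores, and this is only a toy version of the algorithms in Gifi (1990), not production code:

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(6)
X = rng.integers(1, 5, size=(100, 3)).astype(float)   # ordinal categories 1-4

Q = (X - X.mean(0)) / X.std(0)        # initial quantifications: standardised
for _ in range(50):
    # model step: first principal component of the current quantifications
    u, s, vt = np.linalg.svd(Q, full_matrices=False)
    c = u[:, 0] * s[0]                # component scores
    # optimal-scaling step: monotone re-quantification of each variable
    for j in range(Q.shape[1]):
        q = IsotonicRegression(increasing='auto').fit_transform(X[:, j], c)
        Q[:, j] = (q - q.mean()) / q.std()   # re-standardise (assumes q varies)

# correlations of the component with the quantified variables
print(np.corrcoef(np.column_stack([c, Q]).T)[0, 1:])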

4.19 Robustness

Over the years statisticians have developed methods to improve analyses of variables with nonnormal distributions, such as skewed distributions with long tails and outliers, and distributions with sharp peaks, either by developing procedures to make distributions more normal or by making techniques less sensitive to anomalies in the data. In (possibly unintended) marketing terminology these are sometimes called modern statistical procedures. An important line of research in this area is the development of robust procedures. The term robust implies that the techniques give solid results that are not very sensitive to minor instabilities of the solution and irregularities in the data, such as outliers, skewness, etc. One of the basic approaches is to first calculate robust means, standard deviations, and correlations, and then use either standard techniques or specialised ones which incorporate the robust statistics themselves.


For this book, I emphasise the importance of careful data inspection and of conceptual decisions about what to do with irregularities in the data. An obvious continuation of such an approach is to carry out multivariate analyses with and without measures taken to combat irregularities and instabilities, and to evaluate the differences on the basis of substantive knowledge. In addition, we may make use of the services of an expert in robust statistics to include such techniques in the analyses. There is a considerable collection of robust multivariate techniques, but they are not included here; that might have to wait for a next edition or another author. Therefore, a full discussion of robustness unfortunately falls outside the realm of this book, as does a discussion of nonparametric methods. An entirely different approach is that of nonlinear multivariate analysis using optimal scaling or quantification, which figures several times in this book; for a detailed and statistically more demanding book-length treatment see Gifi (1990). An extensive specialised book on robustness with many R snippets is Wilcox (2016), but it is not geared towards the audience of this book. A reasonable start, which also includes some historical information about the role of the normal distribution, can be found in Liang and Kvalheim (1996).
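As a small taste of the robust building blocks mentioned above, a sketch in Python with SciPy and one invented sample containing a gross outlier:

import numpy as np
from scipy import stats

x = np.array([2.1, 2.4, 2.2, 2.5, 2.3, 9.8])    # one gross outlier

print(x.mean(), np.median(x))                   # mean dragged up; median not
print(stats.trim_mean(x, 0.2))                  # 20% trimmed mean
print(stats.median_abs_deviation(x, scale='normal'))  # robust alternative to sd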

4.20 Scaling coordinates

Among the core output of principal component analysis, correspondence analysis, factor analysis, and related techniques are components, factors, axes, etc. They are used for plotting analysis results, such as component scores and component loadings. Within this context the coefficients are referred to as the coordinates of the units that are plotted. Such coordinates can be scaled in various ways: standard coordinates, principal coordinates, and symmetrical coordinates.

◦ Standard coordinates. Standard coordinates c_ip of component p, for the I objects i = 1, ..., I, have a variance of 1. The length (L) of such a component is L = Σ_{i=1}^{I} c_ip² = 1. Often they have a mean of zero as well, so that the components are standardised.

◦ Principal coordinates. Principal coordinates of a component p have no a priori mean, but their variance is equal to the pth eigenvalue λp, which is the square of the singular value φp. In correspondence analysis, λp is referred to as the principal inertia of the pth principal axis.

◦ Symmetrical coordinates. In correspondence analysis biplots, the axes for the rows and columns are sometimes both scaled by √φp, and hence this is called symmetric scaling. It should, however, be noted that there are other definitions of symmetric scaling.


4.21 Singular value decomposition

A matrix X is an arrangement of categorical or numeric data in R rows and C columns. A data value is indicated by x_ij. The value x_ij is a score of an object (i) on a variable (j), but it can also be a frequency indicating how often the combination of row i and column j occurs, say how many children have blue eyes and blond hair. A matrix X of objects by variables always has a structure, which can be found by the so-called singular value decomposition. The structure consists of scores for the rows, U, on what are called components; scores for the columns on the same components, V; and a square diagonal matrix, Φ, which has the standard deviations of the components on the diagonal. The diagonal elements are also called singular values; the columns of U are the left singular vectors, and the columns of V the right singular vectors.

A component is a weighted linear combination of the original variables. For example, c1 (Cost of a painting) = w1 × v1 (Beauty) + w2 × v2 (Provenance) + w3 × v3 (Fame of the artist) + w4 × v4 (Size of the painting). The weights wj are derived such that the average of the four correlations between c1 and each of the vj (j = 1, ..., 4) is as high as possible.

The common way to write the singular value decomposition is X = UΦV′. This series of matrices has a special way of multiplication, which does not concern us at this moment. The equation may also be written as X = [UΦ][V]′ or X = U[VΦ]′. The U are standard scores or standard coordinates, because the columns (the components) of U have length 1; should the variables have mean zero, the components will also have mean zero. The VΦ are principal coordinates, and they can be shown to be the correlations between the variables and the components. It is also possible to have the rows in principal coordinates, [UΦ], and the columns in standard coordinates, [V]. It would go too far to go into this issue in more detail here; the different scalings of the components are also discussed above under Scaling coordinates.

The importance of the structure of a matrix is that with row coordinates and column coordinates we can make plots of the similarities of the row entities, the column entities, and their mutual relationships via a so-called biplot; see Biplots. The story above has been couched in terms of numeric variables, but it can also be applied to contingency tables, in particular in connection with correspondence analysis. This will be done in the case study chapters when the need arises.
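The decomposition and the coordinate scalings can be checked numerically; a purely illustrative sketch in Python with random data:

import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(20, 4))
X -= X.mean(axis=0)                    # centre the columns

U, phi, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

print(np.allclose(X, U @ np.diag(phi) @ Vt))   # X = U Phi V'

row_standard = U                       # standard coordinates for the rows
col_principal = V * phi                # principal coordinates for the columns
print(np.allclose(X, row_standard @ col_principal.T))   # X = [U][V Phi]'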

4.22 Structural equation modelling—SEM

A structural equation model is specifically appropriate for modelling covariance matrices. Such a model consists of latent variables and observed variables, and describes the specific relations between them. The parameters of this theoretical model are estimated with the available observations, and it is tested whether the covariances computed on the basis of the model approximate the observed covariances.


A confirmatory factor analysis model is an example of a structural equation model; see Chapter 3, p. 73, for more details on confirmatory factor analysis.

Because a structural equation model (sem) always contains one or more latent variables, and there is no way to measure latent variables directly, the structure of a sem must be hypothesised and tested. We can only determine whether a theoretical model is acceptable by examining the consequences of the hypothesised structure via estimating its parameters, for example, by estimating the factor scores in a factor model. If these estimated parameters and the equations for the model are appropriate and lead to reasonable outcomes, we can have some faith that the hypothesised model is plausible. We cannot prove that it is true.

The basic numeric information to assess the model consists of the observed covariance matrix S and the model-based covariance matrix Σ. The latter can be computed from the estimated parameters of the model. The theoretical covariance matrix Σ should be compared with the observed matrix S. If the two covariance matrices have a close fit, we say that the model is plausible. If not, we have to devise another model that does a better job. There are many different fit measures, and many ways in which the parameters of the model can be determined. A standard way to determine the parameters of the model, i.e. what exactly the model looks like, is via maximum likelihood estimation, used in nearly all computer programs for this purpose. I will skip the mathematical description, but details can be found in Everitt and Dunn (2001), and in the general multivariate analysis books by Field (2017), Hair et al. (2018), and Tabachnick and Fidell (2013).

The fit of a structural equation model must always be compared with that of other plausible models. Another recommended approach is to perform cross-validation, i.e. split the sample in two, develop the model on one part, validate it on the other part, and vice versa. You will, however, need a large number of observations for this, probably 200–300 for each part. Below are a few recommended fit statistics from a frequently cited paper, Schermelleh-Engel et al. (2003):

∗ Chi-squared statistic (χ²). The mean of a chi-squared distribution is equal to its degrees of freedom (df), and its variance is equal to 2 × df. Often the statistic χ²/df is used as a rough measure of fit, because a value ≤ 2 indicates that the statistic is smaller than 2× the mean of the distribution. A value ≤ 3× the mean of the distribution indicates that the statistic is smaller than or equal to most of the values under the null hypothesis of no difference between the observed covariance matrix and the implied covariance matrix.

∗ Root mean square error of approximation (rmsea). Preferred values are less than or equal to 0.05; they should be at least below 0.08 for a good fit.

∗ Standardised root mean square residual (srmsr). Values should be at most 0.08 as an indication of good fit. The statistic more or less measures how far the implied mean correlation is from the observed mean correlation.

∗ Comparative fit index (cfi). A value above 0.95 is preferred.

∗ Akaike information criterion (aic). In principle, the lower the better, but there is no need to be very picky about this.


It should be realised that all these statistics are influenced by the number of observations and the number of parameters. A well-cited survey of these and many more fit measures can be found in Schermelleh-Engel et al. (2003). Everitt and Dunn (2001) cite a warning from Cliff (1983):

... beautiful computer programs do not really change anything fundamental. Correlational data are still correlational, and no computer program can take account of variables that are not in the analysis. Causal relations can only be established through patient, painstaking attention to all relevant variables, and should involve active manipulation as a final confirmation.

4.23 Supplementary points and variables

∗ Supplementary points are part of a dataset, but are not used in the analysis proper. After the main analysis has been carried out, such points are added to a biplot on the basis of their observed values. A supplementary point may also be a profile of a row category from an analysed contingency table.

∗ Supplementary variables can also be added to a biplot, to visualise how these variables are related to the variables on which the biplot is based. There are two types:

∗ Numeric variables. Even though the supplementary variable does not play a role in the analysis itself, its correlation with each of the coordinate axes can be calculated afterwards. The common way to find out where the biplot axis should be drawn in the plot is to regress the standardised supplementary variable on the standardised coordinate axes; the standardised regression coefficients can then be used to determine its position in the plot (see the sketch below).

∗ Categorical variables. The categories of a supplementary categorical variable can be shown in a biplot by calculating the category means for each coordinate axis, and using these means as coordinates of the categories. If desired, a t test or an anova can also be performed, with the categories of the supplementary variable as predictor and the coordinate axes as response variables. For an example see Figure 16.5, p. 324.
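The regression recipe for a supplementary numeric variable takes one line of linear algebra; a sketch in Python in which the coordinates F and the supplementary variable s are simulated (both names are my hypothetical choices):

import numpy as np

rng = np.random.default_rng(8)
F = rng.normal(size=(50, 2))     # standardised coordinates on two biplot axes
s = 1.5 * F[:, 0] - 0.5 * F[:, 1] + rng.normal(scale=0.3, size=50)

s_std = (s - s.mean()) / s.std()
beta, *_ = np.linalg.lstsq(F, s_std, rcond=None)
print(beta)   # direction of the supplementary biplot axis in the plot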

4.24 Three-mode principal component analysis (TMPCA)

The essence of three-mode principal component analysis is that it is a component analysis for three-way data (X), often referred to as a three-way data array. The model defines a separate component matrix for each mode, i.e. one each for, say (A) paintings, (B) type of cracks, and (C) judges (see Figure 4.11). To tie all components together there is a three-way core array (G), whose values indicate how strongly the components of the three modes are linked.


Fig. 4.11 Three-mode principal component analysis. Separate sets of components for Paintings (A), Characteristics of cracks (B), and Judges (C), plus the core array G linking the components of the three ways. In the technical literature three-way arrays like the three-way data array and the core array are often written bold and underlined: X and G, respectively.

The technical model formula for three-mode pca is

X = A G (B′ ⊗ C′),    (4.2)

where ⊗ is the Kronecker product. A standard way to compute the parameters of the model is by alternating least squares; see, for example, Kroonenberg (2008). Because we use a principal component technique for each of the modes, the first components account for most of the variance, which mostly results in their being most strongly linked. Which other components of the three modes have strong links depends entirely on the data. Many more details and applications can be found in Kroonenberg (2008); see also Smilde, Bro, and Geladi (2004) for chemical applications. Many references can be found by searching for multilinear and tensor on the Internet.
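Equation (4.2) written element-wise is x_ijk = Σ_pqr a_ip b_jq c_kr g_pqr, which can be evaluated directly; a sketch in Python with random component matrices (the sizes are invented, echoing the paintings example):

import numpy as np

rng = np.random.default_rng(9)
I, J, K = 10, 6, 4     # paintings, crack characteristics, judges
P, Q, R = 2, 2, 2      # numbers of components per mode

A = rng.normal(size=(I, P))
B = rng.normal(size=(J, Q))
C = rng.normal(size=(K, R))
G = rng.normal(size=(P, Q, R))      # core array linking the components

# X_ijk = sum over p, q, r of A_ip * B_jq * C_kr * G_pqr
X = np.einsum('ip,jq,kr,pqr->ijk', A, B, C, G)
print(X.shape)   # (10, 6, 4)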

4.25 X² test (χ² test)

A continuation of Contingency tables (p. 108). A common measure to test the independence of the row and column variables is Pearson's X² statistic. The standard statistical method to quantify the lack of independence is to calculate for each cell the standardised difference between the observed and expected frequencies, the squares of which sum to the X² statistic. The standardised differences may, in terms of size, be compared to standard normal z scores. In a formula, the X² statistic is

X² = Σ_{all cells} (standardised residual_ij)² = Σ_{all cells} X_ij² = Σ_{all cells} {(f_ij − e_ij)/√e_ij}².

The X² statistic is often called the χ² statistic, because the discrete distribution of X² is generally compared to the continuous χ² distribution. The difference between the discrete and the continuous distribution leads to all kinds of difficulties in testing hypotheses, but this can be left for a proper statistics book to explain.


What is important here is that the X² test is used when we want to establish whether the row and column variables are independent or related. Small p values indicate that there is most likely a relationship. The X² statistic tells us how close the observed frequencies are to the frequencies expected under independence of the row and column variables; a small value indicates that independence is a reasonable proposition, a large value that there is probably a relationship between the two variables. A χ² distribution is characterised by its degrees of freedom (df). As the mean of a χ² distribution is equal to the df, χ²/df is often used as a fit measure in model fitting; the variance of the χ² distribution is 2df. In a contingency table with r rows and c columns, df = (r − 1) × (c − 1), and a χ²/df ≤ 2 generally indicates that the model of independence between the row and column variables need not be rejected.

Part II

The Scenes

Chapter 5

Similarity data: Bible translations

Abstract Main topics. Many Bible translations from different times are based on the same Masoretic text. How alike are these translations, and how many differences have emerged over time? Data. The data for this study were collected by Zachary Bleemer from British, American, and German translations. First, Bible verses were matched, and subsequently parallel words were counted. Research questions. How can the similarities between texts based on the word counts be analysed and graphed, so that conclusions can be drawn with respect to influence between the various translations? Statistical techniques. Multidimensional scaling, cluster analysis. Keywords Bible translations · Hebrew Old Testament · Masoretic text · Tanakh · King James Bible · Similarity coefficients · Multidimensional scaling · Cluster analysis · Centroid-linkage clustering

5.1 Background

An intriguing question¹ regarding language development is whether words have the same meaning and connotations at different points in time. Did the imagery used by Shakespeare in his plays and sonnets evoke the same sentiments and associations in his contemporaries as they do in readers nowadays? If we assume that cultural expressions such as translations are locked in the context of their own time, is there a way in which such an issue can be researched?

Such questions may be even more relevant when people interpret Bible texts and discuss the relevance of the texts in their own times. This point is reinforced when we consider that the Bible is known by virtually everyone in their own vernacular, so that its interpretation is often time and language dependent. The same thing can be seen in many Western paintings of religious events which took place in Jesus' time; these are portrayed as if situated in the painter's own time and surroundings, which is especially evident in the clothing of the depicted persons and the depicted landscapes.

Translations of the Hebrew Bible, the Tanakh,² are an interesting opportunity to get a view on what, according to the translators, was being described and expressed by the text. As over time the same basic Hebrew (Masoretic) text has been translated several times into several languages, shifts in interpretation can be compared and evaluated through these translations. An advantage of the Old Testament is its length and narrative variation: most of the interesting concepts appear in a wide variety of contexts throughout the text.

The study presented here is part of a project by Zachary Bleemer in which aesthetic concepts such as beauty and handsomeness, employed by translators from different countries or centuries, are assessed. Bleemer (2016, p. 3) discusses the idea that over time attributes of objects may be perceived differently, and thus translated into different adjectives. He then postulates that such differences are a reflection of the aesthetic vocabulary available, and perhaps of the 'latent aesthetic categories' that beautiful objects can belong to, according to those translators. This is generally not an idiosyncrasy of a single translator, because translations were generally done by teams of scholars. Moreover, Bibles in the vernacular were intended for the general population and thus had to be translated into the then current common variant of the language. Therefore, it seems safe to assume that the translations represent, if not the common language of the period, then at least the language which would be commonly understood at the time.

¹ This chapter was co-authored by Zachary Bleemer, Department of Economics, University of California, Berkeley, http://zacharybleemer.com/.

5.2 Research questions: Similarity of translations

In this chapter we will concentrate on one aspect of Bleemer's study: the extent of the variability of the translations, and the question whether period effects are reflected in the available translations. These queries refer to the more general questions posed in the original Bleemer (2016) study, which focusses on one particular Hebrew word (yaphah, to be beautiful), the way it is handled in different translations, and what this has to tell us about social and cultural differences between people over time.

² The Tanakh [..] is the canonical collection of Jewish texts, which is also the textual source for the Christian Old Testament. These texts are composed mainly in Biblical Hebrew, with some passages in Biblical Aramaic [..]. The traditional Hebrew text is known as the Masoretic text (www.DBpedia.org/page/Tanakh).


5.3 Data: English and German Bible translations

Acquiring the data for proper comparisons between the translations proved an arduous task, and impossible without proper digital support. Bleemer first collected 14 available Bible translations and divided them into verses, on the basis of the versification in the King James Bible—KJV. This is an intricate process, the full details of which are given in Bleemer (2016, Section 6.3.1). The translations can be grouped into four types; see Bleemer (2016) for full details.

• KJV, GNV. Two Early Modern Protestant translations into English from the 16th and 17th centuries. These Bibles were translated by large teams of British Protestant scholars from the original Hebrew text rather than from prior Latin and Greek translations, as were some earlier renderings. The King James Version (KJV) was partially a revision of the Bishop's Bible, which itself was a partial revision of the Geneva Bible (GNV), so that vocabulary decisions are not wholly independent.
• ASV, NKJV, JPS. Three 20th-century revision translations derived from the KJV (with contemporary emendations arising from comparison with the Hebrew text). The American Standard Version (ASV) and the New King James Version (NKJV) were produced by Protestant organisations, whereas the Jewish Publication Society's version (JPS) was published by a Jewish organisation.
• NWT, NAB, NIV, NET, GNT. Five 20th- and 21st-century American English sui generis translations, independently produced by unconnected organisations. Specifically, these were the New World Translation of the Holy Scriptures (NWT) by the Jehovah's Witnesses, the Catholic New American Bible Revised Edition (NAB, also abbreviated as NABRE), the Protestant New International Version (NIV), the Protestant New English Translation (NET), and the interdenominational Good News Bible (GNT).
• Luth, EU, HFA, GNB. Four German translations: the 16th-century Protestant Luther translation (Luth, also Luther) and three 20th-century translations: the Catholic Einheitsübersetzung (EU), the Protestant Hoffnung für Alle (HFA), and the Gute Nachricht Bibel (GNB), jointly produced by Evangelical Protestants and Catholics.

Each Bible verse in a translation was compared with the same verse in the other translations via Dice's coincidence coefficient, which measures the similarity between two sets as the fraction of elements that are present in both sets. Details can be found in the original publication (Dice, 1945). What is important here is that the measure of similarity was defined in such a way that the higher the number, the more similar the verses, with 1 as its highest value (the translations are identical) and 0 as its lowest (the two translations do not share any words). It should be remarked that certain types of words are excluded from the comparisons, such as function words like articles and prepositions. This is in contrast with authorship studies, where function words are the crucial information; see Chapter 6 on the authorship of the Pauline Epistles, and Chapter 13 on the authorship of The Royal Book of Oz.
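The following small Python sketch (added for illustration; the actual preprocessing in Bleemer (2016), such as the exclusion of function words and the matching of verses, is more involved) shows the Dice coincidence coefficient for two hypothetical parallel renderings of a verse.

```python
def dice(words_a, words_b):
    """Dice coincidence coefficient between two sets of words:
    1 means identical vocabularies, 0 means no shared words."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical simplified renderings of the same verse in two translations.
verse_1 = "in the beginning god created the heaven and the earth".split()
verse_2 = "in the beginning god created the heavens and the earth".split()
print(round(dice(verse_1, verse_2), 2))  # high similarity, but not 1.0
```

Averaging such coefficients over all matched verses of two translations yields one entry of Table 5.1.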

The coincidence coefficients for the verses of two translations were combined to create the average fraction of identical words. These averages are reproduced in Table 5.1, where the translations are listed in alphabetical order, and in Table 5.2, in which that table has been reformatted for better visual interpretation. These tables are similarity matrices with only the lower triangle and diagonal shown, because the upper triangle mirrors the lower one and contains no new information.

A simple perusal of Table 5.1 is not very enlightening, and the alphabetical order of presentation does not give much insight into the relationships between the translations. The main reason is, of course, that the alphabet is unrelated to the similarity between translations. It is a general phenomenon that tables are only insightful if the orders of rows and columns correspond to a relevant aspect of the objects presented; see Section 3.15.1, p. 93 for some rules to improve the presentation of a table, where Table 5.2 is also presented in a slimmed-down version (Table 3.6, p. 97).

Table 5.1 Similarities of Bible translations based on the average fraction of identical words: Unsorted

      ASV   EU    GNT   GNB   GNV   HFA   JPS   KJV   Luth  NAB   NET   NKJV  NIV   NWT
ASV   1.00
EU    0.05  1.00
GNT   0.35  0.04  1.00
GNB   0.04  0.39  0.04  1.00
GNV   0.77  0.05  0.36  0.04  1.00
HFA   0.04  0.36  0.04  0.43  0.04  1.00
JPS   0.89  0.05  0.36  0.04  0.77  0.04  1.00
KJV   0.87  0.05  0.36  0.04  0.84  0.04  0.86  1.00
Luth  0.03  0.22  0.03  0.15  0.04  0.15  0.03  0.04  1.00
NAB   0.53  0.04  0.43  0.04  0.53  0.04  0.55  0.53  0.03  1.00
NET   0.46  0.04  0.45  0.04  0.46  0.04  0.48  0.46  0.03  0.56  1.00
NKJV  0.73  0.04  0.42  0.04  0.70  0.04  0.74  0.76  0.03  0.62  0.56  1.00
NIV   0.54  0.04  0.49  0.04  0.52  0.04  0.54  0.53  0.04  0.59  0.58  0.62  1.00
NWT   0.53  0.04  0.37  0.04  0.47  0.04  0.51  0.49  0.03  0.51  0.48  0.53  0.52  1.00

To get a better insight into the patterns, Table 5.1 has been reformatted on the basis of the size of the similarity coefficients; see Table 5.2, in which coefficients greater than .50 are in bold. Several aspects now stand out much more clearly, such as the fact that there is no real similarity between the English and German translations, which is understandable as they are in different languages. In fact, we may wonder why the coefficients are not zero; the occurrence of names and places is the most likely source of such similarities. Some other conclusions that may be drawn from the table are that the UK and US King James versions are rather more similar to each other than the US translations are to each other. As they intended, the translators of the Good News Bible (GNT) have succeeded in making their translation different from any other English translation.


Table 5.2 Similarities of Bible translations based on the average fraction of identical words: Sorted and formatted

Trans  Year   KJV   GNV   ASV   JPS   NKJV  NAB   NIV   NWT   NET   GNT   EU    HFA   GNB   Luth
KJV    1611   1.00
GNV    1560   0.84  1.00
ASV    1900   0.87  0.77  1.00
JPS    1917   0.86  0.77  0.89  1.00
NKJV   1982   0.76  0.70  0.73  0.74  1.00
NAB    2011   0.53  0.53  0.53  0.55  0.62  1.00
NIV    1996   0.53  0.52  0.54  0.54  0.62  0.59  1.00
NWT    1976   0.49  0.47  0.53  0.51  0.53  0.51  0.52  1.00
NET    2005   0.46  0.46  0.46  0.48  0.56  0.56  0.58  0.48  1.00
GNT    1978   0.36  0.36  0.35  0.36  0.42  0.43  0.49  0.37  0.45  1.00
EU     1980   0.05  0.05  0.05  0.05  0.04  0.04  0.04  0.04  0.04  0.04  1.00
HFA    1996   0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.36  1.00
GNB    1982   0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.39  0.43  1.00
Luth   1534   0.04  0.04  0.03  0.03  0.03  0.03  0.04  0.03  0.03  0.03  0.22  0.15  0.15  1.00

Column groups: UK: KJV, GNV; USA-KJV: ASV, JPS, NKJV; USA-sui generis: NAB, NIV, NWT, NET, GNT; Germany: EU, HFA, GNB, Luth.
Denominations: Anglican: KJV; Protestant: GNV, ASV, NIV, GNT, HFA, Luth; Jewish: JPS; Jehovah's Witnesses: NWT; Catholic: NAB, EU; Evangelical Protestant: NKJV, NET, GNB.

Furthermore, the German translations are not very similar among themselves. It may be that the greater word-level variation in German is the cause of this, as the language contains compound nouns and declensions that make overlap less likely, even after attempts to trace words back to their stems to make them more comparable. These comparative observations could not have been derived from the alphabetically sorted Table 5.1 without hard graft.

5.4 Analysis methods

In this case study we wanted to compare word occurrences in various Bible translations without having a particular response variable in mind. Thus we are dealing here with an internal structure design (see Section 3.7). The analysis of similarity matrices has a long tradition, and the main workhorses are various types of multidimensional scaling (mds) and cluster analysis. Here we will use only one method of each type, but there are many variants of both techniques, often specially geared towards specific analytic goals (see Section 3.9.3, p. 81).


5.4.1 Characteristics of multidimensional scaling and cluster analysis

Similarity and dissimilarity matrices are eminently suitable for both multidimensional scaling and cluster analysis. These techniques supplement each other, as the former produces a spatial representation and the latter a group representation. Thus, we can first make an appropriate multidimensional scaling graph and then draw contours around the points that the cluster analysis has indicated as belonging to the same group. The two techniques do not necessarily use exactly the same information, because the dimensions which account for the maximum variance in the multidimensional scaling solution need not be those that provide the best cluster separation. Fortunately they often do, which makes combined graphing an insightful activity (see also Section 3.9.3).

5.4.2 Multidimensional scaling

Multidimensional scaling aims at a graphical representation of similarities, which are first transformed into dissimilarities. These are subsequently portrayed as Euclidean (straight-line) distances in a preferably two-dimensional graph; higher-dimensional graphs are only used if a proper representation in two dimensions is not adequate. The specific mds method employed here is implemented in spss (Heiser, 1988; Busing, Commandeur, & Heiser, 1997). Note that a very high similarity means a very low dissimilarity and a very short distance. Given the high similarities between the King James Version and the first four translations in Table 5.2, we expect that in the mds graph they will be very close to each other. Furthermore, the German translations will be clearly separated from the English ones, as the similarities between the two languages are uniformly low. Moreover, it is unlikely that the German translations form a tight group, because the similarities among them are not very high, certainly not as high as among the English-language translations.
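The chapter's analyses were run in spss; as a rough equivalent (an assumption on my part, not the procedure used in the book), here is a sketch using scikit-learn's metric MDS on a subset of the similarities of Table 5.2, converted to dissimilarities.

```python
import numpy as np
from sklearn.manifold import MDS

labels = ["KJV", "GNV", "ASV", "JPS", "NKJV"]          # subset of Table 5.2
S = np.array([[1.00, 0.84, 0.87, 0.86, 0.76],
              [0.84, 1.00, 0.77, 0.77, 0.70],
              [0.87, 0.77, 1.00, 0.89, 0.73],
              [0.86, 0.77, 0.89, 1.00, 0.74],
              [0.76, 0.70, 0.73, 0.74, 1.00]])
D = 1.0 - S                      # high similarity -> low dissimilarity

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
for name, (x, y) in zip(labels, coords):
    print(f"{name:5s} {x:7.3f} {y:7.3f}")   # coordinates to plot as points
```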

5.4.3 Cluster analysis

Cluster analysis attempts to group the items (here, translations) based on their similarities. Because cluster analysis uses the similarity information differently from multidimensional scaling, the produced grouping may not be clearly visible in the low-dimensional space derived by the multidimensional scaling; cluster analysis uses information contained in all dimensions. Nevertheless, the grouping found can often enhance the scaling results, as is the case in the present study. Here we used centroid-linkage cluster analysis, which is fairly generally applicable (see Section 3.9.3, p. 81, and Everitt et al. (2011) and Šulc and Řezanková (2019)).
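As an illustration (a sketch, not the book's own run), centroid linkage is available in scipy; note that scipy's centroid method formally presumes Euclidean distances, so applying it to 1 − similarity values, as below, is an approximation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

labels = ["KJV", "GNV", "ASV", "EU", "HFA"]            # subset of Table 5.2
S = np.array([[1.00, 0.84, 0.87, 0.05, 0.04],
              [0.84, 1.00, 0.77, 0.05, 0.04],
              [0.87, 0.77, 1.00, 0.05, 0.04],
              [0.05, 0.05, 0.05, 1.00, 0.36],
              [0.04, 0.04, 0.04, 0.36, 1.00]])
condensed = squareform(1.0 - S, checks=False)          # condensed dissimilarities

Z = linkage(condensed, method="centroid")              # centroid-linkage tree
groups = fcluster(Z, t=2, criterion="maxclust")        # cut into two clusters
print(dict(zip(labels, groups)))                       # English vs German group
```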

5.5 Bible translations: Statistical analysis

5.5.1 Multidimensional scaling

Figure 5.1 shows the outcome of the multidimensional scaling and clustering procedures. The two-dimensional graph shows the patterns which we discussed after rearranging the table of similarities (Table 5.2). This is not surprising, because we actually rearranged the table using the multidimensional scaling solution. More surprising are the stretched-out positions of the American translations and of the German ones. The ellipses indicate the subgroups produced by the cluster procedure (see below). Note, by the way, that the coordinate axes (dimensions) themselves need not be interpreted, but only the positions of the translations with respect to each other.

Fig. 5.1 Multidimensional scaling and cluster analysis of the Bible translations. The space is based on the similarities in Table 5.2; the clusters are based on the visual clustering in the table.


5.5.2 Cluster analysis

Centroid-linkage cluster analyses were carried out, and the six-cluster solution was chosen for presentation here; it is drawn in the space constructed by the multidimensional scaling of the similarity matrix. Figure 5.2 shows that the German translations are rather different, both from the English ones and among themselves, as could be expected from their relatively low similarities. The Good News Bible (GNT) is clearly distinct from the other English translations, in accordance with the intentions of the translators. The translations derived from the King James Version (KJV) group together, as do the four American translations (NIV, NABRE, NET, and the Jehovah's Witnesses' NWT).

Continuing the analysis with fewer clusters by merging the clusters already found, the EU translation joins the German group, and almost at the same time the British and American translations merge into one group. Finally, GNT joins the English-language translations and Luther the German ones. Then it takes a very long time before all translations amalgamate into one single cluster. The moment at which translations or translation groups merge is regulated by the similarities, so that the progression of 'time' is actually an expression of decreasing similarity.

5.6 Other approaches to analysing similarities

Similarity and dissimilarity matrices appear in many disciplines, including the humanities and the social and behavioural sciences, and many of them have been analysed with methods similar to those used in this chapter. There is a wealth of applied, technical, and statistical books, papers, and computer programs on multidimensional scaling and cluster analysis. A reasonable starting point for studying multidimensional scaling is Borg et al. (2013), and for cluster analysis Everitt et al. (2011) may be fruitfully consulted. Several general introductory multivariate statistics books have chapters on multidimensional scaling and cluster analysis (e.g. Field, 2017; Tabachnick & Fidell, 2013).

5.7 Content summary

What probably makes the character of the Bible translation dataset unique is that the translations were all made from the same original text, the Masoretic text of the Hebrew Bible (or Tanakh), dating back in its final form to roughly the 9th century CE.³ This common starting point, plus the digitisation of the text and computer-assisted text analyses, made it possible to compute the similarities used in the analysis.

³ Wikipedia: https://en.wikipedia.org/wiki/Masoretic_Text.


Fig. 5.2 Multidimensional scaling and centroid clustering of Bible translations. Space enhanced by a six-cluster solution; three clusters consist of a single translation. P = Protestant, C = Catholic, JW = Jehovah's Witnesses.

The differences between the translations come to the fore by applying standard multivariate analysis methods to the similarities. Some of the results are obvious, such as the lack of similarity between the English and German translations. It is interesting, however, that the German translations show less similarity among themselves than the English ones do. Moreover, some other results, such as the clear differences between modern translations made by different Christian denominations, or with different groups of readers in mind (for example, the Jehovah's Witnesses' New World Translation of the Holy Scriptures versus the Protestant Good News Bible), were less obvious beforehand.

The analyses presented here are only the start of a project which examines changes in aesthetic categories over time and between denominational groups, as can be derived from different translations of the same text (see the Background for this study, p. 133). Readers should turn to the work by Bleemer (2016) for such evaluations and interpretations.

Chapter 6

Stylometry: Authorship of the Pauline Epistles

Abstract Main topic. When the authorship of a text is in doubt, statistics can assist in shedding light on the question. Data. Frequently occurring words in the Pauline Epistles form the basis for an investigation into the authorship of the Epistles. Research questions. Did Paul write all the Epistles assumed to be his? Statistical techniques. Correspondence analysis, proportions, biplot. Keywords Pauline epistles · Pauline letters · Bible studies · New Testament · Authorship · Stylometry · Word frequencies · Function words · Occurrence proportions · Correspondence analysis

6.1 Background

Authorship questions arise regularly, and over the centuries they have held the interest of experts and laymen alike. Did Paul himself write all the Epistles ascribed to him? Were all Shakespeare's poems and plays really written by Shakespeare? Was 'The Heart of a Soldier: as Revealed in the Intimate Letters of Genl. George E. Pickett, C.S.A.' really written by General Pickett or by his wife? Did Elena Ferrante really write the Neapolitan novels, or were they authored by Domenico Starnone under a pseudonym? For technical and full discussions of the issue see Mosteller and Wallace (1984), Tuzzi and Cortelazzo (2018), and Morton (1978).

Stylometrics

Linguists have struggled with such questions for a long time and continue to do so, but in modern times statisticians have come to assist them in trying to shed light on these issues. In the field of linguistics these statisticians are generally referred to as stylometricians.


Stylometry is defined as the statistical analysis of literary style. However, as outlined by Holmes and Kardos (2003), stylometricians do not seek to overturn the traditional scholarship of literary experts and historians, but rather want to complement their work by providing an alternative means of investigating works of doubtful provenance. Holmes and Kardos's (2003) attractive introduction to this statistical endeavour goes by the title Who was the author? An introduction to stylometry.

Authorship of the New Testament

One of the fascinating puzzles for theologians and linguists alike is who did or did not write various parts of the New Testament. For the four canonical gospels it seems obvious, as they are known under the names of their authors: Matthew, Mark, Luke, and John. It might be assumed that the Pauline Epistles were written by Paul; after all, they have his name attached to them. However, life is not that simple. The book of Acts in the New Testament has no explicit author, and neither does Revelation. Moreover, many parts of the New Testament were only written many years after the events described took place; in addition, the earliest existing versions of these documents are often copies, dating decades and even centuries after the first versions were made. Many scholars have attempted to establish the authorship of various parts of the New Testament on the basis of theological, historical, and linguistic arguments.

In authorship studies a distinction is made between a closed set of authors, in which texts could only have been written by one of a small number of plausible candidates (see Oakes, 2014, p. 1), and an open set, where the author can be one of a set of plausible candidates and/or someone altogether unknown. In the case of the Pauline Epistles there is clearly an open set of possible authors, as there is no specific collection of candidates for them. The case study in this chapter belongs to the latter category, whereas in Chapter 13, concerning the authorship of The Royal Book of Oz, the former is discussed.

Oakes (2014, p. 176) considers that Morton's (1965) pioneering work on the letters of Paul forms the basis of modern computational stylometry. Morton (1978) expanded on his early work on Paul's epistles by publishing a book-length treatise under the title Literary detection: How to prove authorship and fraud in literature and documents. Before Oakes, Greenwood (1993, p. 217) stated that Morton proposed the then '[...] unconventional hypothesis that the frequencies of common words, sentence lengths and other simple parameters offer a prescription for determining patterns of authorship [...]'. The title of Oakes's (2014) book mirrors Morton's: Literary detective work on the computer. It is not only an introduction and a textbook-like monograph on the topic, but also a source for many studies done in the field; his Chapter 4 contains a stylometric analysis of religious texts, including an analysis of the Pauline Epistles. Libby (2016) also provides an overview of the stylometry literature and contains an extensive, up-to-date reference list.


Paul and his Epistles

The apostle Paul, who was fluent in Hebrew and Greek, lived roughly from about 5 or 10 CE until his execution by the Romans in 67 CE, having been twice imprisoned by them. He was born in Tarsus, a large Mediterranean port in the south-east of modern-day Turkey.

The Pauline Epistles collected in a 3rd-century codex (P[apyrus]46) were written in Greek and found in Egypt. A codex is essentially an ancient book, consisting of one or more quires (sheets of papyrus or parchment folded together to form a group of leaves, or pages). At present the pages of P46 reside partly in the University of Michigan Papyrus Collection, and partly in the Chester Beatty Collection in Dublin, Ireland; for further information see the website of the University of Michigan.¹ Although P46 was copied more than a century after Paul originally wrote his Epistles, this codex is nevertheless the closest that modern scholars have been able to get to Paul's original words. The codex is the oldest surviving manuscript containing the Pauline Epistles in the order given in Table 6.1; 86 of the 104 leaves of the codex survive today.² The order of the Epistles in P46 is different from that in the modern text, but reflects the Egyptian views at the time, emphasising the importance of the Epistles to the Jews. The authenticity of Hebrews is debated to this day; at the time when P46 was written it was generally rejected in the West, but in Egypt and the East it was considered a genuine epistle by Paul.

Table 6.1 Epistles in the Codex P46

Pages     Contents           Pages     Contents
1-41      Romans             168-176   Philippians
41-64     Hebrews            176-184   Colossians
64-117    1 Corinthians      184-191   1 Thessalonians
118-145   2 Corinthians      191-195   2 Thessalonians
146-158   Ephesians          195-205   Uncertain
158-168   Galatians

Source: University of Michigan (https://www.lib.umich.edu/reading/Paul/contents.html)

Some of the Pauline Epistles, the so-called Missionary Epistles, are addressed to the citizens of various cities in present-day Greece and Turkey. In particular, they are addressed to the Thessalonians—1 (50 CE), Galatians (53 CE), Corinthians—1 (53-54 CE), Romans (57 CE), Philippians (55 CE), Corinthians—2 (55-56 CE), Colossians, Thessalonians—2, and Ephesians; Hebrews was possibly addressed to the Jews in Jerusalem. The dates given for several Epistles are approximate. The Pastoral Epistles are addressed to individuals: Philemon (55 CE), 1 Timothy, 2 Timothy, and Titus.

¹ https://www.lib.umich.edu/reading/Paul/index.html.
² The information in this chapter is partly based on the Wikipedia entry about Paul, accessed 15 June 2020, https://en.wikipedia.org/wiki/Paul_the_Apostle, as well as on Libby (2016) and the University of Michigan website already mentioned.

6.2 Research questions: Authorship

The central but limited question in this chapter is what can be said about the authorship of the Pauline Epistles purely on the basis of the occurrence of function words. Function words may be defined as words whose purpose is to contribute to the syntax rather than the meaning of a sentence. Strictly speaking, the results will only give an indication of texts with a similar word usage or style. The core of the argument for using function words is that authors do not consciously employ such words in their work, so that a great similarity in their usage can be an indication of (unconscious) personal style. For instance, Oakes (2014, p. 9) cites Grieve (2007, p. 260): '[a] common assumption of authorship attribution thus appears to be true; function words are better indicators of authorship than content words'. At the same time, similarity of texts might also be explained in terms of genre, as sentence structure can be influenced by the subject matter of a text rather than by an author's idiosyncrasies; see Libby (2016) and Oakes (2014, Chapter 1) for discussions of this aspect and for references to this debate. Clearly, settling the authorship or genre question requires historical, theological, and linguistic arguments. However, in cases where it is known which authors have written which texts, statistical experiments can be carried out to find which language characteristics provide the best discrimination between the authors.

Over the years many authors have made statements about the authenticity of the various Epistles. These different views have been collected by Libby (2016) and are summarised in his Table 2, along with the authors' opinions about other books in the Greek New Testament. Libby's table, as far as it concerns the Pauline Epistles, is reproduced here in rearranged form as Table 6.2. From this table we see that there is general agreement that the missionary Epistles Romans, Galatians, 1 Corinthians, and 2 Corinthians were actually written by Paul; they are also called the Hauptbriefe, i.e. the principal Epistles. About all the others there are divergent views, with varying support for the authorship of Paul. Both attributions and dates are sources of controversy.

The main aim of this chapter is to show how linguistic information can be analysed to shed light on one linguistic aspect of the Pauline Epistles: the function-word usage in the various Epistles. Similarity or dissimilarity can then be used as an argument in the discussion of the authorship of a group of texts, or, alternatively, of the common genre of a number of texts.

The long-standing debate about the authorship of the Pauline Epistles was initially limited to theologians, historians, and linguists, and primarily based on content-related, theological, and historical arguments. This changed by the time authorship questions caught the interest of linguistically inclined statisticians; see, for instance, Mosteller and Wallace (1984).

Table 6.2 Attributions of Pauline Epistles

Theory:                  Original  Reduced  Modern    Modern    Modern    Traditional
                         Baur      Baur     Critical  Critical  Critical
No. of authors:          21        19       18        17        15        9
Total assigned to Paul:  4         5        7         7         8         13

Under every theory the missionary Epistles Romans, 1 Corinthians, 2 Corinthians, and Galatians are attributed to Paul. Philippians, 1 Thessalonians, Colossians, Ephesians, Hebrews, and 2 Thessalonians are attributed either to Paul or to separate unknown writers (e.g. a 'Colossians Writer'), depending on the theory. The Pastoral Epistles (1 Timothy, 2 Timothy, Titus, Philemon) are attributed to Paul, to a 'Pastoral Paul' or 'Testament Paul', or to separate writers, again depending on the theory.

Notes: Epistles in italics are included in the analyses in this chapter. W = writer = unknown author of particular Epistles; Baur = F.C. Baur (1845). Source: Libby (2016)

An overview of, and references to, the Greek New Testament may, for instance, be found in Oakes (2014) and Libby (2016). Note that in his paper Libby indicates that “Of the over 1,000 books, articles, monographs, and reviews of computational stylistics since the late 19th century, only about 20 multivariate studies have been executed upon the GNT” (p. 189).³ Like several other authors, Libby did not consider the Pauline Epistles in isolation but investigated them in the context of the entire Greek New Testament. He and many of his contemporaries and predecessors approached the Epistles from a combined qualitative-linguistic and quantitative point of view. In fact, he formulated four questions about the Pauline Canon with the aim of contributing to the debate using statistical procedures (Libby, 2016, pp. 123-124; paraphrased here).

1. Authorship or Genre? Qualitatively, do the texts of the Pauline Epistles cluster more by authorship or by genre? Is there a way to quantify this?
2. Sensitivity Clusters? Qualitatively, are the groups different when explored by word, word group, clauses, and clause complexes, respectively?⁴
3. Authorship or Genre? Quantitatively, how well do statistically derived groups based on language support the historic authorship and/or genre theories?
4. Best Fit Theory and Data? Quantitatively, given the answers to 1-3 above, which single authorship or genre theory best fits the language data of the Pauline Epistles?

These four questions show that it is a marriage of substantive arguments and numeric analysis which should shed light on the authorship issue. It is not my intention to discuss all Libby's points in this chapter; rather, I want to emphasise that statistical analysis, in particular multivariate analysis, may be called upon to point to solutions in the debate on disputed authorship. The numeric treatment of the questions is based on the fact that characteristics of texts can be quantified by counting elements such as specific words, themes, and stylistic elements. The values of a number of relevant characteristics scored for each Epistle form a database which can be used to examine in detail the similarities between the Epistles, using multivariate statistical techniques.

³ GNT = Greek New Testament.
⁴ A clause is strictly speaking not the same as a sentence. For this enumeration the difference is not essential.

6.3 Data: Word frequencies in Pauline Epistles

The data for this case study were extracted from a paper by Morton (1965).⁵ I will provide an analysis of the frequencies of certain common function words, as they appear in each sentence of each of the Epistles. Among stylometricians there is an ongoing debate on the question of which words should be used for authorship studies. Morton chose a number of high-frequency common function words. His choice is post hoc supported by Oakes (2014, p. 50): '[...] differences between authors are obscured by differences in topic. However, topic is best determined by examining mid-frequency words in the texts, while differences in authors are best found with high-frequency words which have grammatical functions but no bearing on the topic of the text'.

The data are here presented in a cross-tabulation of Pauline Epistles as rows and 'words times occurrence per sentence' as columns; we will refer to the column variable as Word categories (see Table 6.3). The table is based on Morton's (1965) Tables 40B, 41, and 42. The rows are the 11 longest Epistles; 2 Thessalonians (822 words), Titus (657 words), and Philemon (336 words) were excluded by Morton because they were too short to afford reliable samples (p. 217). The columns represent how often certain common function words appear in the same sentence, i.e. in how many sentences they occur 0, 1, or 2 times. The common words included were και (kai—and), εν (en—in), αυτος (autos—he/himself), ειναι (einai—to be), and δε (de—but, and). As these words rarely occurred more than two or three times in the same sentence, I condensed Morton's tables somewhat, ending up with a contingency table (Table 6.3) of 11 rows (Epistles) by 12 columns (Word categories). The entries in the table indicate how often these words occur in the sentences of each Epistle.

⁵ It is not entirely clear what text edition Morton used and which decisions he made when counting words. For instance, I do not know whether conjugations were included in the word counts.

Thus, the first three numbers in the table indicate that Romans consists of 385 sentences in which the word kai does not occur, 145 in which it occurs once, and 51 in which it occurs more than once. Note that the total number of sentences in Romans is 385 + 145 + 51 = 581, and because there are five words the marginal total is 5 × 581 = 2905. The other Epistles have the same structure, although it seems that there were occasional incorrect counts in Morton's tables.

Table 6.3 The Pauline Epistles: Frequency of occurrence of common function words. Source: Morton (1965)

                  kai              en               autos        einai        de                  Number of
Epistle           0     1    2+    0     1    2+    0     1+     0     1+     0     1+    Total   sentences
Romans            385   145  51    449   104  28    472   109    486   95     540   41    2905    581
1 Corinthians     425   152  51    504   90   34    562   66     492   136    580   48    3140    628
2 Corinthians     198   93   43    232   71   31    288   46     277   57     316   18    1670    334
Galatians         128   42   11    147   26   8     160   21     139   42     169   12    905     181
Ephesians         33    30   37    46    25   29    65    35     60    40     93    7     500     100
Philippians       54a   27   31    57    30   15    81    21     87    15     92    10    520     102
Colossians        24    31   26    33    24   24    49    32     57    24     81b   2     407     81c
1 Thessalonians   34    22   25    47    23   11    61    20     69    12     81    0     405     81
1 Timothy         49    39   19    78    20   9     101   6      76    31     101   6     535     107
2 Timothy         45d   28   15    61    21   7     76    13     72    17     83    6     444     89
Hebrews           158   96   61    263   41   11    213   102    271   44     299e  21    1580    315
Total             1533  705  370   1917  475  207   2128  471    2086  512    2435  171   13010   2602

Notes: Column headers indicate whether a word occurred 0, 1, or 2 times in a sentence; 1+ and 2+ indicate 1 or more and 2 or more times, respectively. For each Epistle, the total number of sentences is the same for each word: for example, for kai 385+145+51 = 581 sentences, and also for en 449+104+28 = 581 sentences, etc. There seem to be five errors in Morton's (1965) data:
a The number of sentences for kai for Philippians is listed as 112, rather than 102
b The total number for de in Colossians should be 81, but is listed in Morton's Table 42 as 83
c The total number of sentences is listed as 83 but should be 81
d 2 Timothy is one observation short for kai; the total number of sentences listed for kai in the table is 88, but there are 89 sentences
e The total number of sentences in Hebrews is 315, but for de it is listed as 320.
For the analyses I have left the numbers as reported in Morton, as I did not always know where the incorrect counting occurred.

6.4 Analysis methods

The data in this chapter (Table 6.3) form a contingency table, a cross-classification of two categorical variables: Epistles (rows) versus Word categories (columns). The content of the table consists of the number of times a particular word occurs in


the sentences of a particular Epistle. Contingency tables are often tested for the independence of the row and column variable, but in large tables such as this one independence is seldom the case, and the interaction between the row and column variable is the prime focus of the study. For more technical details see Chapter 4, p. 108 and Chapter 4, p. 128.

6.4.1 Choice of analysis method

The two variables (Epistles and Word categories) form a contingency table without a specific dependence structure, so that we have an internal structure design in which we are interested both in the two variables and in their interaction (see Section 3.10, p. 84); hence, correspondence analysis (see Section 3.10.1, p. 84) is the preferred analysis technique. Correspondence analysis was especially designed to graphically represent interactions. An excellent first introduction to the statistical method is Greenacre (2017); this book also contains many references for further study. Erwin and Oakes (2012) discuss and demonstrate the use of correspondence analysis specifically for New Testament studies. If we had more categorical variables, each with relatively few categories, loglinear modelling (see Section 3.8.5, p. 77) could have been an appropriate alternative, but with the large number of categories per variable this would still have left us with too large an interaction table to investigate.

6.4.2 Using correspondence analysis

In our contingency table of Epistles × Word categories we want to get an overview of their interaction. This can be realised by first constructing Epistle profiles and Word category profiles (see Chapter 4, p. 108).⁶ These profiles can be plotted in three graphs: one for the Epistle profiles, one for the Word category profiles, and a biplot with the Epistle profiles, the Word category profiles, and their interaction. For the biplot of Epistle and Word profiles (Figure 6.2) we use a display with symmetrical coordinates, which means that we concentrate on the relationships between the Epistle and the Word category profiles via their inner products. These will be explained and demonstrated informally below (see also p. 113).

⁶ A profile consists of proportions, and it is either a row across all columns, or a column across all rows. The number of times that a particular word occurs in an epistle, say twice, is divided by the total number of times it occurs twice in all epistles; all such proportions form a word profile. Analogously, the number of times a word occurs twice in the sentences of Romans is divided by the total number of times that the word occurs in Romans; all such proportions form the Romans profile, or in general an epistle profile. For further explanations see Section 3.10.1, p. 84.
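To show the mechanics, here is a minimal correspondence-analysis sketch in Python (added for illustration; the book's figures were not produced this way). It uses only the kai counts of three Epistles from Table 6.3, computes standardised residuals, and obtains inertias and row principal coordinates from their singular value decomposition.

```python
import numpy as np

# kai 0/1/2+ counts for three Epistles (subset of Table 6.3).
N = np.array([[385, 145, 51],     # Romans
              [425, 152, 51],     # 1 Corinthians
              [158,  96, 61]],    # Hebrews
             dtype=float)
labels = ["Romans", "1 Corinthians", "Hebrews"]

P = N / N.sum()                                  # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)              # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

inertia = sv ** 2                                # principal inertias
print(np.round(inertia / inertia.sum(), 2))      # share per dimension

F = (U / np.sqrt(r)[:, None]) * sv               # row principal coordinates
for label, xy in zip(labels, F[:, :2]):
    print(label, np.round(xy, 3))
```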


6.5 The Pauline Epistles: Statistical analysis

Before starting with multivariate analysis, it is often wise to begin with something simpler. In particular, it is generally a good idea to simply inspect the data at hand directly. Depending on the size of the data, this can be done using the dataset itself, or alternatively some adequately chosen summaries; see Chapter 2, Data inspection.

6.5.1 Inspecting Epistle profiles

Patterns in large contingency tables are generally difficult to see, especially because the totals for the rows and those for the columns are not necessarily equal. However, some patterns are easier to discern than others. If we look at the first two rows of Table 6.3, we see that the marginal totals for Romans and 1 Corinthians (2905 and 3140, respectively) are almost equal, which makes it feasible to compare their profiles. For these two Epistles it looks as if the profiles are rather similar: for instance, the first two columns have values 385 versus 425 and 145 versus 152, respectively. The profile of 2 Corinthians also looks rather similar to the other two, even though its marginal total (1670) is only about half that of the other two; most of its frequencies are also about half those of the first two rows. The marginal total of Galatians (905) is about a third of that of the first two rows, but still the pattern is rather similar. We may conclude that the use of the common function words is very similar in all four canonical Epistles. This fits with the uniform judgements seen in Table 6.2, i.e. all four Epistles are attributed to the same person, identified as Paul.

There is an easier way to see the similarity of the patterns, namely by examining proportions rather than the actual counts. The standard way to create proportions is by dividing each count by its marginal total within a row. However, in this case the frequencies within a row sum, for each of the five words, to the number of sentences, so that the marginal totals are equal to five times the number of sentences. To get proportions per word, the numbers in Table 6.3 therefore need to be divided by the number of sentences rather than by the marginal totals; this was done in the construction of Table 6.4.⁷ The patterns mentioned above can now be seen much more easily. Furthermore, the organisation of the table has been improved for easier inspection, by rearranging the rows somewhat and adding extra horizontal spacing. This is an example of how reorganising a table can considerably increase its interpretability.

⁷ We could consider this table as consisting of five separate tables, one for each word, placed next to each other.
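In code, the construction of Table 6.4 amounts to dividing each word's counts by the epistle's number of sentences; a minimal sketch (added here) using the kai counts of two Epistles:

```python
# kai counts (0, 1, 2+ occurrences per sentence) from Table 6.3.
kai = {"Romans": (385, 145, 51), "1 Corinthians": (425, 152, 51)}

for epistle, counts in kai.items():
    n_sentences = sum(counts)          # the three categories exhaust all sentences
    profile = [round(x / n_sentences, 2) for x in counts]
    print(epistle, profile)            # Romans [0.66, 0.25, 0.09], etc.
```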


Some further insights about the Epistles can be derived from this table. Not surprisingly, sentences in which the same word occurs once, or twice and more, are much rarer than sentences in which the word does not occur at all. This is true for all Epistles and Word categories investigated, except for kai: especially in the epistles to the Ephesians and Colossians, the proportions for 0, 1, and 2+ are almost equal. Obviously this indicates that the sentences in these two epistles are longer than in the other ones, as kai (= and) is used to concatenate main clauses.

Table 6.4 The Pauline Epistles: Proportions of occurrences of common words. Source: Morton (1965)

                  kai                en                 autos         einai         de
Epistle           0     1     2+     0     1     2+     0     1+      0     1+      0     1+     Total
Galatians         0.71  0.23  0.06   0.81  0.14  0.04   0.88  0.12    0.77  0.23    0.93  0.07   1.00
1 Corinthians     0.68  0.24  0.08   0.80  0.14  0.05   0.89  0.11    0.78  0.22    0.92  0.08   1.00
Romans            0.66  0.25  0.09   0.77  0.18  0.05   0.81  0.19    0.84  0.16    0.93  0.07   1.00
2 Corinthians     0.59  0.28  0.13   0.69  0.21  0.09   0.86  0.14    0.83  0.17    0.95  0.05   1.00

Philippians       0.53  0.27  0.30   0.56  0.29  0.15   0.79  0.21    0.85  0.15    0.90  0.10   1.00
1 Thessalonians   0.42  0.27  0.31   0.58  0.28  0.14   0.75  0.25    0.85  0.15    1.00  0.00   1.00

Ephesians         0.33  0.30  0.37   0.46  0.25  0.29   0.65  0.35    0.60  0.40    0.93  0.07   1.00
Colossians        0.30  0.38  0.32   0.41  0.30  0.30   0.60  0.40    0.70  0.30    1.00  0.02   1.00

1 Timothy         0.46  0.36  0.18   0.73  0.19  0.08   0.94  0.06    0.71  0.29    0.94  0.06   1.00
2 Timothy         0.51  0.31  0.17   0.69  0.24  0.08   0.85  0.15    0.81  0.19    0.93  0.07   1.00
Hebrews           0.50  0.30  0.19   0.83  0.13  0.03   0.68  0.32    0.86  0.14    0.95  0.07   1.00
Marginal profiles 0.59  0.27  0.14   0.74  0.18  0.08   0.82  0.18    0.80  0.20    0.94  0.07   1.00

Notes: Column headers indicate whether a word occurs 0, 1, or 2 times in a sentence; 1+ and 2+ indicate 1 or more and 2 or more times, respectively. The entries in the table are the row profiles for each Epistle: the five sets of numbers per row, one for each word, contain the proportions of occurrences per sentence.

With the proportions, the similarity between the first four Epistle profiles now really stands out. The near-equality of the profiles of the Ephesians and Colossians is also particularly noticeable. Notwithstanding the easier interpretation of the table with proportions, it still remains difficult to obtain an overview of all relationships between the Epistles and to evaluate where exactly the larger and smaller differences between the different Epistles lie.

6.5.2 Inertia and dimensional fit

The total variability in a contingency table can be measured by the total inertia, which is based on the Pearson X² statistic.⁸ This statistic is, slightly inaccurately, often referred to as the chi-squared statistic (see Chapter 4, p. 128). The total inertia can be partitioned into portions for each dimension derived via the correspondence analysis. The (rounded) values of the inertia for the first five Pauline Epistle dimensions are 69%, 17%, 6%, 4%, and 2%, respectively. Thus, the first two dimensions account for 87% of the total inertia and give a good representation of the profiles in the table. In other words, these two dimensions can be interpreted without fear of serious misrepresentation. The implicit assumption is that the higher-order dimensions are a mixture of idiosyncrasies of some Epistles, nonsystematic variability, and naturally occurring irregularities or inaccuracies in writing. For example, the third dimension (not shown) indicates primarily that einai1+ and en1+ are a bit further apart than one would conclude from the plot of the first two dimensions.

⁸ X² = the sum over all cells of the squared standardised difference between the observed frequency and the expected frequency if there were no relationship between words and epistles. In Table 6.4 such a relationship does exist.

6.5.3 Plotting the results

A more detailed overview of the Epistle profiles in the table can be acquired by graphing the profiles in a plot. In addition, the relationships between the Epistle profiles and Word category profiles can be evaluated via a biplot. On the basis of the size of the inertias of the first two dimensions (87% of the total inertia), we can restrict ourselves to just two dimensions. The Epistle plot is shown in principal coordinates in Figure 6.1. To evaluate the words and the epistles together, we depict the Epistles and the Word categories in symmetrical coordinates (Figure 6.2); see Chapter 4, p. 106 for technical details on coordinates. These coordinates provide a good overview of the relationships between the rows and columns via the inner products (for technical details see Chapter 4, p. 114).

6.5.4 Plotting the Epistle profiles

Figure 6.1 shows the profile plot of the Epistles in principal coordinates, so that we may use straight-line distances to examine the distances between the Epistles as we would on an ordinary map (Chapter 4, p. 113). We have connected the four unchallenged Pauline missionary Epistles (see Table 6.2), as well as Ephesians and Colossians, which are almost inseparable according to Libby's analyses (2016, Section 9.4). Interestingly, of Libby's author groups (Table 6.2) only the Traditional authors concur with this coupling. The close proximity of 1 Thessalonians and Philippians supports the argument among the Critical authors that they were written by the same person. Whether this was indeed Paul is harder to confirm from Figure 6.1 if one considers their location with respect to the four core missionary Pauline Epistles. A clear statement about 1 Timothy and 2 Timothy is difficult to make on the basis of Morton's data: they are not really close in the figure, but there is no clear consensus about them among the author groups in Table 6.2 either.


Fig. 6.1 Correspondence analysis of Pauline Epistles. Similarity of Epistle profiles—rows.

In conclusion, overall the Epistles can be grouped via the stylistic characteristic of word usage, but this grouping is consistent with several, though not all, scholarly opinions about their authorship. For example, some authors are of the opinion that Ephesians is an extension of Colossians, but most of them do not agree that they were written by the same person, let alone by Paul. The consensus that Hebrews is not by Paul is supported by its solitary location away from the other Epistles. However, an interesting statement was apparently made around 200 CE by Clement of Alexandria, as mentioned by Eusebius Pamphilius; see also Hughes (1977, pp. 19-22). Clement wrote that Hebrews was written by Paul, but in Hebrew, and was later translated into Greek by Luke,⁹ and therefore shows signs of Luke's own writing. In a way it seems to make sense for someone bilingual in Greek and Hebrew to write to Jews in Hebrew rather than in Greek. Of course, the unlikely discovery of an early Hebrew version would solve the issue. However, the authorship of Hebrews will not be further examined here.

In all these discussions it should be realised that Figure 6.1 contains only information about similarity in word usage, and thus is technically not about authorship. It is necessary to actually read the Epistles and related information from contemporaries to establish a link between word usage and authorship attributions, although on lexical grounds authorship is probably easier to exclude than to include.

⁹ http://www.ccel.org/ccel/schaff/npnf201.iii.xi.xiv.html.


6.5.5 Epistles and Word categories: Biplot

Figure 6.1 provides information about the similarity between the Epistles, but it does not tell us what the basis for such similarities is. In Figure 6.2 both elements are depicted together in a symmetric biplot (see Chapter 4, p. 106). This representation means that for examining the relationships between the Word categories and the Epistles we can no longer use straight-line distances, but have to use inner products (see Chapter 4, p. 113).

To facilitate the characterisation of the Epistles in terms of word usage, we may draw lines or biplot axes through the Epistle points in the graph. These lines go through the origin and the Epistle profile points. In correspondence-analysis biplots the origin represents both the column and the row margins, so that points close to the origin have profiles which closely resemble the marginal or average profiles, and points far away from the origin have patterns that deviate from the marginal profile. Note that the last row of Table 6.4 is the marginal row profile with which all the other row profiles in the table are compared (see Figure 6.2). If an epistle is very similar to the marginal profile, it lies next to the origin in the biplot and is virtually equal to the average profile. If an epistle lies at the border of the plot, its profile deviates strongly from the average profile, i.e. in comparison with the average profile some words occur much more frequently and others rarely. This is illustrated in Figure 6.3 for Colossians and in Figure 6.4 for Hebrews.

The horizontal axis in the biplot (Figure 6.2) is roughly the biplot axis through the centre of the four undisputed missionary Pauline Epistles, with at the opposite end the Epistles to the Colossians and the Ephesians, so that the coordinates of the word categories on this axis correspond to the relative frequencies of the word categories in these Epistles. They show that frequent occurrences of words such as en2+, kai2+, and autos1+ distinguish Colossians and Ephesians from the other missionary Epistles, in which these words are used sparingly; furthermore, kai0, en0, and autos0 are all the way to the left. This is even clearer from Figure 6.3, which shows the rank order of the projections of the word categories on the Colossians biplot axis (horizontal axis) against their relative frequencies. The Hebrews biplot axis is perpendicular to that of Colossians, and the Word categories in Hebrews have completely different relative frequencies, as is indicated below the figures. Figure 6.2 shows that almost all the categories for 'no occurrence' of a word in a sentence are located on the left, whereas the categories for more frequent word occurrences are located on the right. In addition, sentence length (Morton, 1965, Table 39B) correlates very strongly (.93) with the values of the Epistles on the first dimension. From Morton's point of view these observations would suggest that a common authorship for the four core missionary Epistles is a reasonable assumption, and that Colossians and Ephesians were each written by someone else, or both by the same other person.


Fig. 6.2 Correspondence-analysis biplot: Joint representation of Epistle and Word category profiles. The connected Epistles are the Hauptbriefe, and Colossians and Ephesians, respectively. The dashed lines are the biplot axes for Colossians and Hebrews.

6.5.6 Methodological summary

Correspondence analysis makes it possible to find the underlying dimensions simultaneously for row and column profiles. These dimensions can be displayed separately in profile plots based on principal coordinates for the row and column profiles. Row and column profiles may be analysed together via biplots using symmetrical coordinates. With respect to interpretation, it should be reiterated that it is the similarities and dissimilarities which are under study; solid substantive information is necessary to make the transition to a substantive interpretation.

In the discussion of Morton's (1965) original data it was mentioned that Morton did not include 2 Thessalonians, Titus, and Philemon, due to the limited number of words in these Epistles. However, they can be portrayed in the biplot, because correspondence analysis can be extended by including these epistles as supplementary points (further discussed in Chapter 4, p. 127). Such points are added to the dataset after the main analysis has been carried out, and they are placed into the graph on the basis of their observed profiles in the contingency table. Thus, they do not influence the analysis itself, but their locations in the biplot are calculated on the basis of their

6.6 Other approaches to authorship studies

157

profiles. By treating the three short Epistles as supplementary points their similarity with the other Epistles could have been evaluated. Other multivariate analysis techniques applied to the Pauline Epistles are mentioned by Erwin and Oakes (2012). In particular, they list the use of the Mahalanobis distances and discriminant analysis by Neumann (1990, p. 191), principal component analysis by Ledger (1995), and clustering by Greenwood (1992). A complete analysis comparing the Pauline Epistles with other texts of the New Testament can also be found on the web page of Patrick Milano. It includes snippets in the computer language Python for creating the database.
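To make the idea of supplementary points concrete: after the active correspondence analysis has been computed, a supplementary Epistle's profile is simply mapped through the column standard coordinates, so it receives a position without influencing the solution. A minimal sketch reusing the `correspondence_analysis` function above; the function name and inputs are illustrative, not taken from the sources just cited.

```python
import numpy as np

def supplementary_row(profile, G, sv):
    """Principal coordinates of a supplementary row.

    profile : counts for the extra row (e.g. a short Epistle)
    G, sv   : column principal coordinates and singular values
              from the active correspondence analysis.
    """
    h = profile / profile.sum()   # observed row profile of the extra Epistle
    G_std = G / sv                # column standard coordinates
    return h @ G_std              # position; the active solution is unchanged
```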

6.6 Other approaches to authorship studies Principal component analysis for continuous data and correspondence analysis for discrete data are common statistical procedures used for authorship studies; the latter has seen extensive use in stylometry in general. More detailed and varied applications of, and references to, correspondence analysis in stylometry can be found in papers and books written by experts in the field, such as Holmes and Kardos (2003), Oakes (2014), and Libby (2016). Correspondence analysis has also been used for the analysis of contingency tables in many other disciplines.

Fig. 6.3 Correspondence analysis: Relative frequencies of words in Colossians. The word order is according to the projections on the Colossians biplot axis in Figure 6.2. The order is different from that of Hebrews (see Figure 6.4). Words with red bars are frequent, but not for Hebrews; words with green bars are rare, but frequent for Hebrews; words with blue bars occur occasionally in both.



Fig. 6.4 Correspondence analysis: Relative frequencies of words in Hebrews. The word order is according to the projections on the Hebrews biplot axis in Figure 6.2. The order is different from that of Colossians (see Figure 6.3). Words with red bars are rare, but frequent for Colossians; words with green bars are frequent, but not for Colossians; words with blue bars occur occasionally in both.

In Chapter 13 a few other multivariate approaches towards establishing authorship will be discussed and illustrated. In particular, we will investigate a situation in which two known authors have each written part of a book series, but it is contested who wrote one specific volume, i.e. The Royal Book of Oz.

6.7 Content summary In this chapter questions about the authorship of the Pauline Epistles have been scrutinised. The investigation is entirely based on the occurrence of particular common words (function words) in the various Epistles. In accordance with most of the experts, our present analysis showed that there is a contrast between the four core missionary Epistles (Romans, Galatians, 1 Corinthians, and 2 Corinthians) on the one hand, and Colossians and Ephesians on the other. One of the reasons for this is that the latter two comparatively often have sentences in which the word kai (‘and’) appears several times, which points to long sentences concatenated by kai. Virtually all experts consider the first four Epistles to have been written by Paul, whereas the last two were not, although they were possibly written by the same unknown person. The more frequent use of the word autos in Hebrews, and a sparing use of einai and en, differentiate Hebrews from the other Epistles. The general consensus seems to be that Hebrews was not written by Paul. However, the Greek text of Hebrews may actually be a translation of Paul’s original Hebrew text, as was held by Clement of Alexandria († 215 CE), who also suggested that Luke may subsequently have translated it for the benefit of the Greeks. If so, this may cast an entirely different light on the authorship of Hebrews (Hughes, 1977, pp. 19–22).

Thus, multivariate analysis shows, in a single analysis, that the Epistles can be stylistically characterised by their use of common words and by their average sentence lengths. In contrast, Morton back in 1965 had to carry out a large number of individual, and more or less unrelated, significance tests to analyse these data, and moreover had to combine their results subjectively (Morton, 1965). From these data it is of course not possible to prove that different authors or scribes wrote the various Epistles, the more so because word usage and sentence length may also vary with genre, and the Epistles clearly represent different genres. Oakes (2014, p. 5) supports the Pauline authorship of the four core Epistles, but he also points out that word and sentence length are ‘[..]under conscious control of the author, and may be better discriminators of genre or register’. In other words, in this case statistics makes the factual situation much clearer, but the linguistic interpretation is a question for linguistics experts, as it should be.

Chapter 7

Economic history: Agricultural development on Java

Abstract Main topics. The analysis of historical data measured at different points in time and at several locations. One of the issues is how to analyse such data, given that data for the same variables and the same locations are not always available over time, and that some variables have different measurement types. Data. Historical data about the state of agriculture on Java in 1880 and 1920. Research questions. How can the agricultural landscape of Java be made visible? What similarities and differences can be observed between the four Java regions and their administrative units, the residencies? What changes can be observed between the two time points? Statistical techniques. Categorical principal component analysis, joint analyses of categorical and numeric variables, analysis of nonlinear relationships between variables. Keywords Agricultural history · Java · Indonesia · Sugarcane · Rice fields · Wet rice · Residencies · Categorical principal component analysis · catpca · Nonlinear relationships

7.1 Background In Boomgaard and Kroonenberg (2017)1 historical data about the state of the agriculture on Java (then Dutch East Indies, nowadays Indonesia), in 1880 and 1920 were used to evaluate Geertz’s involution theory (Geertz, 1961). With his theory Geertz makes claims ‘about the allegedly unlimited capacity of wet-rice fields on Java (sawahs in Javanese) for almost constant intensification during the colonial period (c. 1820–c. 1940)’ (Boomgaard & Kroonenberg, 2017, p. 56).

1 This chapter is primarily based on work done together with Prof. Peter Boomgaard, who died far too early (†10/1/2017); it is dedicated to him and his work.



In the 2017 paper most of the information used to refute Geertz’s theory was presented via bivariate correlations, bivariate graphs, and simple contingency tables, but no explicit attempt was made to statistically combine the variables into an overall picture of the agricultural situation on Java. The intention in this chapter is to show how the available data from about 1880 and 1920 can be re-examined via multivariate analysis, in order to acquire an integrated picture of the change in agriculture on Java; I will not go into the arguments around Geertz’s involution theory. The first part of this chapter has a strong methodological and statistical slant. The emphasis will be on a categorical variant of principal component analysis. The technique is discussed in more depth in Section 3.8.2 (p. 71), but I considered it appropriate to include it within the context of this case study as well.

7.2 Research questions: Historical agricultural data Historical data are almost by definition problematic: they have been collected under unknown circumstances, by unknown persons of whose diligence, rigour, accuracy, etc. we know nothing, and there is nobody left to ask about these aspects. The effect of this is that we have to take such data at face value and in good faith.

The case study clearly has an internal structure design, rather than a dependence design, because we will not attempt to predict the 1920 data from those of 1880, but want to describe the relationships among the variables within each year and between the two years; for the distinction between the designs see Section 3.5.1, p. 54. We could imagine a dependence design if we wanted to predict the results for 1920 from the data of 1880, and explicitly wanted to identify the main causes of the changes between the two years. For this we would require more detailed insight into the administrative and economic processes of change on Java, as well as a more extensive database; see Boomgaard and Kroonenberg (2017).

For the statistical analyses in this chapter we will accept the data as given. This means that, whatever their origins may be, outliers and missing data will be taken at face value without doubt as to their veracity. However, for the diligent researcher in the field it is necessary in such situations to look for other sources of information and confirmation. Analysing these historical data is a somewhat uncertain undertaking, because of the rearrangement of borders between the administrative units. Moreover, there are variables with different levels of measurement, requiring special procedures, as will be illustrated first with a miniature example. Relationships between the variables may also be nonlinear; this will be checked and handled in the discussion of the main study. Handling different measurement levels and nonlinearities requires much flexibility of the data analyst and the analysis programs. Many decisions may be difficult to justify or can easily be challenged, for example decisions on which objects and variables to include, and how to treat measurement levels, missing data, and outliers. This is of course endemic in data analysis in general, but for historical data these decisions weigh even more heavily, the more so because one cannot go back in time and contact the data collectors. It seems wise to start the investigations here with a limited number of objects and variables and only analyse the full dataset later, so that it is easier to keep control of the results. In addition, it is easier to first analyse each year separately, and only after that analyse both years simultaneously.

Measurement
Historical data are typical mixtures of variables with different measurement levels: categorical (nominal or ordinal) and numeric (interval or ratio) (see also Section 1.5.2, p. 11). To ensure a proper analysis of a dataset, careful attention should be paid to these measurement levels, because a measurement level is not necessarily the same as an analysis level. Therefore, it is advisable to first carry out a preliminary investigation to see whether the measurement and analysis levels can be the same, and especially whether the requirements of the analysis are met by the variables. For instance, a numeric measurement level is not an appropriate analysis level for loglinear analysis (see Section 3.8.5, p. 77), because that technique is intended for categorical variables. Even though the variables have different measurement levels, we would like to use all of them together in a single analysis to answer the research questions. A related problem is whether standard statistical techniques based on normality of the variable distributions can be applied. Many statistical techniques work best if the mutual relationships between the variables are linear, so that these can be represented by the standard correlation. A further problem is how to perform a multivariate analysis if the variables have nonlinear relationships. These questions will also be discussed here.

7.3 Data: Agriculture development on Java The historical data used in this chapter pertain to the state of agriculture on Java (then part of the Dutch East Indies, Nederlands-Indië) measured in 1880 and 1920. The data available relate to the larger administrative units, in particular the residencies and their subdivisions (Afdelingen).2 Given that some of the data of interest were not available at the subdivision level, and that our interest is methodological rather than substantive, I decided to use only data on the residencies for this chapter. The issues raised here occur regularly in data analysis in the humanities, and hence in this book.

2 For this chapter I have followed the present-day official transcriptions of the names of the residencies. These are not necessarily those used by the Dutch colonial administration. The lax high back rounded vowel as in book was generally transcribed by the colonial Dutch as ‘oe’ rather than ‘u’, as is now commonly done in Bahasa Indonesia.

The 1880 data were taken from the 1880 Koloniaal Verslag [Colonial Report]. The 1920 data were copied from the Landbouwatlas van Java en Madoera [Agricultural Atlas of Java and Madura] produced by the Dutch Colonial Government (see Boomgaard and Kroonenberg, 2017).3 Various administrative changes were made by the Colonial Government during the period 1880–1920, one of which was the redrawing of the borders between some of the residencies. For our data this means that certain comparisons between 1880 and 1920 are difficult to make, because there are several residencies for which complete data are not available for both years. Some considerations about this problem are discussed in Chapter 4, p. 117. The residencies are grouped into four regions (West, North Central, South Central, and East), as depicted in Figure 7.1.4 The section of the complete dataset used here relates to land use, rice and sugarcane production, livestock, and population density. The variables used in our analyses are displayed in Figure 7.5, p. 172. All variables are numeric except for those identifying residencies and regions.

3 This project was devised during a stay at the Netherlands Institute for Advanced Study in the Humanities and Social Sciences (nias), Wassenaar, during the academic year 2003–2004. Particular thanks are due to Anneke Vrins-Aarts for creating the database on which the statistical analyses are based.
4 Source: Boomgaard (1989, p. 10).

Fig. 7.1 Residencies and regions on Java. Blue numbers refer to the availability of 1880 data; red numbers to the presence of 1920 data.

Data structure
The major purpose of the analyses described in this chapter is to create an overview of the various agricultural land uses on Java, as well as their changes over time. In view of Section 1.5.1 (p. 11) it seems advisable to first investigate the situation at each of the two time points separately. Subsequently, the two partial datasets can be linked, to carry out an analysis on a combined matrix constructed from the variables common to the two time points. However, for several residencies data were not available at each time point, so that these residencies could only be included in the joint analyses for a single time point. An alternative to analysing the long matrix (see Section 1.5.1, p. 11) could have been a three-way analysis (Section 3.11.1, p. 89). Unfortunately, this technique also requires a fully crossed design (i.e. each residency has to have at least some scores on all variables in both years). The lack of data for some residencies, and for some variables in one of the years, makes any analysis less reliable, because missing data have to be estimated (see Section 2.3, p. 36). Furthermore, a limited number of residencies do not have complete information for both years (see Chapter 4, p. 117).

In Boomgaard and Kroonenberg (2017) close attention is paid to outlying observations, because some residencies, especially Yogyakarta (no. 6), showed wildly different values, leading to a correlation of 0.66 for two variables with the residency included in the calculation and 0.44 if it was not. Although it is feasible to make comparisons with and without a specific residency when variables are compared in pairs, this is more complicated in the multivariate case, as different residencies may have unusual values for different variables. Other problems occur if some variables have very skewed or irregular distributions, which makes using linear correlations inappropriate. In addition, we would want to include the location of the residencies in the analysis, which is a truly categorical variable. As if this were not enough, we are also faced with the fact that it does not make sense to treat the data as random samples from a population. The number of residencies is fixed, and all those without missing data are included in the dataset. This means that hypothesis testing is not really appropriate, because we are not generalising from a sample to a population; the analyses are therefore purely descriptive. Thus, we here aim for insight at residency level into the state of agriculture on Java in both years. In addition, both the differences and the changes between the two years are of interest.
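The influence of a single residency on a bivariate correlation, as in the 0.66 versus 0.44 example above, is easy to check by recomputing the correlation with that case left out. A minimal sketch with synthetic data (not the actual Java measurements):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=20)
y = 0.4 * x + rng.normal(scale=1.0, size=20)
x[0], y[0] = 4.0, 5.0                       # one case with wildly different values

r_all = np.corrcoef(x, y)[0, 1]             # correlation with the outlier included
r_cut = np.corrcoef(x[1:], y[1:])[0, 1]     # and with it excluded
print(round(r_all, 2), round(r_cut, 2))
```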

7.4 Analysis methods Given the descriptive aims of this case study we will briefly look at three aspects of the data: (1) their univariate characteristics, such as distributions and means, (2) the relationships between the variables, between the residencies, and jointly among the residencies and the variables, and (3) changes over time.

7.4.1 Choice of analysis method I expected some of the variables to have nonlinear relationships with other variables, and several variables seemed to have rather skewed distributions. Moreover, there is one truly categorical variable, i.e. Region of Java (West, North Central, South Central, East). Because of this I decided to tackle our questions with a very powerful but underused technique, categorical principal component analysis (catpca), which features in several chapters (see Section 3.8.2, p. 71 for more details). Because the technique is relatively unknown, and extremely useful within the humanities, I will give somewhat more explanation than usual, starting with a miniature example based on a subset of the data to illustrate some features of the procedure. However, because I only intend to give a first taste of situations involving the technique, rather than a full mathematical exposition, the reader is referred to other publications for many technical and practical details, such as Meulman and Heiser (2012). Linting et al. (2007) contains an example dealing with personality assessment, and Wulffaert et al. (2009) an explanatory example dealing with parental stress caused by disabled children.

7.4.2 CatPCA: Characteristics of the method The starting point in categorical principal component analysis (catpca) is that all variables are treated as if they were categorical. The technique assigns new values to the categories in such a way that the variables behave as if they are numeric. This process is called quantification. The quantification is restricted by the measurement level of a variable. Put differently, we want to quantify the categories and find the weights for the variables simultaneously, so that we can still find the best components available (for a detailed explanation of quantification see Section 4.18, p. 121); a more extensive discussion of categorical principal component analysis can be found in Section 3.8.2, p. 71.

7.5 Agricultural development on Java: Statistical analysis When variables in a dataset have different measurement levels, or relationships are clearly nonlinear, we have to take a number of actions to perform an adequate analysis, as indicated above. One basic approach is to transform a nonlinear relationship between two variables into a linear one. This can be done by using a specific function to transform the original values of one or both variables, so that the relationship between the variables becomes more linear. Linearity is preferred because it is the basis for correlation and regression.

Figure 7.2 is a scatterplot matrix. It is comparable to a correlation matrix, but instead of correlations it shows scatterplots of the variables. In a correlation matrix the correlation between the horizontal variable (X) and the vertical variable (Y) is the same on either side of the diagonal, but in a scatterplot matrix the X variable is always the predictor and the Y variable the response variable. Thus diametrically opposed scatterplots are mirrored, with the roles of the variables reversed. Commonly the diagonal is used to portray the distributions of the variables.

Figure 7.2 is the scatterplot matrix of Sawahs on arable land5 and Production of wet-rice per unit sawah, and shows the effects of categorisation on the distributions and, via optimal scaling, the effects on the relationship between the two variables; see Chapter 4, Quantification (p. 121) for a further explanation. The categorised distributions of the transformed variables are more regular and the outliers are less extreme; the latter are now combined into a single category. There has also been some gain in the linearity of the relationship between the two variables, but it is only slight: the linear correlation between them has changed from −0.13 to −0.22.

5 Sawah (Malay): A rice field surrounded by low dikes. During the growing season sawahs are permanently covered by water. The harvested rice is called padi (wet rice).

Fig. 7.2 Sawahs per arable land and Production of wet-rice per unit sawah: (a) numeric, (b) quantified. The blue histograms represent the distributions of the original numeric variables (left) and the quantified variables with five categories (right).
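The basic idea of transforming towards linearity can be mimicked with synthetic data: a variable that is nonlinearly related to another becomes more linearly related after, say, a log transform. A minimal sketch (invented data, not the Java measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=300)   # skewed variable
y = np.log(x) + rng.normal(scale=0.4, size=300)    # nonlinear relation with x

r_raw = np.corrcoef(x, y)[0, 1]                    # attenuated linear correlation
r_log = np.corrcoef(np.log(x), y)[0, 1]            # clearly higher after transforming
print(round(r_raw, 2), round(r_log, 2))
```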

7.5.1 Categorical principal component analysis in a miniature example In this section a categorical principal component analysis (catpca) is demonstrated in a miniature example consisting of only four variables: Region in Java (West, North Central, South Central, and East), Percentage of arable land of the total available land, Percentage of sawahs on arable land, and Wet-rice production per unit sawah. The first one is categorical, the other three are numeric. Before going into the analysis proper I refer the reader to Table 7.1, which shows how the quantification of the categorical variable Region takes shape. The numbering of the regions is entirely arbitrary, so that any set of numbers will do the job; this is true for any truly categorical variable. The table shows that if the categories of Region are reordered in accordance with the size of the Arable land means of the regions, the linear correlation between these variables jumps from 0.04 to −0.62, and even a tiny bit more (−0.68) if the numbers are replaced (quantified) with their actual arable land means. In other words, we could use Region in an analysis based on linear correlations by properly quantifying its categories, through reordering them or replacing them by the means. This is not the whole story, because quantifying the category values optimally with respect to one variable may lower the correlation with respect to other variables.



Table 7.1 Summaries of Arable land of total land area

                    Ordered by Labels          Ordered by Mean              Mean of        Projection onto
Region              N    Mean   St.Dev    Region              Mean  St.Dev  Standardised   Arable land
                                                                            Variable
1. West             8     34      16      1. West              34     16      −0.83           −1.15
2. North Central   11     65      13      2. East              34     17      −0.83           −1.13
3. South Central   11     61      20      3. South Central     61     20      −0.44            0.56
4. East             6     34      17      4. North Central     65     13      −0.62            0.88
Total              36     52      22      Total                52     22

Correlations (r) between Arable land and Region:
∗ Treating the original label numbers of the regions numerically, i.e. 1, 2, 3, 4: r = 0.04.
∗ Replacing the region labels with the rank order of the mean on Arable land: r = −0.62.
∗ Using per region the value of the mean of Arable land itself: r = −0.68.
∗ Using per region the size of the centroid projections on the Arable land vector in the biplot: see Figure 7.3.
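The third option in the list above (replacing each region label by that region's mean on Arable land) is easy to reproduce in general form. A sketch with pandas; the miniature numbers below are invented for illustration and are not the actual residency data:

```python
import pandas as pd

df = pd.DataFrame({
    "region": [1, 1, 2, 2, 3, 3, 4, 4],           # arbitrary category labels
    "arable": [30, 38, 62, 68, 55, 67, 30, 38],   # invented percentages
})
r_labels = df["region"].corr(df["arable"])         # labels treated as numbers
# quantify: replace each label by its group mean on the numeric variable
df["region_q"] = df.groupby("region")["arable"].transform("mean")
r_quant = df["region_q"].corr(df["arable"])        # correlation after quantification
print(round(r_labels, 2), round(r_quant, 2))
```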

Categorical principal component analysis will seek an optimal quantification for all variables at the same time; see Chapter 4, p. 121.

The miniature example biplot
With a slight overstatement we can say that a good analysis of a dataset such as the one presented here culminates in a biplot displaying the variables and objects (here: residencies) in their mutual relationships, both in a within fashion for objects and variables separately, and in a between fashion for objects and variables together (Figure 7.3). A good way to characterise and compare the regions is by inspecting their centroid positions on the lines or biplot axes of the variables. The centroid of a region lies in the middle of all residencies belonging to that region. These positions can be found by projecting the centroids onto the biplot axes, as indicated for the regions’ centroids on the Arable land variable; see Chapter 4, p. 108. Figure 7.4 shows the locations of the centroids on each of the three biplot axes.

In most cases the fit of the data should also be supplied. In particular, the biplot should indicate how much of the total variability is represented by the two-dimensional plot. In this case the first axis represents 56% and the second axis 37%; together they account for 72% of the total variability in the data.6 By the way, the fact that the axes account for a specific percentage does not mean that they themselves need to be meaningfully interpreted. The axes ‘span’ or ‘support’ the space, but they may not represent a particular content variable. The interpretation is determined by the variables and objects themselves, in particular by their locations in the plot.

6 Because there are both categorical and continuous variables, the percentages of variance accounted for by the components do not add up to the total variance accounted for (Meulman and Heiser, 2012).

Fig. 7.3 The miniature example: biplot. Numeric variables: Arable land of total area, Sawahs per arable land, Wet-rice production per unit sawah; categorical variable: Region. As an example the projections of the Region centroids (squares) onto the biplot axis of Arable land are drawn; see also the last column of Table 7.1.

Interpretation of the miniature example
Because the biplot axis of the variable Arable land of the total land area is roughly perpendicular to the axes of the other two variables (see Figure 7.3), it is virtually uncorrelated with them. Thus, Sawahs on arable land and Wet-rice production per unit sawah are unrelated to the amount of land in a residency that can be used for agriculture. The figure also shows that per region there are noticeable differences between residencies. For example, the residencies of West Java show more differences among themselves than do those of North Central Java. In addition we see that Arable land of the total land area separates the central area of Java from the western and eastern parts. The central residencies are shown to have larger percentages of arable land, likely due to the terrain in Central Java. The south central and eastern parts of Java are distinguished from the north central and western parts in that they have fewer sawahs, but these produce more rice per sawah unit. For a summary, see the relative positions of the regions for each of the variables in Figure 7.4. For further interpretation of the agricultural history of Java, the other available variables are needed to make more detailed statements. This is done in the next section.

Fig. 7.4 Miniature example: Biplot axes for the numeric variables. Positive values of the centroids indicate above-average scores; negative values below-average scores.
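The centroid projections shown in Figure 7.3 (and tabulated in the last column of Table 7.1) amount to a simple inner-product computation: each centroid is projected onto the direction of a variable's biplot axis. A hedged sketch; the coordinates below are invented:

```python
import numpy as np

def project_on_axis(points, axis_direction):
    """Scalar positions of points projected onto a biplot axis through the origin."""
    v = axis_direction / np.linalg.norm(axis_direction)  # unit vector along the axis
    return points @ v                                    # signed positions on the axis

centroids = np.array([[-0.9, 0.6], [0.4, 0.5], [0.7, -0.2], [-0.8, -0.7]])
arable_axis = np.array([0.8, 0.5])   # invented direction of the Arable land axis
print(project_on_axis(centroids, arable_axis))
```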

7.5.2 Main analysis To acquire a better insight into the agricultural history of Java we need to use the complete dataset, which also includes Population density, Sugarcane on sawahs, Number of sawahs per capita, Production of wet-rice per capita, Arable land per capita, and Number of livestock per 10 bau.7 Several technical and data-related details had to be taken into account for the analyses reported here. For instance, in 1880 Banyuwangi (no. 20) and Krawang (no. 27) had very odd values for livestock per 10 bau, and very small population densities. I therefore decided to exclude them from this analysis; it would require further archival research to resolve this issue. As mentioned in Section 7.3, Yogyakarta was also a serious outlier, and hence was also kept out of this analysis. Furthermore, I have not treated the numeric variables as straightforwardly numeric; instead they were analysed as ordinal with a monotone transformation. For technical details of this approach see Meulman and Heiser (2012, p. 18).

7 10 bau ≈ 7 hectares.
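Treating a numeric variable as ordinal means replacing its values by a monotone transformation chosen to improve the fit. Outside specialised catpca software, the flavour of such a monotone rescaling can be approximated with isotonic regression; a sketch using scikit-learn with synthetic data (this is not the algorithm used for the analyses in this chapter):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = np.sqrt(x) + rng.normal(scale=0.3, size=100)   # monotone but nonlinear in x

iso = IsotonicRegression(increasing=True)
x_quantified = iso.fit_transform(x, y)   # best-fitting monotone step function of x
print(round(np.corrcoef(x, y)[0, 1], 2),
      round(np.corrcoef(x_quantified, y)[0, 1], 2))
```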

Fig. 7.5 Biplot of the complete dataset with biplot axes for the numeric variables.

The categorical principal component analysis resulted in the biplot shown in Figure 7.5, from which several interesting observations can be made. The full evaluation and interpretation of these results can be found in Boomgaard (1989) and Boomgaard and Kroonenberg (2017). Here I will only call attention to some of the patterns.

Variables
Angles between the axes of the variables in the biplot represent approximate correlations between the variables. Notably, there is a very strong correlation between Population density and Arable land: much opportunity for agriculture goes together with many people per land area. Similarly, high percentages of sawahs on arable land lead to many sawahs per capita, but not necessarily to large productions of wet-rice per capita.



Changes over time
Turning our attention to the residencies, we see that for 1880 nearly all residencies are located in the north-east corner of the biplot, while for 1920 they are located in the south-west corner, which points to a considerable shift in agriculture between 1880 and 1920. For instance, comparatively little sugar was grown on the sawahs in 1880 in contrast with 1920; the large cities Batavia (no. 3) and Probolinggo (no. 32) were the exceptions. It seems not too farfetched to assume that Batavia, the present capital Jakarta, simply had too many people to feed to have space for commercial sugar crops mainly meant for export.

Fig. 7.6 Changes over time in the locations of the residencies. The same residencies are connected by arrows. For clarity the variable biplot axes are not shown. The circled residencies were only measured once.

Already in 1880 the East Java residencies Pasuruan (no. 29), Probolinggo (no. 32), and Besuki (no. 22) all seemed to have larger amounts of wet-rice per sawah, together with more livestock than other residencies. Because the data from Banyuwangi (no. 20) were excluded, we cannot tell whether this also applied to that residency. Note that a similar situation can be seen for Priangan (no. 22) in West Java.

The changes over time can be shown more dramatically by connecting the locations of the same residencies in the two years (Figure 7.6); the biplot itself is the same as in Figure 7.5. Here we see for nearly all residencies a comparable shift from Wet-rice per capita to Sugarcane on sawahs and towards more Arable land per capita. Again this needs to be analysed in more detail and validated against the analyses per year. However, such an exercise would take us too far here.

Differences within regions
A final look at the same biplot, again without the biplot axes drawn, can be obtained by colouring the residencies by region and connecting the residencies within a region for each year, leaving aside a few outliers (Figure 7.7). It is clear that the West Java residencies (blue dots) are very much spread out, that the South Central residencies (green squares) cluster strongly by year, and that this is also true for the North Central regions, although less explicitly. This implies that, per year, the residencies within a region have similar scores, with the exception of West Java.

Fig. 7.7 Regional clusters of residencies. For clarity the biplot axes for the variables are not shown.



7.5.3 Agricultural history of Java: Further methodological remarks The above analysis may look convincing, but this is unfortunately no longer the case when the residencies’ means and correlations for 1880 and 1920 are examined separately. The mean values in the two years for three of the variables are the following: Wet-rice per sawah (1880: 26.4; 1920: 25.0), Sugarcane/sawah (1880: 2.1; 1920: 10.4), and Livestock per 10 bau (1880: 4.8; 1920: 5.7). Thus, over forty years changes occurred in the residencies with respect to sugarcane. The correlations between the three variables changed even more (1880: 0.09, 0.09, 0.00; 1920: −0.24, 0.59, 0.34). Thus, proper interpretation can only take place by investigating both the separate results per year and the years together; such interpretations can be found in Boomgaard and Kroonenberg (2017). Together the three figures for the main analysis sketch an image of the historical development of Java's agriculture, although several aspects need to be further investigated by analyses per year. Furthermore, a comprehensive interpretation requires much more nonstatistical knowledge about agriculture on Java, the way in which the data were collected, and the general administrative system at the time.

7.6 Other approaches to historical data Historical data generally have the common data-matrix format of cases × variables, sometimes measured at several time periods, as were the data in this chapter. An obvious complication is that the conditions under which historical data were collected are often unknown, and they are rarely the result of controlled experiments. Moreover, the persons who collected the data can no longer be consulted to sort out unexpected anomalies. Careful further historical research will almost always be necessary for these kinds of data, the more so as over the years definitions of variables with the same names may have shifted in meaning, and cases may have changed their characteristics, as the boundaries of some of the residencies did here.

Historical data are not the only kind to have variables with different measurement levels and distributions; such properties are equally found in contemporary data and in many disciplines. Therefore, it is often necessary to look beyond straightforward principal component analysis if researchers want to jointly analyse such sets of variables. Categorical principal component analysis can be a great help in this respect, even though it requires more specialised statistical expertise. It is an example of how present-day techniques can be used to gain insight into what happened many years ago.



7.7 Content summary The agricultural data for 1880 and 1920 provide a view of some changes over time in the Dutch colonial era. Population density in the Java residencies was, not surprisingly, related to the amount of Arable land. Much opportunity for agriculture goes together with more people per land area; they are both consumers and labour. Similarly, high percentages of wet-rice fields (sawahs) on arable land lead to many sawahs per capita, but not necessarily to large productions of wet-rice (padi) per capita.

In contrast with 1920, in 1880 comparatively little sugarcane was grown on the sawahs. The growth in sugarcane production was considerable, except for population centres such as Batavia, the present capital Jakarta. It seems not too farfetched to assume that Batavia simply had too many people to feed to have space for commercial sugar crops, which were mainly meant for export. Both in 1880 and in 1920 the residencies within a region show similar scores on the agricultural variables, except for West Java. This does not indicate a lack of change in the historical development of Java's agriculture, because an almost five-fold increase occurred with respect to sugarcane in the period, but it occurred in equal measure for most residencies in a region. A comprehensive interpretation of this phenomenon requires much more nonstatistical knowledge about agriculture on Java, the data collection, and the general administrative system at the time. Proper interpretation should take place via a more thorough multivariate investigation, both of the results per year and of the years together.

Historical data are often complex because it is not always easy to know their quality, their representativeness, and the purpose for which they were collected. Therefore, it is often necessary to look beyond a straightforward analysis. A complete answer to the research questions discussed here requires more specialised statistical and subject expertise than could be presented in this chapter. However, the beauty is that up-to-date multivariate techniques may provide insight into what happened many years ago.

Chapter 8

Seriation: Graves in the Münsingen-Rain burial site

Abstract Main topics. The burial site of Münsingen-Rain is situated on a ridge and contains a large number of graves, many of which contain artefacts. The graves were dated by Hodson on the basis of their distance from the village. Is it also possible to statistically validate this dating from the artefacts in the graves? Data. An occurrence table of graves by artefacts, consisting of ones (artefact present in the grave) and zeroes (artefact absent from the grave). Research questions. Is it possible to validate the archeological chronology of the graves by applying statistical procedures to the table of occurrences? Statistical techniques. Correspondence analysis, validation of results. Keywords Celtic graves · Artefact fashion · Münsingen-Rain cemetery · Switzerland · La Tène · Seriation · Correspondence analysis

8.1 Background One of the core questions in archeological excavations is the age of the objects or artefacts found in graves.1 Not much can be done to understand the material without knowing its age. At present, chemical analysis is a main contributor to establishing the age of uncontaminated material, in particular radiocarbon (C14) dating for organic materials and thermoluminescence for pottery. However, there are other ways to establish age, especially in those cases where carbon dating is not feasible; and it is always valuable to validate the outcomes of one method via another method. For instance, extensive efforts were carried out to validate dating by tree rings via calibrations with radiocarbon dating (Clark, 1974).

1 Valuable comments on an earlier draft were made by Prof. Ruud Halbertsma, Curator Greece and Rome, National Museum of Antiquities, Leiden, The Netherlands.




In this chapter I will approach the chronology problem statistically, and compare the ordering of the graves with the dating based on archeological arguments, in particular the positions on a ridge emanating from the village of Münsingen. The key technique used in archaeological data analysis is seriation, also called artefact sequencing. There are several definitions of seriation, but here I will refer only to Liiv’s: ‘Seriation is an exploratory data analysis technique to reorder objects into a sequence along a one-dimensional continuum so that it best reveals regularity and patterning among the whole series’ (Liiv, 2010, p. 71). Note that Liiv sees seriation as useful in a wider context than only ordering in time. In a brief online introduction to seriation, Hirst (2017) makes the interesting observation that ‘Seriation is thought to be the first application of statistics in archaeology’. Today seriation is no longer the mainstay of the dating efforts of archeologists, partly because it can only be used for relative rather than absolute dating. As will be explained below, seriation based on artefacts also rests on the assumption of fashion, probably first envisaged by the Egyptologist Sir William Flinders-Petrie (1899), cited in Kendall (1971). In 1968 a cooperation among mathematicians, statisticians, and archaeologists led to a dedicated joint conference on Mathematics in the Archaeological and Historical Sciences. According to Liiv (2010, p. 77), the Proceedings, edited by Hodson, Kendall, and Tăutu (1971), to date serve as one of the most comprehensive collections of research done on archaeological seriation.

There are many other applications of statistics to archeological problems, such as finding out whether archeological objects found in a burial site are randomly distributed over the entire site, or whether they are concentrated in specific sections. When settlements are built on top of one another, stratification is an obvious way to obtain a relative dating of the settlements. This is, for instance, the case with the city of Troy, located at Hisarlik in Anatolia, Turkey, and made famous to this day by Homer. Using stratification might seem the ideal method for dating in the field, but in a level burial site that has been in use for many centuries it is not really an option. Statistical techniques such as spatial analysis can assist in the dating, as was, for instance, demonstrated by Graham (1980, 1981) for the Iron Age Hallstatt burial site in Austria. An overview of the state of affairs in the 1970s can be found in the proceedings by Hodson et al. (1971) mentioned earlier; a more recent overview was published by Liiv (2010).

8.2 Research questions: A time line for graves The Late Iron Age burial site of Münsingen-Rain is situated in the Aare valley between Bern and Thun [i.e., was roughly used between 450 BCE and 100 BCE; some 50 years before Caesar invaded Gallia] . [...]Due to the size of the burial site, the well-documented excavation and the abundance of grave goods, Münsingen is an indispensable reference for chronological aspects of the Late Iron Age. (Moghaddam, Müller, Hafner, & Lösch, 2016, p. 149)



The burial site is located in a north-south direction along terraced land running out from the village of Münsingen. It was excavated around 1906 and first published by Wiedmer-Stern (1908) (see Figure 8.1). The dating and sequencing of the graves is obviously of prime importance, so that the archaeological finds can be understood in their temporal and local context, but also in relation to other Celtic burial sites in Europe. The major work on archaeological seriation of the graves at this site was carried out by Hodson (1968). It was Kendall (1970, 1971) who started, and for a large part developed, the statistical approach to seriation for the site, using Hodson’s tables indicating which artefacts had been found in which graves.

8.3 Data: Grave contents The data for this example consist of an incidence matrix of 59 graves (rows) and 70 artefacts (columns) recovered from the graves. The entries in the matrix indicate whether an object was present (= 1) or not (= blank), hence the name ‘incidence matrix’. An alternative would have been an abundance matrix, in which the entries indicate how many artefacts of a particular kind were present in a grave. In the left-hand panel of Figure 8.2 there is no particular order either for the rows or for the columns; it was created by Kendall (1971) from Hodson’s original incidence table. A statistically reordered table is shown on the right.

Fig. 8.1 A Sunday excursion to the excavations at Münsingen-Rain (July 1906). Excursion led by Jakob Wiedmer-Stern (far left). Source: Archives of the Bernisches Historisches Museum, Bern, Switzerland; reproduced with permission of the Museum.



Fig. 8.2 Münsingen-Rain data: 59 graves (rows) and 70 artefacts (columns).
∗ Left-hand panel: Unordered incidence matrix; 1 = artefact is present, blank = artefact is absent.
∗ Right-hand panel: Statistically reordered incidence matrix.
∗ The box in the right-hand panel indicates Hodson’s Grave 48.

8.4 Analysis methods Various techniques have been applied to create ordered incidence matrices; see Liiv (2010). For a successful seriation of an incidence table it is important that there is an underlying order for both the row variable (here: graves) and the column variable (here: artefacts); we just do not know beforehand what these orders are. The graves and artefacts do not have a specific dependence structure, i.e. it is not a question of one of the variables predicting the other. Instead, we have an internal structure design in which the relationship between rows and columns is to be examined (see Section 3.10, p. 84 for an explanation of these designs). The underlying time order leads to similarities between graves and between artefacts, because graves of the same age are expected to contain approximately the same types of objects.



To recover these orders, correspondence analysis (see Section 3.10.1, p. 84) is an appropriate technique. A good introduction to correspondence analysis for archaeologists can be found in Chapters 5 and 6 of Baxter (1994) and in De Leeuw (2013);2 a good practical general introduction is Greenacre (2017).

8.5 Münsingen-Rain graves: Statistical analysis For a successful seriation there has to be a natural order underlying both graves and artefacts. That there is a natural order for the graves is not in question; in fact it is the recovery of this order that is the purpose of the exercise. For the artefacts the situation is different; we need a conjecture as to why they may be ordered.

8.5.1 Fashion as an ordering principle That there is an ordering underlying the artefacts rests on the assumption that artefacts are subject to fashion. This hypothesis was probably first posited by Sir William Flinders-Petrie (1899), as mentioned in Kendall (1971). The rationale behind it is that initially a particular form, style, or decoration of an artefact (ring, fibula, necklace, sword) does not exist; then it is created, becomes popular, and gradually drops out of favour again. This idea was reiterated by Müller et al. (2008): ‘This led [...] to the recognition that some types of ornament (for example, torcs, anklets or bracelets, [..]) were replaced sequentially over time; similarly, certain combinations of objects of personal adornment proved to be chronologically sensitive’ (p. 463). This process is illustrated in Figure 8.3. The fashion effect is that at one time certain objects of a specific form are not present in the graves, later graves do have objects of this form, and still later graves no longer have this form, but its successors in style instead. Given a long enough time span, graves at the beginning and end of the time line will not have any forms of artefacts in common. Naturally this is an idealised sketch of the situation, as some graves will contain precious objects from the past and thus introduce uncertainties into the time line of the graves. However, a sturdy analysis technique such as correspondence analysis should be able to deal with this, and even signal anomalies.

2 There are at least three almost identical papers under the same name; the 2013 version is the latest. https://escholarship.org/uc/item/3m18x7qp; accessed: 15 June 2020.



Fig. 8.3 Münsingen-Rain chronology. How fashion and time create an ordered incidence matrix.

8.5.2 Seriation How effective correspondence analysis is in the seriation of the Münsingen-Rain data is apparent from the right-hand panel of Figure 8.2. The graves and artefacts are now ordered in such a way that there is a steady diagonal from one corner to the other. As remarked earlier, this type of seriation only provides a relative dating, not an absolute one, because we do not know whether the time line starts at the bottom or at the top. Also the time intervals between sequential graves are not known. To settle these issues archaeological information is indispensable. In addition, the right-hand panel shows that in some places the ordering seems arbitrary, and that there are artefacts in graves where one would not expect them given the recovered ordering. Incidentally, it should be remarked that correspondence analysis does not necessarily provide a unique solution for the seriation itself, so that in general we will have to validate the results by other means, statistical and/or archaeological, for instance by looking for parallels from dated fields elsewhere.

One further caveat needs to be made. It is common to evaluate the quality of the results from a correspondence analysis on the basis of the proportional amount of fitted variability, referred to as inertia. More details about inertia are presented in the correspondence-analysis section of Section 3.10.1 (p. 84). Each of the coordinate axes of the analysis contributes to this inertia. In most applications the proportion of explained inertia of the first two coordinate axes is substantial, but in a large (59 × 70) and sparse table (6.6% nonzero cells) such as we have here this is typically not the case. In this study the proportion of explained inertia of the first two axes is only 14%. Such values seem to be characteristic of sparse tables and do not necessarily reflect the quality of the representation of the patterns in data such as those used here: ‘These percentages do not have the same significance when we are dealing with a contingency table because the binary coding introduces noise that reduces the proportion of variance explained associated with each eigenvalue’ [an eigenvalue represents the inertia explained by each coordinate axis] (Lebart et al., 1998, p. 77).

Fig. 8.4 Münsingen-Rain: Comparison between Hodson’s archaeological order and the statistical order from correspondence analysis. The numbering of the graves is Hodson’s. If the orders were identical, all graves would lie on the straight line. Grave Hodson no. 48 was ordered as no. 22 by the correspondence analysis.
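The reordering in the right-hand panel of Figure 8.2 can be imitated by sorting rows and columns on their first correspondence-analysis coordinate. A minimal sketch with numpy, assuming `M` is a 0/1 incidence matrix; this is an illustration, not the software used for the published analyses:

```python
import numpy as np

def seriate(M):
    """Reorder a binary incidence matrix by the first CA dimension."""
    P = M / M.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardised residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = np.argsort(U[:, 0] / np.sqrt(r))   # graves in recovered (relative) order
    cols = np.argsort(Vt[0] / np.sqrt(c))     # artefacts in fashion order
    # note: the direction of the recovered time line is arbitrary
    return M[np.ix_(rows, cols)], rows, cols
```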

8.5.3 Validation of seriation The most obvious validation of the statistical results from the correspondence-analysis seriation is to compare them with the archaeological information. As mentioned above, the Münsingen-Rain burial site is located on a ridge starting from Münsingen; it is reasonable to suppose (and validated as well) that the oldest graves are closest to the original La Tène settlement near Münsingen, and that the graves become younger further down the spur. On the basis of this and additional evidence, Hodson established a time line for the graves and numbered them according to their presumed age. The details can be found in Hodson (1968, plate 123) and Kendall (1971).
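The agreement between Hodson's archaeological ordering and a statistical ordering, reported below as a correlation of 0.95, can be quantified with a rank correlation. A sketch; the two order arrays here are synthetic placeholders, not the actual grave orderings:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
hodson_order = np.arange(59)                             # archaeological time line
ca_order = hodson_order + rng.normal(scale=3, size=59)   # noisy statistical ordering
rho, _ = spearmanr(hodson_order, ca_order)               # rank correlation
print(round(rho, 2))
```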



Fig. 8.5 Ordering Münsingen-Rain graves. Graves arranged according to their coordinates on the two correspondence-analysis axes.

Figure 8.4 shows the extremely high correlation (0.95) between Hodson’s ordering and the statistical seriation from the correspondence analysis. The numbering of the graves is Hodson’s and shows the time line of his order. This points to valid results from both Hodson’s archeological order and the correspondence analysis. About the same order was also found by Wiedmer-Stern (1908); see Meulman (1982, p. 157). The figure also shows Grave 48 as a curious outlier: in the correspondence-analysis seriation this grave was number 22. Obviously, its content must have been rather ambiguous, as the two positions in the ordering do not agree at all. Inspecting the box in Figure 8.2, we see that Grave 48 contains only two artefacts, which are relatively far apart. Further details about the artefacts can be found in Hodson et al. (1971, pp. 15, 16, 47). The ‘suspect’ Grave 48 has been the subject of a discussion between Hodson and Kendall, in which it became crystal clear that statistics alone was not enough: solid archaeological input was necessary to resolve the anomaly. Hodson also raises the point of heirlooms being present in younger graves.

Especially in the early periods of statistical seriation, one of the ways to demonstrate the outcomes was by plotting the first two coordinate axes of a correspondence analysis against each other. Kendall (1971, p. 228) made a similar plot based on his multidimensional scaling analysis. As can be seen in the figure in Kendall’s paper and in our Figure 8.5, both plots have a horseshoe shape. Technical arguments indicate that horseshoes occur when the objects (graves) have a strong ordering. This strong order can also be seen in Figure 8.2. Diaconis et al. (2008, p. 777) show that, ‘in general, a latent ordering of the data gives rise to these patterns [i.e. horseshoes] when one only has local information. That is, when only the interpoint distances for nearby points are known accurately’. In the seriation this is clearly the case, as overlapping fashions in artefacts are phenomena found in graves at adjacent time points.

What is obviously lacking from the treatment presented here is proper attention to the artefacts themselves; so far they have only been a vehicle for seriation. Hodson (1970) and Hodson et al. (1971) contain both archaeological and statistical discussions on grouping or clustering the artefacts; see also Wilcock (1995).

8.5.4 Other techniques Our analysis is in a sense a fairly conventional one for this type of seriation data. Many other clustering techniques, often in conjunction with multidimensional scaling or correspondence analysis, have been applied to the artefact data, and these data have also been used to evaluate new numerical techniques. For instance, Meulman (1982, pp. 66, 156–162) carried out extensive studies with the Münsingen-Rain data within the context of variants of correspondence analysis, and used these data to evaluate various strategies for handling missing data, referring both to the dimensional representation and to the pattern in the rearranged data matrix. In the four years preceding the writing of this chapter, Hodson (1968) was cited 20 times, 8 of them in purely mathematical treatises using the data as an example. Wilcock (1999) provides a more general overview of statistical techniques in archaeology for the preceding 25 years, including references to the Münsingen-Rain data. Finally, Baxter (1994) produced a general and practical textbook on multivariate statistics for archeologists. Even though this chapter is about seriation, the data from the Münsingen-Rain burial site also raise archeological questions; see Hodson (1968), Harding (2007) on Celtic art, and Müller et al. (2008) on social rankings of families.

8.6 Other approaches to seriation The proceedings of the Anglo-Romanian conference on Mathematics in the archaeological and historical sciences (Hodson et al., 1971, pp. 173–287) contain several examples of seriation, dealing with such diverse topics as the works of Plato, anthropomorphic statuary, epigraphy, and the travelling salesman problem. On p. 263 Hole and Shaw (1967) are cited as having given an extensive account of sequencing methods. A more recent version of the history of seriation was provided by Liiv (2010). She mentions seriation work in several disciplines and lists applications in archaeology, anthropology, cartography, graphics, sociology, sociometry, psychology, psychometrics, ecology, biology, bioinformatics, group technology, cellular manufacturing, and operations research.



8.7 Content summary The aim of this case study was to put Celtic graves from the La Tène period in a proper time order (seriation) by means of a statistical procedure. It turned out that the results from the statistical seriation procedure could indeed be validated with the archeological information about the layout of the burial site, and vice versa. The basis for the statistical seriation was the assumption that people’s attraction to specific objects (for instance, fibulae or brooches) was determined by fashion: at first the objects did not exist, then they became available, later they even became popular, and finally the objects disappeared again. Artefacts are thus a popularity indicator for the time at which their owners were buried. Of course, heirlooms from older fashions were regularly buried as well, thus creating uncertainty about the dating of the graves, as exemplified here by Grave 48: the two artefacts it contained made it difficult to place it unequivocally on the time line of the graves.

Chapter 9

Complex response data: Evaluating Marian art

Abstract Main topics. Four categories of paintings of the Virgin Mary, with two different types of content each executed in two different styles, were evaluated by student judges. Data. Each of 89 students judged 24 paintings on 10 bipolar scales. The words in the scales were descriptive, evaluative, or emotional. Research questions. The central question was whether the judgements were influenced by Style (pre-Renaissance and post-Renaissance) and type of Content (event and devotional). An additional question was whether students’ Age and Gender affected their judgements of the paintings. Another question concerned the structure of the scales, in particular whether the scales functioned in the same way for all four painting categories. Statistical techniques. Multivariate repeated-measures analysis of variance, principal component analyses, scale analysis, Cronbach’s alpha. Keywords Marian paintings · Pre- and post-Renaissance paintings · Content versus style · Aesthetic judgements · Devotional style · Analysis of variance · Principal component analysis · Multivariate repeated measures anova · Scale construction · Cronbach’s alpha

9.1 Background At the University of Dayton a group of researchers under the direction of Donald Polzella1 investigated the reactions of individuals towards different painting categories, within the context of experimental aesthetics (Polzella et al., 1998; Polzella, 2000). They were especially interested in scientifically examining viewers’ reactions to works of art. The research and data discussed in this chapter concern paintings of the Virgin Mary, one of the most frequently portrayed subjects in Western art.2 The aim of the study was to investigate the effects of content and style on a person’s appreciation of pictures of the Virgin Mary (within-subjects differences), and whether different person characteristics, such as age and gender, had an influence on these judgements (between-subjects differences). Content and Style are referred to as within factors, and Gender and Age as between factors.

A common way to carry out such investigations is to ask individuals about their views via variables in the form of rating scales; for more details on rating scales see Section 1.5.3, p. 14. To do this, we may, for instance, ask people whether they consider the paintings beautiful, via one of three types of scales: (1) variables with two answering possibilities: yes or no; (2) variables measuring the extent to which they considered the paintings beautiful, by asking respondents to give a score between ‘not at all beautiful’ and ‘extremely beautiful’; or (3) variables indicating positions between ‘very ugly’ and ‘very beautiful’ with a neutral evaluation in the middle. The first type of question is binary; the second type is unipolar, expressing how much of a property the painting possesses; and the third type is bipolar, with opposite adjectives at the scale ends. Examples of the last type are the semantic differential scales used in this study; see Section 1.5.3 (p. 14).

The aim of this case study is not to make a specific contribution to the study of Marian art as such, but to show at a largely conceptual level what kind of issues may arise in an analysis such as the one used in the Polzella study, and how to deal with them. Thus, we are here interested in a design with both within-subjects and between-subjects factors. The design is also multivariate, because there is more than one scale involved. In the terminology of Chapter 3: the design in question is a dependence design (see Section 3.6, p. 56), because we want to predict the scores on the rating scales (response variable) from the factors (predictors).

1 This chapter was co-authored by Emeritus Professor Donald Polzella, Department of Psychology, University of Dayton; [email protected]. Parts of the technical descriptions were adapted from Polzella et al. (1998).
2 For simplicity, we will consistently refer in this chapter to paintings, even though some of the artistic works considered in Polzella’s study were sculptures.

9.2 Research questions: Appreciation of Marian art As indicated above, for this study the researchers were interested in both within-subjects and between-subjects comparisons; furthermore, the interactions between within-subjects and between-subjects factors were of interest. Such comparisons are generally examined via differences in the means of the variables. In addition, other questions are relevant for studies such as this, in particular the question of which variables receive similar scores from the subjects, i.e. which variables are correlated. Such questions concentrate on the structure of the set of variables rather than on paintings or respondents. Thus, the structure of the questionnaire itself is of interest; specifically, we want to know whether the relationships between the rating scales are the same for paintings with different contents and styles. These
issues can be examined via the correlations or covariances between the variables. However, different types and measurement levels of variables require different types of statistical techniques if we are to find the desired answers. It will turn out that a seemingly innocuous, straightforward design—ask each respondent to rate each art work on a number of variables—requires careful thought about the resulting data and their statistical analysis; see Section 1.5.2 (p. 11) for a discussion about measurement levels.

9.3 Data: Appreciation of Marian art across styles and contents The paintings in this study were divided into four categories: two artistic Styles (pre-Renaissance, ‘Old’, versus Renaissance and post-Renaissance, ‘New’), with two kinds of Content for each style, i.e. depictions of Events (Annunciation or Presentation in the temple) versus Devotional images (Madonna with Child). Psychology students were asked to rate digitised colour reproductions of the paintings; of the 105 students, 89 produced usable data. For the rating scales used see Table 9.1.

Table 9.1 Bipolar seven-point rating scales

Scale   Left - 1           –   Right - 7
s1.     Simple             –   Complex
s2.     Displeasing        –   Pleasing
s3.     Ugly               –   Beautiful
s4.     Weak               –   Powerful
s5.     Clear              –   Indefinite
s6.     Makes me think     –   Makes me feel
s7.     Masculine          –   Feminine
s8.     Intimate           –   Remote
s9.     Like us            –   Superior to us
s10.    Comforting         –   Stimulating
s5R.    Indefinite         –   Clear
s7R.    Feminine           –   Masculine
s8R.    Remote             –   Intimate

For interpretational reasons the scales 5, 7, and 8 are sometimes used in their reversed forms: s5R, s7R, and s8R; see below.

The paintings (for examples, see Figure 9.1) were judged on ten seven-point bipolar rating scales (see Table 9.1); for more details on rating scales in general see Section 1.5.3, p. 14. The scales are mainly referred to by their right-hand adjective, because then high scores on the scale mean that an attribute is strongly characterised by that adjective (say, very beautiful). Some scales are also discussed in reversed form, i.e. the adjectives are swapped from the way they were in the questionnaire. This is indicated by an R: s5R, s7R, s8R. To increase the reliability of the scores, each of the four categories consisted of six paintings. Per judge the six scores were averaged within a category. Thus each student supplied a matrix of raw ratings consisting of 24 rows (paintings) by 10 columns (rating scales), which was condensed into four category means per rating scale. The students were characterised by two between-subjects factors, Gender and Age. Looking back at Section 1.5.1 (p. 11) we see that we are dealing with a three-way dataset, shown here in Figure 9.2. Thus, the data format is a three-way dataset consisting of 4 Painting categories (Style – Content combinations) × 10 Rating scales × 89 Judges.3

Fig. 9.1 Examples of Marian Art. Characterisation of the paintings. Top row: Style—pre-Renaissance (Old); Bottom row: Style—post-Renaissance (New). Left column: Content—Event: Annunciation (E); Right-hand column: Content—Devotional: Madonna with Child (D). Below, the four picture categories are labelled as OE (Old-Events), NE (New-Events), OD (Old-Devotional), ND (New-Devotional)

Fig. 9.2 Marian Art: Three-way data array. (Style – Content combinations) × Rating scales × Judges

Data design The difficulty with this kind of data is how to handle them to obtain answers to the questions formulated. Three-way data and their analysis are generally not in the forefront of many researchers’ minds. Standard program suites are not really geared towards this format, and the familiar way to view such data (for example, in program suites such as spss, sas, and Stata) is as a set of matrices next to each other (see the right-hand arrangement in Figure 9.3), often referred to as a wide arrangement. This arrangement is much more familiar and may be recognised as a multivariate repeated-measures design (see also Section 3.6.6, p. 63). The multivariate aspect is that there are ten response variables, i.e. the ten rating scales, and the term ‘repeated’ indicates that the same scales (or measurements) are repeatedly judged by each subject, once for each Style by Content combination (see also the discussion in Chapter 3 on the general linear model). The study is an example of a two-by-two factorial multivariate repeated-measures design.

3 We use ‘students’ and ‘judges’ indiscriminately.
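To make the wide arrangement concrete, here is a minimal Python sketch of the reshaping step. The analyses in this chapter were carried out with standard suites such as spss, so the data frame and column names below are hypothetical stand-ins, not files from the actual study.

    import pandas as pd

    # Hypothetical long arrangement: one row per judge x painting category,
    # holding that judge's mean score on one rating scale.
    long = pd.DataFrame({
        "judge":    [1, 1, 1, 1, 2, 2, 2, 2],
        "category": ["OE", "NE", "OD", "ND"] * 2,
        "s2_mean":  [4.5, 5.2, 4.8, 5.5, 3.9, 4.7, 4.2, 5.0],
    })

    # Wide arrangement: one row per judge and one column per category,
    # the repeated-measures layout expected by suites such as spss or Stata.
    wide = long.pivot(index="judge", columns="category", values="s2_mean")
    print(wide)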


Fig. 9.3 Marian Art: Reformatting of the data. Reformatting the data array into a two-way matrix for use in common software packages

Thus, we are here dealing with a dependence design (see Section 3.6, p. 56), because we want to understand (predict is maybe too strong) the students’ ratings (response variables) for each of the four painting categories. Moreover, we want to know whether differences in the judgements can be explained not only by the four painting categories but also by the predicting factors Age and Gender.4

9.4 Analysis method The first thing to note is that the measurement of a rating scale can be taken as a discrete scale with ordered response options: there are seven separate values for each rating scale. Such scales are mostly taken as interval scales, i.e. the differences between the consecutive scale points are assumed to be equal, so that each scale is also assumed to measure an underlying numeric concept (Section 1.5.3, p. 14). On the basis of this, the measurements are analysed as if they are numeric variables, so that the calculation of means, standard deviations, correlations, etc. is justified (see for further comments Section 1.5.2). Because Polzella’s primary concern was to discover which rating-scale means were reliably different from other means for the various painting categories, a multivariate analysis of variance was performed (see Chapter 4, p. 104). Another question was whether the rating scales had an internal structure (see Section 3.8.1, p. 66), or, put differently, whether it was possible to concentrate the variance in a few new variables (components) constructed from the observed ones, so that the relationships between the scales could be described efficiently. As explained in Section 3.7 (p. 65), such questions can be dealt with by using principal component analysis if the problem is tackled in an exploratory fashion, or factor analysis if an a priori structure is hypothesised (see Section 3.8.3, p. 73).5

4 Gender occurs twice in the dataset, once as a between-subjects factor (male and female) and once as a dependent variable, rating scale 7 (feminine-masculine).
5 Note that the word factor has different meanings in analysis of variance and in factor analysis.


In this case study, we will analyse the complete set of 4 × 10 rating scales via standard principal component analysis, to examine the effect of the design on the relationships between the dependent variables; in other words, to see whether the same rating scales stick together irrespective of the painting category, or whether the structure is different for each Style – Content combination. We will also make a brief excursion into analysing the data in their three-way format, but for a more detailed analysis of that data format the reader could consult Chapters 10 and 17.

9.5 Marian art: Statistical analysis Before tackling the research questions it is wise to examine the data by simply looking at the distributions of the rating scales and their means. This can be done by examining boxplots (see Section 2.2.4, p. 35), which will show whether we are dealing with more or less normal or with very skewed distributions, and whether any judges are serious outliers. In addition we will check whether there are missing data, i.e. whether there are judges who have not scored all the rating scales.

9.5.1 Basic data inspection Fortunately, a detailed missing-data analysis was unnecessary as there were no missing data. Given that Polzella considered that of the 105 students only 89 produced usable data, those with missing data were presumably not included in the analysis. With respect to outliers, Figure 9.4 shows that there is at least one judge (no. 84) who stands apart with extreme scores on nine of the forty rating scales. Removing this judge reduced the kurtosis for six of the nine scales on which she was an outlier. Therefore, she was excluded from further analyses. Other judges occasionally had outlying scores, and for a nonmethodological paper the question should be asked whether these had a serious influence on the results (for more information on outliers see Section 2.4, p. 40). It should be remarked that it is not a priori certain that the judge with nine outliers affected the outcomes, as no comparable analyses were carried out with and without this individual. In the case of more outliers such an investigation is highly recommended, as is the substantive inspection of any outlying observation.

Figure 9.4 also demonstrates that all distributions are reasonably well behaved; the lengths of their interquartile-range boxes are more or less equal on both sides of the median, as are their whiskers, which by definition extend maximally to 1.5× the interquartile range beyond the box ends. Thus, whatever there is in terms of skewness and kurtosis looks to be well within the acceptable range for normality. There seems, therefore, little necessity for measures to improve the unimodality and the shape of the distributions to make them amenable to further analyses. If we want to be certain, it is probably wise to look at these distributions directly.
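As an illustration of this kind of screening, the following Python sketch counts missing values and flags scores beyond the usual boxplot whiskers. The data are randomly generated stand-ins, since the original scores are not reproduced here.

    import numpy as np
    import pandas as pd

    # Stand-in for the judges x (4 x 10 scales) matrix of averaged ratings.
    rng = np.random.default_rng(0)
    ratings = pd.DataFrame(rng.integers(1, 8, size=(89, 40)).astype(float))

    # Missing-data check: how many empty cells does the worst judge have?
    print(ratings.isna().sum(axis=1).max(), "missing values at most per judge")

    # Flag values beyond the whiskers (more than 1.5 x IQR outside the box).
    q1, q3 = ratings.quantile(.25), ratings.quantile(.75)
    iqr = q3 - q1
    outlying = (ratings < q1 - 1.5 * iqr) | (ratings > q3 + 1.5 * iqr)
    print(outlying.sum(axis=1).sort_values(ascending=False).head())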


Fig. 9.4 Marian Art: Boxplots for all 4 × 10 rating scales. The order of the scales is the same for all four panels. The circled judge, no. 94, had 9 outlying scores. The numbers in this plot are a bit difficult to read, but tiny lettering was the only way to present all scales in a single plot. The relative sizes of the boxes and whiskers are particularly important to inspect globally, as are the differences in location of the centres (medians) of the boxplots.


The figure also shows that there are real interactions between the two repeated-measures factors, as none of the four panels has the same location pattern for the ten rating scales. For instance, one particular variable can have a high value for its median for OE but a low value for ND; s9OE and s9ND are cases in point. However, unravelling whatever systematic patterns are present in the means goes beyond visual inspection.

Table 9.2 Post-Renaissance depictions of Events (NE): anova results

                                         Test statistics        Age groups          Total
Scale adjectives (Left – Right)           F     p   R2 Fit   17-18    19   20+      Mean
s4.  Weak – Powerful                     0.2   .81   >.01     5.7    5.6   5.6       5.7
s3.  Ugly – Beautiful                    1.1   .37    .02     5.2    5.0   5.4       5.2
s8R. Remote – Intimate                   1.4   .24    .03     4.8    5.2   4.9       4.9
s5R. Indefinite – Clear                  0.2   .84   >.01     4.8    4.9   4.8       4.8
s9.  Like us – Superior to us            2.2   .12    .05     4.8    4.8   4.2       4.7
s2.  Displeasing – Pleasing              1.8   .17    .04     4.6    4.9   4.9       4.6
s1.  Simple – Complex                    3.6   .03    .08     4.4    4.9   5.1       4.6
s10. Comforting – Stimulating            0.3   .74    .01     4.4    4.5   4.6       4.5
s6.  Makes me think – Makes me feel      2.1   .13    .05     4.5    4.1   4.5       4.4
s7.  Masculine – Feminine                2.1   .15    .04     4.2    4.2   4.6       4.2

∗ Scales 5 and 8 are shown in their reversed forms: s5R and s8R.
∗ The table has been ordered by the total means of the scales.
∗ Size of age groups: 50, 24, 14; total = 88.
∗ The neutral point of the seven-point scales is 4.
∗ Multivariate anova test (Wilks’ Lambda): F = 1.6; p value = .06; df = 10.
∗ Standard deviations of the scale scores vary from 0.7 to 1.3. The standard errors of the total means are slightly below 0.1.

9.5.2 A miniature example We will first analyse part of the data so as to explain a few general concepts. To this end we will first perform a few analyses on the NE data, and investigate the between-subjects factor Age. Age has been reduced to three ordinal categories from the original eight values, because the three highest ages only occur once, which is hardly surprising as older students are scarce; this is an example of binning; see Section 4.4 (p. 105). Thus, Age now has three categories: 17-18 (1), 19 (2), and 20+ (3). Because we now have a factor or categorical predictor variable with more than two categories, we need to perform an analysis of variance (anova) to compare the means of the age groups. The design is still multivariate, because we have ten rating scales. What does Table 9.2 tell us about the students’ opinions about the post-Renaissance paintings depicting events? The most obvious conclusion must be that differences in Age do not matter much; p ≥ .12 for all but one test. The overall multivariate test is not significant (p = .06), and the only significant univariate one is s1: Simple-Complex (p = .03), but this is to be expected with multiple testing: purely by chance one rating scale may have a significant mean difference between the age groups. On the basis of the total average the NE paintings are (in decreasing order; see the last column in Table 9.2) Powerful, Beautiful, Intimate, Clear, Superior to us, Pleasing, Complex, Stimulating, Makes me feel, and Feminine. Of course there is spread among the students, but the spread is not related to their age. Note that several scales have means close to 4 (the neutral point), and on average the students are not very outspoken on the scales Stimulating, Makes me feel, and Feminine.

Table 9.3 Post-Renaissance depictions of Events (NE): Correlations

                                       s4   s8R    s3    s2   s5R    s6     s9    s1    s7   s10
s4.  Weak – Powerful                  1.0    .4    .5    .3    .3    .3     .3    .2   -.1    .2
s8R. Remote – Intimate                 .4   1.0    .4    .3    .3    .2     .1    .1   -.1   -.1
s3.  Ugly – Beautiful                  .5    .4   1.0    .7    .3    .3    -.1    .1    .2    .1
s2.  Displeasing – Pleasing            .3    .3    .7   1.0    .3    .3     .0    .0    .3   -.1
s5R. Indefinite – Clear                .3    .3    .3    .3   1.0    .3    -.1    .0    .1    .0
s6.  Makes me think – Makes me feel    .3    .2    .3    .3    .3   1.0     .3    .1    .0   -.1
s9.  Like us – Superior to us          .3    .1   -.1    .0   -.1    .3    1.0    .0   -.2   -.1
s1.  Simple – Complex                  .2    .1    .1    .0    .0    .1     .0   1.0    .1    .2
s7.  Masculine – Feminine             -.1   -.1    .2    .3    .1    .0    -.2    .1   1.0    .0
s10. Comforting – Stimulating          .2   -.1    .1   -.1    .0   -.1    -.1    .2    .0   1.0

To make its largest correlations positive, the scale s5 was reversed and designated s5R. Thus, the adjectives were reversed as well: 7 (indefinite) → 1, 1 (clear) → 7, etc.

From these means we cannot conclude that the students who think that paintings are powerful also think that they are beautiful. It is the correlations which provide this information, and these are shown in Table 9.3. To make the situation clearer we have now rearranged the scales in such a way that the scales which have reasonably sized correlations among themselves are grouped together, and s5 (Clear-Indefinite) has been reversed so that its larger correlations are positive. The scale should be interpreted as s5R (Indefinite-Clear). The correlation matrix shows that there is a set of scales that have reasonably high correlations among themselves, in particular Powerful, Intimate, Beautiful, Pleasing, Clear, and Makes me feel, which shows that the students scored these scales in a similar manner. Those who scored high on one of these scales also tended to score high on the others. Several scales have a few negative correlations with other scales, but these are too small to interpret, and the last four scales are not really correlated with any other scale. Note that the above statements could easily be made because of the layout of the correlation matrix; for another example with such a layout see Table 3.6, p. 97.
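The rearrangement itself is easily scripted. Below is a minimal Python sketch, using simulated stand-in scores, of reversing scales on the 1-7 range and reordering the correlation matrix as in Table 9.3.

    import numpy as np
    import pandas as pd

    # Stand-in for the 88 x 10 matrix of NE ratings.
    rng = np.random.default_rng(1)
    df = pd.DataFrame(rng.integers(1, 8, size=(88, 10)),
                      columns=[f"s{i}" for i in range(1, 11)])

    # Reverse selected scales so that their larger correlations are positive.
    for s in ["s5", "s7", "s8"]:
        df[s + "R"] = 8 - df[s]

    # Correlations rounded to one decimal, with the strongly related
    # scales listed first, as in the rearranged Table 9.3.
    order = ["s4", "s8R", "s3", "s2", "s5R", "s6", "s9", "s1", "s7", "s10"]
    print(df[order].corr().round(1))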

9.5.3 Evaluating differences in means Because it is difficult to simply look at means and compare their differences for both types of factors, it is useful to package the means according to the factors and compare these packages of means. It should be remembered that people are extremely good at discerning patterns that are not there—just think of Rorschach ink blots and cloud patterns in the sky. Thus, using objective procedures to perform the numeric evaluations is an absolute must. An appropriate statistical procedure for the Marian art data is a sophisticated form of analysis of variance which goes under the name multivariate repeated-measures analysis of variance with two within-subjects and two between-subjects factors. The within-subjects factors are Content and Style, and the between-subjects factors Gender and Age. This is an analysis with many pitfalls, which are explained in standard multivariate statistics books such as Field (2017), Tabachnick and Fidell (2013), and Hair et al. (2018); see also Chapter 4, p. 104. These books will provide statistical and computational guidance. Here we will concentrate on the outcomes and the kind of information that can be obtained from them. Note, by the way, that repeated means that the same variable (here: rating scale) is measured more than once. It does not automatically imply that there is a time factor involved.
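For readers who want to try this type of analysis outside spss, the sketch below runs a univariate two-way repeated-measures anova for a single scale with statsmodels. The data are simulated stand-ins, and the full multivariate version including the between-subjects factors Gender and Age needs more specialised routines.

    import numpy as np
    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    # Hypothetical long-format data: one row per judge x Style x Content
    # cell, holding that judge's mean score on one rating scale (here s2).
    rng = np.random.default_rng(2)
    rows = [(j, s, c, rng.normal(4.6, 1.0))
            for j in range(88)
            for s in ["old", "new"]
            for c in ["event", "devotional"]]
    long = pd.DataFrame(rows, columns=["judge", "style", "content", "s2"])

    # Tests the Style and Content main effects and their interaction.
    print(AnovaRM(long, depvar="s2", subject="judge",
                  within=["style", "content"]).fit())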

Multivariate tests The first overall question is whether there are differences at all; the next one is whether there are main effects and interaction effects for the within-subjects factors. Table 9.4 provides the necessary information. It is tempting to try to interpret the statistics in Table 9.4 and be done with it. Unfortunately, in our case there are several problems. The first is that we do not have a great many subjects. Stevens, for instance (quoted in Field, 2017, p. 604), recommends using fewer than ten dependent variables unless sample sizes are large. Our 88 subjects do not really form a large sample. Moreover, we do not have a simple repeated-measures design, because there are two within-subjects factors, Content and Style, as well. Luckily they only have two levels each. Then we have two between-subjects factors, Gender and Age. The latter make up six (2 × 3) groups, after Age has been reduced to three categories from the original eight, as the three highest ages only occur once; thus, on average there are about 14 individuals per group. From this situation we have to conclude that, whatever results we get, they cannot be very robust, and we will have to be careful not to overinterpret them. However, as these data are all we (and Polzella and his colleagues) have, we will just have to proceed with caution.


Table 9.4 Marian Art: Multivariate repeated-measures analysis of variance tests

Effect                            w and/or b       Value    F    Hypothesis df   Error df     p    Partial η2

Main effects
Style                             w                 .8     31         10             73      .00      .81
Content                           w                 .8     27         10             73      .00      .79

Two-way interactions
Style × Content                   w × w             .5      6         10             73      .00      .46
Content × Gender                  w × b             .2      2         10             73      .06      .20
Style × Gender                    w × b             .2      2         10             73      .08      .20
Style × Age                       w × b             .3      2         20            148      .09      .17
Content × Age                     w × b             .2      1         20            148      .67      .10

Three-way interactions
Content × Gender × Age            w × b × b         .4      2         20            148      .03      .19
Style × Gender × Age              w × b × b         .3      1         20            148      .26      .14
Style × Content × Gender          w × w × b         .1      1         10             73      .42      .12
Style × Content × Age             w × w × b         .2      1         20            148      .41      .12

Four-way interaction
Style × Content × Gender × Age    w × w × b × b     .2      1         20            148      .75      .09

w = within-subjects factor (a repeated factor); b = between-subjects factor

Table 9.4 provides the multivariate tests for each factor and their interactions for the ten response variables (rating scales) together. All clearly significant multivariate tests refer to the within-subjects factors and also have reasonable to good effect sizes (Partial η2).6 Significance tests tell us only that something is going on, but do not provide the information we really need for interpretation. From an analytical point of view, but not necessarily a substantive one, it turns out that all but one of the main and interaction effects involving the between-subjects factors (Gender and Age) are nonsignificant, with reasonably large p values as well. This leads to some doubt about whether the differences involving the between-subjects factors are replicable in a new study, and therefore interpretations of these differences remain uncertain. The smallest p value involving a between-subjects factor is .03, for the three-way interaction Content × Gender × Age; we will look later at the univariate tests for the rating scales to evaluate this interaction. The next smallest p value involving a between-subjects factor (Content × Gender) is .06, but there are no between-subjects factors that have a significant univariate test or a large effect size. Note that there is no large interaction between Style and either of the between-subjects factors. The overall conclusion is that age and gender differences between individuals do not matter too much when people judge Marian paintings.

6 In general I prefer Total η2 for effect sizes, but they are not supplied by spss (with which we have carried out the analysis); however, it should be said that they can be calculated from the output.


Univariate information related to significant multivariate tests The next step is to see which of the response variables are different for the 2 × 2 within-subjects factors, Content and Style. Table 9.5, which only includes the significant and almost significant effects, shows that all univariate main effects are clearly significant, except for s1 (simple-complex) for the Style factor, and s3 (ugly-beautiful) for the Content factor. Three of the variables (s1, simple-complex; s2, displeasing-pleasing; s3, ugly-beautiful) do not reach significance for the Style by Content interaction. As measured by partial η2, the other scales have really small effect sizes (.05 - .13) irrespective of their significance. Only two interactions between the within-subjects and the between-subjects factors reached significance, but with small effect sizes (partial η2 of only .11 and .13).

We still do not have a good view of what is going on with the within-subjects factors. The obvious way to obtain this is to look at the means themselves; analysis of variance is after all a technique to examine means. The most effective way to do this is to examine the Content × Style interactions for each response variable (= Rating scale). Commonly this is done via the estimated means rather than the observed ones, in a so-called interaction plot, also called a line graph. In such a plot we can evaluate both the main effect and the interaction. In Figure 9.5 we have presented two examples of such plots as an illustration, in particular those of s1 (simple-complex) and s5 (clear-indefinite); a minimal plotting sketch follows after the lists below. Inspection of Figure 9.5 shows, in accordance with the univariate tests, the following for the rating scales s1 and s5:

s1 (Simple-Complex)
∗ There is no Style effect, as evidenced by overlapping confidence intervals for both event and devotional paintings. This means that although for both contents the pre-Renaissance paintings are judged as slightly more complex than the post-Renaissance paintings, this difference is not sizeable enough to be interpretable.
∗ There is a strong Content effect (a large difference in complexity between the two contents for each style). In particular, the event paintings are seen as more complex than the devotional ones.
∗ There is no interaction between the Style and Content factors, as shown by the parallel lines. Thus, the differences in complexity between pre- and post-Renaissance paintings are the same for paintings depicting devotional and event content.

s5 (Clear-Indefinite)
∗ The newer post-Renaissance paintings have significantly higher means than the pre-Renaissance paintings, irrespective of content. Their confidence intervals do not overlap.
∗ The means of event paintings tend more towards indefinite than those of devotional paintings, irrespective of style.
∗ There is a clearly visible interaction—the lines do not run parallel. This implies that the difference between the two styles is larger for event paintings than for devotional paintings.
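Drawing such a plot is straightforward; here is a minimal matplotlib sketch with hypothetical cell means (not the estimated means from the actual analysis) in the spirit of Figure 9.5.

    import matplotlib.pyplot as plt

    # Hypothetical estimated cell means for one rating scale.
    means = {"pre-Renaissance":  {"event": 4.9, "devotional": 3.8},
             "post-Renaissance": {"event": 4.7, "devotional": 3.6}}

    # One line per Style level across the Content levels; parallel lines
    # indicate the absence of a Style x Content interaction.
    for style, cells in means.items():
        plt.plot(list(cells), list(cells.values()), marker="o", label=style)
    plt.ylabel("estimated mean (1-7)")
    plt.legend()
    plt.title("Interaction plot / line graph")
    plt.show()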


Table 9.5 Marian Art: Univariate analysis of variance tests for significant multivariate tests. Only significant effects are shown

No.    Scale                   F      p    Partial η2

Style (Old – New)
s2     Pleasing              197    .00       .71
s3     Beautiful             303    .00       .79
s4     Powerful               56    .00       .41
s5     Indefinite            101    .00       .55
s6     Makes me feel          62    .00       .43
s7     Feminine               87    .00       .52
s8     Remote                180    .00       .69
s9     Superior to us         75    .00       .48
s10    Stimulating            37    .00       .31

Content (Event – Devotional)
s1     Complex               177    .00       .68
s2     Pleasing               59    .00       .42
s4     Powerful               36    .00       .31
s5     Indefinite             26    .00       .24
s6     Makes me feel          47    .00       .36
s7     Feminine               59    .00       .42
s8     Remote                 23    .00       .22
s9     Superior to us         71    .00       .46
s10    Stimulating           144    .00       .64

Style by Content
s4     Powerful               15    .00       .16
s5     Indefinite             12    .00       .13
s6     Makes me feel          13    .00       .13
s7     Feminine                7    .01       .08
s8     Remote                  5    .03       .06
s9     Superior to us         13    .00       .13
s10    Stimulating             4    .05       .05

Content by Gender by Age
s5     Indefinite              6    .00       .13
s6     Makes me feel           5    .01       .11

Interpretation and validity of the repeated-measures analysis The repeated-measures analysis has shown its worth in getting to grips with the differences in the means, but it has required a considerable statistical interpretation effort, and we have not even gone into the heart of discovering the meaning of these differences for a substantive interpretation. One question that has not yet been tackled is whether it is possible to simplify the interpretation by first examining the structure of the ten rating scales, especially in connection with the existing interactions. Moreover, we have not really established that it is worth carrying out a multivariate analysis of variance. Such an analysis relies on the existence of correlations between the response variables.


Fig. 9.5 Marian Art: Interaction plots/Line graphs of within-subjects factors. (a) s1: Simple – Complex; (b) s5: Clear – Indefinite. Style: 1 (blue) = Pre-Renaissance; 2 (red) = (Post-)Renaissance; black line = Observed grand mean

9.5.4 Examining consistency of relations between the response variables Given these problems, our next step is a foray into the correlational structure of the rating scales. The purpose is twofold. The first is to see whether the relationships between the scales are the same for each of the within-subjects painting categories.


Ideally we would want the meaning of the scales to be the same for all four painting categories. If they are the same, their mutual correlations should be roughly the same, and also independent of the different styles or contents of the paintings. The interpretation would become highly complicated if for one painting category high scores on beautiful and on pleasing went together, and for another painting category high scores on beautiful went together with displeasing. This would be unexpected and, of course, it would then need to be explained why the correlation between the two rating scales differs between painting categories. The second purpose, related to the first, is to see whether it would be possible to group several rating scales with similar meanings, so that we can interpret the rating scales as a group rather than each one separately. Grouping the scales can increase the reliability of the conclusions and simplify interpretation. Both purposes can be realised by examining the four correlation matrices. However, inspecting four 10 by 10 correlation matrices by eye is not easily done, as we have seen in the miniature example. A better procedure is to carry out a principal component analysis on all 40 scales together and, if necessary, on separate painting categories (see Section 3.8.1, p. 66 for an explanation of principal component analysis).

9.5.5 Principal component analyses: All painting categories First, we performed a principal component analysis on all 40 rating scales together. With respect to the variables-to-subjects ratio (40 variables and only 88 judges) this is not ideal, but it is an acceptable strategy to acquire an overview of the relations between the scales for the four painting categories together.

Fit of principal component analysis Investigating the cumulative fit of the principal component analyses for increasing numbers of components shows that the variance accounted for is 17% for the 1-component model, 27% (2 components), 36% (3 components), 43% (4 components), and 48% (5 components). At first sight this does not seem very promising for a relatively simple structure, given that there are only 10 different scales. We would much prefer the first few components to account for nearly all variability in the data.
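Cumulative fit figures of this kind follow directly from the eigenvalues; here is a minimal scikit-learn sketch on standardised stand-in data (the real analysis used the 88 judges × 40 scales matrix):

    import numpy as np
    from sklearn.decomposition import PCA

    # Stand-in for the standardised 88 x 40 matrix of ratings.
    rng = np.random.default_rng(3)
    X = rng.normal(size=(88, 40))
    X = (X - X.mean(axis=0)) / X.std(axis=0)

    # Cumulative proportion of variance accounted for by 1..5 components.
    pca = PCA(n_components=5).fit(X)
    print(np.cumsum(pca.explained_variance_ratio_).round(2))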

Overall result Figure 9.6 shows the resulting variable loadings (in principal coordinates; see Chapter 4, p. 124) for the 40 scales. To facilitate inspection, we have coloured the scales as follows: s1 (orange); s2, s3, s4 (red); s6, s7R, s8R (green); s5, s9, s10 (blue); for the meaning of the scales see Table 9.1, p. 189. The problem is that the four painting categories are rated rather differently and there is an interaction between the within-subjects factors and the rating scales, which makes an easy evaluation of the rating scales as a measuring instrument rather complicated.

Fig. 9.6 Marian Art: Component plot for the rating scales: all painting categories.

Grouping rating scales If we pay attention to the colours, a closer inspection of the graph shows that per colour the rating scales more or less form a group of lines pointing in the same direction for three of the painting categories. However, the behaviour of the rating scales for the (Post-)Renaissance-Event (NE) category is different (its scale names are boxed in the graph).


Fig. 9.7 Marian Art: Component plot for the rating scales without the NE category. The numbers outside the circles refer to the scales within the circles.

Grouping of painting categories For each painting category the spread of the scales is somewhat different. The direction of a scale varies with each painting category, although the same scales tend to stick together. Because the NE category of painting seems somewhat different, it might be a good idea to remake the graph for an analysis without this category.

Principal component analysis without the NE category The principal component analysis without the NE category indicates that for increasing numbers of components the variance accounted for is 21% for the 1-component model, 32% (2 components), 44% (3 components), 57% (4 components) and 62% (5 components). Per model this is somewhat more than for the scales from all categories.


Figure 9.7 displays the first two components of the solution without the NE category. In this graph the differences between the painting categories are now much more pronounced. The judgements on the rating scales for OD and OE are more or less similar and are clearly different from the judgements on ND. However, the first rating scale (s1: simple-complex) does not seem to fit well in the patterns, as it is located close to the centre of the plot for all paintings. Overall it seems worthwhile to pursue a detailed analysis of the four painting categories separately and investigate to what extent the rating scales have equal underlying dimensions for all styles of paintings. We should also look somewhat closer at the NE category.

9.5.6 Principal component analysis: Per painting category One of our central questions was whether the students used the rating scales consistently across painting categories. In the previous section we were left with a rather confusing situation and concluded that there is an inconsistency between categories, but it was difficult to gauge the finer details. This we will do by inspecting the principal component analyses for the individual painting categories. In the further evaluation of the situation it is good to emphasise that high loadings (component-scale correlations) for the scales on the same component mean that the order of the students’ judgements on these scales was roughly the same, and the higher the loadings, the more similar these orders were. From Table 9.6 we see that scales s2, s3, s4, and s8R do show such high loadings. These four scales are used to express evaluations of the paintings. For all four rating scales low scores indicate a negative evaluation, and high scores a positive evaluation. Considering the high loadings of these four scales on the first component for all painting categories, students used these scales in a consistent way.

9.5.7 Scale analysis: Cronbach’s alpha Because the correlation pattern on the first component is the same across all within-subjects combinations, it is worthwhile to check whether an evaluative scale can be found common to all types of paintings. Such an analysis is often carried out by evaluating Cronbach’s alpha coefficient; see Cronbach (1951).7 Cronbach’s alpha coefficient for the four evaluative scales gives the following values: OE, α = .80; OD, α = .85; ND, α = .88; and NE, α = .76. This suggests that we may combine these four scales into a single scale that works the same for all painting categories. An appropriate name for the scale might be Evaluation.

7 Within psychometrics detailed technical critique is available on the use and interpretation of Cronbach’s alpha coefficient, especially in connection with test development. It would, however, go too far here to discuss the issue; for further information see, e.g. Sijtsma (2009) and Taber (2018).
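Cronbach’s alpha is simple to compute from first principles with the standard formula α = k/(k − 1) · (1 − Σ item variances / variance of the sum score). A minimal sketch on simulated stand-in scores for four evaluative scales:

    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        """Cronbach's alpha for an n_subjects x n_items score matrix."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1).sum()
        total_var = items.sum(axis=1).var(ddof=1)
        return k / (k - 1) * (1 - item_vars / total_var)

    # Simulated correlated scores of 88 judges on four evaluative scales.
    rng = np.random.default_rng(4)
    base = rng.normal(5, 1, size=(88, 1))
    scores = base + rng.normal(0, .7, size=(88, 4))
    print(round(cronbach_alpha(scores), 2))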


Table 9.6 Separate pcas for the four painting categories

Rows (scales): s2. Displeasing-Pleasing; s3. Ugly-Beautiful; s4. Weak-Powerful; s8R. Remote-Intimate; s1. Simple-Complex; s5. Clear-Indefinite; s6. Makes me think-Makes me feel; s7R. Feminine-Masculine; s9. Like us-Superior to us; s10. Comforting-Stimulating. Columns: three components of loadings for each painting category (OE, OD, ND, NE).

Proportion accounted for per component: OE: .30, .19, .12; OD: .36, .18, .11; ND: .43, .16, .10; NE: .28, .15, .13.

Absolute values less than or equal to .2 have been blanked out. Absolute values greater than or equal to .5 have been set in bold. For consistency with the other scales, both s7 and s8 have been reversed.

The results of the component analyses in Table 9.6 also show that it is not possible to create general scales using a subset of other rating scales, because there is no single consistent correlation pattern across the painting categories. The best available might be a combination of s1, s7R, s9 and s10, which gives alphas around .50 for OE, OD, ND, and .18 for NE. This is insufficient for combining these rating scales into a single concept. Thus we cannot claim that the ten scales consist of a limited number of components or factors with which it will be possible to evaluate paintings independently of the Style of painting and Content that is displayed. Apart from the rating scales forming the evaluative component, all other scales measure aspects that are influenced by the subject matter.

9.5.8 Structure of the questionnaire Evaluation rating scales During the analyses above it turned out that some of the ten scales were used in a consistent way by the judges irrespective of the painting category that was judged. It was especially evaluative scales such as ‘pleasing’ (s2), ‘beautiful’ (s3), ‘powerful’ (s4), and ‘intimate’ (s8R) that were used in the same manner by the judges for any painting category. In other words, the evaluative scales could be used for comparing pictures of different categories. Because the correlations between the four items were all positive

9.5 Marian art: Statistical analys

207

Fig. 9.8 Marian Art: Evaluation of the Painting categories. pre-Renaissance Events (OE) and Devotional (OD); post-Renaissance Events (NE) and Devotional (ND)

for each painting category and formed a consistent scale, the items could be added per category to form a single Evaluation variable via which to compare the appreciations of the four painting categories. It is useful to make an overview of the relationships between the evaluative rating scales per painting category. This can most easily be done via a principal component analysis of these rating scales; see Figure 9.8. What is most striking about their mutual positive correlations is that the items all point in the direction of the first component. In addition, they cluster by category with only few exceptions. Note that they also cluster per Style, which supports the idea of creating an aggregated Evaluation variable for all categories from their four rating scales. The four evaluation variables were judged in the same way by the students irrespective of gender and age. All effects of gender for the post-Renaissance Content

208

9 Complex response data: Evaluating Marian art

Fig. 9.9 Marian Art: Components of the analysis of the evaluative rating scales only

factor were not significant (except for the main effects), but given its low effect size the significant effect will not be discussed and we only show the graph of the evaluation means for the painting categories (Figure 9.9). Because we now only have four evaluation variables we can easily see how the painting categories are generally appreciated, see Figure 9.9. The mean appreciation is higher for the post-Renaissance paintings than for the pre-Renaissance ones, but there is no difference between devotional and event paintings for either style category.

The other rating scales Unfortunately no consistent scale could be found for the other items. Some were used in different ways depending on what was depicted in the painting. The concern is not that painting categories are judged differently—this is only natural. The problem is that for pre-Renaissance/Event paintings the judges who would give high/low scores on ‘makes me feel’ (s6) would also give high/low scores on ‘superior to us’ (s9) and vice versa, leading to a high correlation between the two scales. For post-Renaissance/Devotional paintings the same judges gave high/low scores on ‘makes me feel’ but low/high scores on ‘superior to us’ and vice versa, leading to a negative correlation. We could say that the two scales themselves had different meanings and relations to each other, depending on which painting category was judged. However, it then becomes impossible to separate the meaning of the scale and the sentiment that the paintings evoke, and comparison between categories becomes rather difficult. Similarly, the judges considered devotional paintings ‘feminine’ (s7R), ‘like themselves’ (s9), and ‘comforting’ (s10) in a way that again led to high positive associations between the items, but such consistently high correlations are missing for the event paintings. In addition, for devotional paintings it was especially s1 (simple-complex) and s5 (clear-indefinite) that were used in a similar way, but this was not necessarily true for the event paintings.

Polzella’s study in experimental aesthetics shows that the design of his questionnaire for evaluating artworks was only partially successful. The development of a compact set of questions for judging paintings is not a trivial task, and more work is needed. In this chapter I have not tried to go deeper into all the interpretations which can be made on the basis of these data; for those the reader should consult the two papers by Polzella: Polzella et al. (1998) and Polzella (2000).

9.6 Other approaches to complex response data There are several alternative options for the analysis of this three-way dataset, but they require more advanced skills, in particular a three-mode component analysis for all within-subjects styles together (see Chapter 10 and Section 3.11.1, p. 89). The preliminary three-mode analyses on these data showed that the results of such an analysis did not provide more insights than had already been gained from the analyses presented above. In addition, the technique is also directed towards individual differences between the subjects, but these were not the focus of the analysis here. Finally, the analysis assumes that there are common relationships for the ten items of the questionnaire across the four painting categories, but we have seen in the previous paragraphs that this was not the case. Moreover, the preliminary three-mode analysis did not account for a larger part of the total variation than did the analyses above. Thus, from that point of view the three-way results were not satisfactory either. A second option would be to perform a multigroup confirmatory factor analysis, which is a special case of structural equation modelling (see also Chapter 12, and Section 3.8.4, p. 75). However, a much larger sample is needed for this technique, preferably 200 to 300 persons or more, which makes it not really a feasible option for the present dataset.

9.7 Content summary Apart from several methodological issues, the major conclusion with respect to the Marian paintings is that the students were more positive and appreciative about the post-Renaissance paintings than about the pre-Renaissance ones. Their judgements were not related to gender and age. Whether the paintings were devotional or depicted a specific event related to Mary did not seem to matter much with respect to the overall appreciation. Such an assessment should be measured by other rating scales than those used in this study. It was difficult to make clear statements about the appreciation of the different painting categories, because it could not unequivocally be established that the adjectives measured by the rating scales meant the same for all painting categories. This is something that merits further investigation.

Chapter 10

Rating scales: Craquelure and pictorial stylometry

Abstract Main topics. Paintings acquire cracks over time, but the way in which they do varies with the surface on which they were painted, and the painting materials used. Cracks in paintings from different traditions in Europe were characterised by both expert and inexperienced judges. Data. Small isolated segments of 40 paintings from four different areas in Europe were photographed and used for the judgements. No information about the paintings was provided. The 50 cm2 parts were judged on seven characteristics by the twenty-seven judges, creating a 40 × 7 × 27 data block. Research questions. Can art-historical categories from different countries be distinguished via the judges’ subjective scores? Do the different materials on which the paintings were made influence the cracks in such a way that their differences can be recovered from the judgements? Do experts and inexperienced persons judge alike? Statistical techniques. Three-way data analysis, three-mode principal component analysis, discriminant analysis. Keywords Craquelure · Cracks in paintings · 14th–18th-century paintings · Three-way data · Three-mode principal component analysis · Joint biplot · Discriminant analysis

10.1 Background “Craquelure is the pattern of cracks that develop across a painting with age. Cracks in the embrittled paint layer grow as the canvas or wood support of the painting moves in response to fluctuating humidity and temperature. Connoisseurs have long regarded craquelure as a significant feature of old paintings”. [...] “The pattern of cracks is a visible record of the physical tensions within the structure of the painting. The ways in which tensions are generated and dissipated are dependent upon the choice of materials and methods of construction employed by the artist. Craquelure is therefore related to the technical tradition in which a particular artist worked” (Bucklow, 1998, pp. 503–504).1

“Craquelure (French: craquelé, Italian: crettatura) is the fine pattern of dense ‘cracking’ formed on the surface of materials, either as part of the process of ageing or of their original formation or production. The term is most often used to refer to tempera or oil paintings, where it is a sign of age that is also sometimes induced in forgeries” (see Wikipedia, entry Craquelure).

From the above citations it is clear that cracks in paintings are more than a sign of age and a nuisance to the museum visitor: they carry important artistic and historical information. Because of the way they develop, patterns of cracks can be considered an element of style for a particular painter or a group of painters, although an involuntary and unconscious one. Thus, patterns of cracks can contribute to the attribution of a painting to a particular painter. There is a parallel with function words in linguistic stylometry, where such words can help to make statements about authorship (see Chapters 6 and 13). Both in linguistic and pictorial stylometry the style elements are not the result of conscious decisions of the authors, but still have the idiosyncratic signature of the artist. While forgery is not a primary concern in literature, it certainly is in the art world, especially considering the high value of works by renowned painters on the open market. Therefore, developing a system by which to classify characteristics of cracks is a serious business and an important topic within the world of paintings and pictorial stylometry. An example of the use of craquelure for helping to detect a forgery is the investigation of the Pittsburgh Carnegie Museum of Art’s The Virgin and Child with an Angel by (supposedly) Francia (dated 1492) versus the London National Gallery’s copy of the same painting. Of the latter it was noted ‘What was immediately apparent on looking at the Pittsburgh picture was that the paint had cracked in exactly the way one would expect a fifteenth-century oil painting on wood to crack’, whereas some cracks were painted on in the London picture (Roy & Mancini, 2010, Note 42, p. 77). Given the focus of this book we will concentrate on procedures to evaluate the effectiveness of using craquelure to place paintings within the appropriate, mostly national, context, basing ourselves on the work of Bucklow and others. In other words, in this chapter we attempt to show how the study of craquelure functions as a way to establish authorship, and how it can succeed by using multivariate analysis.

1 This chapter was co-authored by Prof. Spike Bucklow of the Hamilton Kerr Institute, Fitzwilliam Museum, University of Cambridge, UK; [email protected]. Parts of the technical descriptions are adapted from Bucklow (1998).


10.2 Research questions: Linking craquelure, paintings, and judges When Bucklow (1998) developed a system for characterising cracks by quantifying their shapes and patterns, it was essential to show that his descriptions were indeed able to place paintings within the correct tradition. In terms of multivariate analysis, the questions were whether multivariate techniques such as cluster analysis (see Section 3.9.3, p. 81) and discriminant analysis (see Section 3.6.4, p. 60) were able to verify the a priori groupings, and in addition, whether other internal structure techniques (see Section 3.7, p. 65) such as multidimensional scaling (see Section 3.9.2, p. 80) were able to show which characteristics of the cracks were responsible for the grouping of paintings. In this chapter I will use some further multivariate techniques to reach a similar aim. In particular, one question is whether the different types of cracks can be used as a noninvasive means of classifying paintings. Moreover, I will investigate which types of crack characteristics play a role in the judgements, and whether these judgements can be used to distinguish between different art-historical categories. The importance of the success of this procedure is that a painting can be assigned to an art-historical group on the basis of its similarities with other paintings. Obviously, it is unrealistic to expect that this can be done unequivocally in all cases.
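As a sketch of the classification step mentioned here, the code below fits a linear discriminant analysis to stand-in ratings and estimates by cross-validation how well the four categories are recovered. The data are randomly generated placeholders, so unlike the real judgements they carry no category signal and the accuracy will hover around chance (.25).

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    # Stand-in for the mean crack ratings: 40 paintings x 7 scales,
    # with ten paintings per art-historical category.
    rng = np.random.default_rng(5)
    X = rng.normal(3, 1, size=(40, 7))
    y = np.repeat(["Italian", "Flemish", "Dutch", "French"], 10)

    # Cross-validated classification accuracy for the four categories.
    lda = LinearDiscriminantAnalysis()
    print(cross_val_score(lda, X, y, cv=5).mean())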

10.3 Data: Craquelure of European paintings

Table 10.1 Rating Scales and Art-Historical Categories

Rating Scales
Type            left (1)     .....    (5) right
1. Network      Connected      –      Broken
2. Network      Ordered        –      Random
3. Direction    Parallel       –      Perpendicular
4. Islands      Square         –      Not Square
5. Cracks       Smooth         –      Jagged
6. Cracks       Straight       –      Curved
7. Thickness    Uniform        –      Secondary

Art-Historical Categories
Origin      Century        Material
Italian     14th, 15th     wood panel
Flemish     15th, 16th     wood panel
Dutch       17th           canvas
French      18th           canvas

Networks = ensemble of cracks; Islands = closed areas delineated by cracks. Scale scores for each scale run from 1 to 5.

Bucklow devised a detailed system to characterise cracks visible in paintings, the details of which are given in his paper (Bucklow, 1997); a summary is shown in Table 10.1. This table also provides brief information on the four art-historical painting categories, which are largely concentrated within a country. Therefore we will refer to categories and countries indiscriminately. These categories represent distinct technical traditions, employing different materials and using different methods of construction. Of 528 paintings belonging to these four categories, photographs were taken of small isolated areas of craquelure which had no painted detail to give stylistic or otherwise informational clues with respect to the painting’s origin; for further details see Bucklow (1998). To give an indication of the different types of craquelure, Figure 10.1 provides a sample of each of the four art-historical categories.

Fig. 10.1 Craquelure examples of the four art-historical categories. Top row: Wood panels (Italian, Flemish); Bottom row: Canvas (Dutch and French).

The dataset used here consisted of ten representative photographs from each of the four categories. These 40 photographs of painting sections were presented to 27 judges (11 experts and 16 novices), who were requested to score them on the seven five-point scales indicated in Table 10.1. As each judge produced a 40 × 7 (paintings × scales) matrix, the data can be visualised as a 40 × 7 × 27 block of data (see Figure 10.2). We will use the term mode for paintings, scales, and judges, and the term level for each separate painting, scale, and judge. Thus, a mode is made up of levels (see Section 3.2.4, p. 48).


Fig. 10.2 Three-way rating-scale data: 27 Judges judging 40 Paintings on 7 Crack characteristics

Data design With respect to the data design we concentrate on an internal structure design for three-way data (Section 3.11.1, p. 89), with an emphasis on the paintings (mode 1), the way they are rated (mode 2), and the possible differences between the judges (mode 3). The rating scales have no particular design, so that there is no structure to be explained; they will be used to describe the differences between the paintings. The paintings have a so-called nested design of materials within countries, i.e. each country typically has only one value for material—for example, Italian paintings were commonly executed on wood panels, but not on canvas. There is a one-way between-subjects design for the judges, i.e. they are either experts or non-trained (see Chapter 4, p. 104; analysis of variance designs). The designs on the modes are present but not operative during the analyses. They are, however, used for interpretation, and one of the aims is to discover whether their existence is reflected in the outcomes. This is in contrast with the study on Marian art in Chapter 9, where the data design was an integral part of the analysis procedure. The art-historical categories in this study are used explicitly to validate the earlier results (Section 10.5.3).

10.4 Analysis methods

Not uncommonly, three-way datasets are first reduced to two-way datasets in order to analyse them with standard statistical techniques (see Figure 10.3). Another option in this case would be to take the average over all judges, and to analyse the resulting 40 × 7 matrix of scale means with an internal structure technique such as principal component analysis; see Section 3.8.1 (p. 66) and Section 10.3 for a full discussion of such reduction techniques.

Fig. 10.3 Reformatting of a three-way data array (left) into a two-way matrix (right)

Here we will analyse the 40 × 7 × 27 block of data by means of three-mode principal component analysis, especially because we have an interest in the individual differences between judges; see Section 3.11.1 (p. 89) for details of the technique. The essence of the technique is as follows: for each of the three modes—paintings, scales, and judges—a dimensional reduction is determined in such a way that there is a component matrix for each. Such a matrix consists of a number of components which may be interpreted as compact representations of the common characteristics of the entities, or levels, of a mode. Thus, a component for the paintings (A) will have high values for those paintings which are judged to share common patterns in their cracks. A component for the crack characteristics (scales) (B) will show high values for those scales which elicit similar judgements on the paintings, and a component for the judges (C) will show high values for those judges who judge the cracks of paintings in a similar way; for further details see Kroonenberg (1983, 2008).

Fig. 10.4 Three-mode principal component analysis. A three-way data block X = (X1, ..., XK) is decomposed into separate sets of components for Paintings (A), Crack characteristics (B), and Judges (C), plus the core array (G) linking the components of the three ways. The numbers of components for the three modes are P, Q, and R, respectively.


There can be more components per mode because different groups of paintings/scales/judges can have common, but different, patterns. The respective numbers of components for the three modes are P, Q, and R. In addition, the analysis will indicate which components of the different modes are linked to each other and which are not. The links are collected in a small three-way array (G), referred to as the core array, of size P × Q × R (see Figure 10.4). The relationships within and between modes will primarily be presented graphically rather than through numbers, as seems proper for data on paintings.
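To make the model concrete, the following minimal sketch computes one component matrix per mode and a core array by means of a higher-order SVD in Python/numpy. It is only an illustration of the model's form under stated assumptions: the random array stands in for the real, preprocessed ratings, and dedicated three-mode software would refine such a starting solution, for example by alternating least squares.

import numpy as np

def unfold(X, mode):
    # Matricise the three-way array: the mode-m unfolding puts the levels of
    # mode m in the rows and combines the other two modes in the columns.
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def hosvd(X, ranks):
    # One component matrix per mode (A: paintings, B: scales, C: judges),
    # taken as the leading left singular vectors of each unfolding.
    factors = [np.linalg.svd(unfold(X, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    # The core array G links the components of the three modes.
    G = X
    for m, U in enumerate(factors):
        G = np.moveaxis(np.tensordot(U.T, np.moveaxis(G, m, 0), axes=1), 0, m)
    return factors, G

# Stand-in data: 40 paintings x 7 scales x 27 judges
X = np.random.default_rng(1).normal(size=(40, 7, 27))
(A, B, C), G = hosvd(X, ranks=(3, 3, 2))    # P = 3, Q = 3, R = 2
print(A.shape, B.shape, C.shape, G.shape)   # (40, 3) (7, 3) (27, 2) (3, 3, 2)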

10.5 Craquelure: Statistical analysis

As in many other chapters we will progress from simple to more complicated aspects of the data, which will, I hope, make it easier to recognise and understand the major properties of the data and the relationships within them. If we are interested in groups of objects, we can either use known groups in the analysis or try to discover whether the objects can be grouped on the basis of their characteristics. In this case study it is useful to find the characteristics for the four known groups of paintings, and subsequently to pretend we do not know beforehand that there are groups and try to recover them on the basis of the characteristics found.

10.5.1 Art-historical categories: Scale means

As always it is useful first to get an overview of the basic information contained in the data. To this end, we have first averaged the scores of the judges to produce a paintings × characteristics matrix, and subsequently averaged the paintings per country of origin, thus producing a country × characteristics matrix showing the mean painting characteristics per country. The results are displayed in Figure 10.5 as a line graph, also referred to as an interaction plot (see Chapter 4, p. 115). There is evidently considerable variability between and within patterns. For instance, the scales random, curved, and not square are rather different between art-historical categories, but the differences are of the same nature for the three scales: they have high scores (≈ 4.5) for French paintings, declining steadily until they reach their lowest value (≈ 1.8) for Flemish paintings. The other scales clearly have very different patterns across the art-historical categories. What cannot be judged from this figure is the differences between judges and the differences between paintings within a country. These differences can in principle be evaluated, but not by graphing only the means; boxplots or line graphs of means with error bars should be employed. However, including the latter would make Figure 10.5 rather messy (Chapter 4, p. 115; line graphs).
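The two averaging steps are easily sketched in code; the array X and the country labels below are stand-ins with the same layout as the real data, not the published ratings.

import numpy as np

# Stand-in data: 40 paintings x 7 scales x 27 judges, ten paintings per country
rng = np.random.default_rng(2)
X = rng.integers(1, 6, size=(40, 7, 27)).astype(float)
country = np.repeat(["Italian", "Flemish", "Dutch", "French"], 10)

scale_means = X.mean(axis=2)                 # average over judges: 40 x 7
country_means = np.vstack([scale_means[country == c].mean(axis=0)
                           for c in ["Italian", "Flemish", "Dutch", "French"]])
# country_means (4 x 7) is the country x characteristics matrix of Figure 10.5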


Fig. 10.5 Means of the crack patterns per country. Italian and Flemish paintings on wood panels; French and Dutch paintings on canvas.

10.5.2 Scales, judges, and paintings: Three-mode component analysis

For the main analysis we will use three-mode component analysis. First we will tackle the modes one by one, and subsequently show how the components of the three modes are linked together.

Similar use of the rating scales

Above I indicated that with three-mode component analysis the correlations between the rating scales can be examined through the rating-scale components, of which the first three are listed in Table 10.2. Together they account for 67% of the variability in the rating scales. It is also possible to present the other component matrices in the same manner, but this will not be done here, to avoid duplication with later analyses that show the primary results of the three-mode analysis.


Table 10.2 The major scale components

Scale (Left = 1 – Right = 5)     Component 1   Component 2   Component 3
Square – Not square                  1.00          .26           .06
Ordered – Random                      .92          .12           .07
Straight – Curved                     .88         −.01           .14
Parallel – Perpendicular              .18          .55          −.54
Uniform – Secondary                  −.22          .57          −.07
Smooth – Jagged                      −.25          .70           .26
Connected – Broken                   −.31          .43           .31
Variance percentage                   41%          19%            7%

The scale pattern we observed for the means is partially reflected in the scale components, which are based on correlations. The first three scales have similar values, as more or less do the last three, and parallel–perpendicular (which has to do with the material on which the paintings were made) has a pattern of its own. In particular, it is the only scale with a large value on the third component, and as such it shows that this scale represents a specific aspect of its own, in addition to its relationship with the last three scales characterising the second component.

Similarities between judges

Only one 'judges' component is really interesting. It accounts for 67.0% of the total variability in the dataset; the second component accounts for a mere 0.4%. The values for the judges on the first component are very similar, ranging from .70 to .93. Thus, their use of the rating scales is largely the same: they all order the paintings on the rating scales in roughly the same way. This may be interpreted as a consensus among the judges on the question of which paintings have which types of cracks. The difference between the judges is primarily a question of degree, i.e. how strongly they express their judgement. The second component shows some differences between two groups of judges, but the values are really small and run from −.17 to .12. The most strongly deviating judges on this component were experts no. 29 (−.17) and no. 30 (−.12) versus expert no. 27 (.12). What is noteworthy is that neither component shows a contrast between experts and novices—on the contrary. Thus, the descriptions of the craquelure based on the rating scales do not seem to depend on whether the judge is an expert or a novice. The low variability accounted for by the second component indicates that there are minor differences between the judges with respect to the general consensus, but that they have nothing to do with levels of art expertise. In the analysis the judges mode is considered the reference mode, and the crack characteristics and the paintings the display modes; see Section 4.13 (p. 115) for an explanation of these terms.


Similarities between paintings

The craquelure characterisations as judged via the rating scales can be described by three components which account for 41%, 19%, and 7% of the variability, respectively, or 67% altogether. The key question in this study was whether it would be possible to retrieve the art-historical categories on the basis of the characterisations of the craquelure. To get an insight into this question it does not help to simply present the scores of the paintings on each of the scales. What is necessary is to produce both the graphs for the components (see Figures 10.7 and 10.8, which we will discuss later) and information from grouping procedures such as discriminant analysis (see Figure 10.9) or possibly cluster analysis (see Section 3.9.3, p. 81, for an example with the present data).

Table 10.3 Core array G = (g_pqr): Links between components of the three ways

                   Judge component 1:           Judge component 2:
                   Consensus (r = 1)            Differences (r = 2)
Scale              Paintings components         Paintings components
components         p = 1    p = 2    p = 3      p = 1    p = 2    p = 3
q = 1               7.23     −.00     −.00       −.30      .35      .13
q = 2               −.01     4.96     −.01        .33      .34      .14
q = 3               −.00     −.02     2.98        .15      .16      .17

Linking components: Core array

As indicated in Section 10.4, a three-mode component model also contains values (parameters) that indicate how the components of the three ways are linked with each other. Here we have three components each for the scales and the paintings, and two for the judges, which means there are 3 × 3 × 2 = 18 links between them, whose values are presented in Table 10.3. It is clear from this table that of the 18 links only three are really important, i.e. sizable: g(1, 1, 1) = 7.23 (41%), g(2, 2, 1) = 4.96 (19%), and g(3, 3, 1) = 2.98 (7%), where g(p, q, r) indicates the size of the link between the p-th component of the paintings, the q-th component of the scales, and the r-th component of the judges. The squares of the values, g²(p, q, r), are the variabilities accounted for by the combination of the three components. The larger values all pertain to the first judge component. Important for the interpretation is that only equally numbered components of the paintings and the scales are linked to each other, for example in g(3, 3, 1). None of the other links is sizable (on average they are about −.01), and the links between the other components do not contribute to the explanation of the patterns in the data.
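As a small arithmetic check on these percentages, note that each squared core value is a sum of squares accounted for; dividing by the total sum of squares of the data (implied here by the 41% of the first link) returns the quoted figures.

import numpy as np

g = np.array([7.23, 4.96, 2.98])           # g(1,1,1), g(2,2,1), g(3,3,1)
total_ss = g[0] ** 2 / 0.41                # total sum of squares implied by 41%
print(np.round(100 * g ** 2 / total_ss))   # [41. 19.  7.]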


It is often easiest to look at the relationships between the levels of the different modes (here, the paintings and the scales), rather than trying to name the components themselves. Note that we have named the first judges' component, i.e. consensus, because the judges all have more or less the same value on that component. This equality signifies that the judges were more or less in agreement on which types of cracks are present on which paintings.

Consensus among judges: Joint biplot

Figures 10.7 and 10.8 show what the judges actually agree on regarding the cracks in the paintings. Together the two graphs depict a three-dimensional joint biplot; for more details about such biplots, see Chapter 4 (p. 106). Because three-dimensional graphs are difficult to read, they are presented here as one graph showing component 1 versus 2, and another showing 1 versus 3. Note that the scale of the first dimension is the same in both graphs, so that the second graph can be envisaged as perpendicular to the first one; in particular, the third dimension is perpendicular to the plane of the first and second dimensions. The biplot has a symmetric scaling, which implies that the relationships between the types of cracks and the paintings can only be interpreted via inner products. Note that the distances between the scales and between the paintings are not properly represented in this graph (see Section 6.4.2 and Chapter 4, p. 113).

Figure 10.6 shows the principle of comparing paintings with respect to a scale, here the straight–curved scale. The arrow for the right-hand adjective of a scale starts in the origin, which is the average of each scale. The blue arrow indicates a characteristic of the cracks, labelled with its right-hand value (here curved). The arrows may be extended to the other side of the origin by equal lengths; the left-end point of the arrow is then the left-hand side of the scale (here straight). The paintings are projected onto the characteristic; the red lines are the projections. The length of a painting's projection on the arrow of the crack characteristic indicates how straight or curved its cracks are. Thus, in Figure 10.6 the French painting f501 has rather curved cracks, the Dutch painting d317 has rather straight cracks, and d37 is average in this respect. The paintings are identified by their country of origin and a number; the painting numbers are only shown on the plots if relevant. Each painting is located at the letter of its label. For another explanation of this process see Chapter 4 (p. 113).

In Figures 10.7 and 10.8 most paintings of a specific art-historical category are contained in a coloured convex hull (for a definition see Section 4.8, p. 110); the Dutch paintings form an exception, as they are widely scattered. Paintings rather far away from their own category are not included in the hull. To evaluate which paintings have which crack characteristics, one should drop a perpendicular line from a painting point onto the relevant arrow (see the demonstration in Figure 10.6).


Fig. 10.6 Inner products of three paintings with the straight-curved scale. The size of the inner product is shown by a dashed red arrow. The origin is the average of all paintings on this scale. The French painting f501 is scored above average because it shows curved cracks rather than straight ones. The situation is reversed for the Dutch painting d317: it is scored below average because of its straight cracks. The Dutch painting d37 is scored about average on the scale, because its cracks are neither very straight nor curved compared to the other paintings. The perpendicular distance of the paintings to the scale is irrelevant for the size of the inner product.
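The reading-off of such projections is nothing more than computing inner products, as the sketch below imitates. The coordinates are made up for illustration and are not read from the actual biplot.

import numpy as np

curved = np.array([0.8, 0.3])                  # arrow for the 'curved' end
paintings = {"f501": np.array([0.9, 0.5]),     # hypothetical coordinates
             "d317": np.array([-0.7, 0.1]),
             "d37": np.array([0.1, -0.3])}
for name, point in paintings.items():
    # positive inner product: curved cracks; negative: straight cracks
    print(name, round(float(point @ curved), 2))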

The graph of the first two dimensions (Figure 10.7) shows that the cracks in French paintings are large, random, curved, and not square. Overall they tend to be smooth (projecting on the opposite end of the jagged arrow) and connected (projecting on the opposite end of broken). An exception is f497, which has more jagged and broken cracks, as indicated by its projections on the positive sides of these arrows. The Flemish and Italian paintings mostly, but not all, have small, ordered, and square cracks (see, however, Figure 10.8). Perpendicular and secondary do not seem to be discriminating features of the paintings in the plane of the first and second dimensions, as they have short arrows in this plane. We were able to draw a line in the graph which separates the Flemish and Italian paintings on wood panels from the French ones painted on canvas. None of the scales manages to single out the Dutch paintings. These are scattered all over the plot, because as a group they do not have discriminating features. They are more or less average on most characteristics, but tend to side more with the Italian than with the Flemish paintings. However, d49 and d340 are very much like French paintings; d328 is also somewhat like the French, but less so. There are a few paintings with unusual craquelure: v264 has cracks that are more random, larger, and more curved than those of the other Flemish paintings. The French painting f508 is different in the sense that its cracks are smaller and less random and curved than those of the other French paintings.

So far we have only looked at the first two dimensions. If we also look at Figure 10.8 and try to form a three-dimensional picture of the complete biplot, we see that the third dimension is determined by the scales perpendicular and secondary, which had short arrows in Figure 10.7 and thus did not contribute to the wood versus canvas distinction. However, they do separate the Flemish from the Italian paintings, which could not be distinguished on the basis of the first two components. The cracks in the French paintings have average scores on these two characteristics.


Fig. 10.7 Consensus joint biplot: Dimensions 1 versus 2 of the display modes. Of the scales only the right-hand markers are shown, except (for demonstration purposes) the straight–curved scale, with the paintings d317, d37, and f501 also portrayed in Figure 10.6. Most of the paintings are indicated only by their country of origin: f = French; d = Dutch; v = Flemish; i = Italian. The convex hulls contain most of the paintings of a country.

Differences among judges

The second judge component accounted for very little variability: 0.4%. As mentioned above, the most deviating judges on this component were expert no. 27 (.12) versus experts no. 29 (−.17) and no. 30 (−.12). As this variability is so small, it is not worthwhile to go into the differences in any detail.

10.5.3 Separation of art-historical categories

To finish the analyses it is worthwhile to investigate how well the components of the paintings manage to statistically discriminate between the art-historical categories.


Fig. 10.8 'Consensus' joint biplot: Dimension 1 versus dimension 3 of the display modes. Of the scales only the right-hand markers are shown. Paintings are indicated by their country of origin: f = French; d = Dutch; v = Flemish; i = Italian. The coloured areas contain most of the paintings of a country.

A standard general linear model technique for this is discriminant analysis, a dependence-design technique, also used in Chapter 18, where the technique is discussed and illustrated in more detail; see also Section 3.6.4 (p. 60). The first two discriminant functions are presented in Figure 10.9, which again shows convex hulls around the paintings from each art-historical tradition (except for outliers) to emphasise the cohesion within the groups and the separation between the groups; see Chapter 4, p. 110. Moreover, clear outliers, in particular those located in another domain, have been identified; for comments on outliers see Section 2.4, p. 40. As in various other studies, outliers are not necessarily a nuisance: they might provide an insight into the world depicted. Alternatively, the fact that a painting is not an outlier while it was expected to be one is also a valuable clue for further investigation.
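A sketch of this step with scikit-learn is given below; the painting component matrix A and the category labels are assumed to come from the three-mode analysis, and the random stand-ins merely make the fragment runnable. The function names are scikit-learn's, not those of the software used in the book.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
A = rng.normal(size=(40, 3))                  # stand-in component scores
category = np.repeat(["Italian", "Flemish", "Dutch", "French"], 10)

lda = LinearDiscriminantAnalysis(n_components=2)
functions = lda.fit_transform(A, category)    # 40 x 2, as plotted in Figure 10.9
print(lda.score(A, category))                 # proportion classified correctly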


Fig. 10.9 Discriminant analysis based on components from three-mode analysis. Misclassified paintings are circled. For a discussion on the Dutch outliers see Section 10.7.

10.6 Other approaches to pictorial stylometry

Assisting in establishing the origin and authorship of paintings is, of course, not the sole domain of statistical techniques. Radiology, chemical analysis, and other techniques using the physical materials with which the paintings were realised are also of the utmost importance. One of the nice things about the statistical craquelure analysis is that it is noninvasive and thus does not have an impact on the painting itself. For the evaluation of the art-historical categories other statistical methods have been used as well, such as cluster analysis and various types of neural networks.

10.7 Content summary

The analysis in this chapter suggests that different art-historical traditions within countries are recognisable from the cracks appearing on the surface of paintings over time. Lay judges and experts alike were able to rate the characteristics of the cracks in such a way that the paintings could be grouped correctly, even without any information about what was depicted.


Crack propagation is due firstly to technical variations in artists' methods and materials, secondly to variation in tensions in particular parts of the paintings, and thirdly to the history of damage. Somebody looking at the cracks in a whole painting would see the context and account for these factors when assigning a painting to a particular art-historical tradition. In this case study, the human judgements of the cracks were based only on pieces of approximately 50 cm². Moreover, the judges, expert or not, had no clue as to what kind of painting they were dealing with. Given the different bases for the judgements it is amazing that the statistical classification is quite accurate, even in the absence of contextual clues.

However, some paintings were grouped in an incorrect art-historical tradition. It was possible to trace some of these incorrect assignments. Atypical usage of materials may, for instance, be due to a Dutch painter learning his trade in Italy, and moving back to Holland only after reaching a certain level of competence. His use of materials may therefore have been more like that of Italian painters than of other Dutch artists, even though he may have been painting Dutch skies rather than Italian hills. Alternatively, the painter may just have been an experimental person who liked to paint with many different materials in many different ways, maybe taught by a visiting Italian master.

In Figure 10.9 we see that the Dutch paintings d49 by Claude de Jongh and d340 by Rembrandt are firmly located within the French group of paintings. The reason is obviously that their cracks resemble those of the French paintings rather than those of the Dutch ones. The technical explanation is that in d49 the pattern includes an area of cracks with typical spirals caused by external impact. Painting d340 contains an area of particularly thick paint, delocalising the tensions from the orthogonal canvas weave more than one might expect in a Dutch painting, so that the cracks take on a more French look. Similar technical aspects related to paint and to tensions from the medium on which the paint was applied can be put forward for the other outliers.

Implicit in this discussion is that an imperfect assignment by a statistical method does not necessarily mean that the method is not working for the data. As illustrated above, the statistical analysis can be a boon for the study itself, pointing to new, hitherto unforeseen aspects of the problem and the data at hand.

Chapter 11

Pictorial similarity: Rock art images across the world

Abstract Main topic. The aim of this chapter is to show how rock art images from all over the world can be analysed so that (dis)similarities between images from different regions can be assessed. Data. A large collection of human rock art images was scored on a number of binary characteristics pertaining both to the person pictured and to accompanying attributes, such as clothes and objects. A large part of the images described are of Australian origin. Research questions. Regional differences were the centre of attention. Specifically, images were compared from: (1) Algeria and the Kimberley (Australia), (2) Zimbabwe, India, and Algeria, (3) three regions of Northern Australia: Arnhem Land, the Kimberley, and Pilbara. Statistical techniques. Binary data, basic descriptive analysis, comparison of proportions, multiple correspondence analysis, categorical principal component analysis, cluster analysis, discriminant analysis. Keywords Rock art · Pictographs · Petroglyphs · Rock paintings · Australia · India · Algeria · Zimbabwe · Kimberley · Pilbara · Arnhem Land · Data inspection · Comparison of proportions · Multiple correspondence analysis · Categorical principal component analysis · Cluster analysis · Discriminant analysis

11.1 Background

People drawing people has been a perennial human occupation, whether on paper, canvas, sand, or rock. Given the durability of rock, ancient drawings on that surface have survived since time immemorial. Many images have been painted directly onto the rock surface (pictographs). Other images were created by removing part of the surface by scratching, pecking, chiselling, etc. (petroglyphs); the latter were sometimes painted as well. Such rock art has been a continuous topic of fascination for descendants of the artists, professionals, amateurs, and students alike.


In the case of rock art in Northern Australia there are also associated political aspects, as the drawings have played a role in Aboriginal land rights and heritage issues (see, for example, McNiven, 2011). The analyses in this chapter were inspired by the work of Michael Barry, whose data form the basis for all analyses presented here.1 He was especially interested in the shape of rock art images from the Kimberley (North Western Australia), which came to be known as Bradshaws, named after the first European to record and describe them (Bradshaw, 1892).2

The study of rock art, whether as paintings, petroglyphs, or otherwise, has many aspects, such as what is depicted, how the images were made, when they were created, their historical and sociological significance, etc. A detailed account of several of these aspects can be found in McDonald (2005) and her references. Since Bradshaw there has been widespread debate about the proper name for the Kimberley images, and many have spoken out in favour of the Aboriginal name, Gwion Gwion images; for details about this debate and the philosophies behind it, see, for example, Barry (1997), Barry and White (2004), Pettigrew (2005), and McNiven (2011).

The topic of examination in this chapter will be rock art depictions of people from around the world, and especially their relations to the Australian images. The emphasis will be on the question whether it is possible to distinguish rock art images from different countries and continents, irrespective of how they were made. In addition, the issue is raised whether rock art images from different areas in Australia are distinguishable among themselves, and from those from other regions in the world.

The analyses are based on Barry's database of over 2000 black-and-white images collected from the literature; it should be noted that Barry relied exclusively on reproductions published by others. Barry's collection consists of images from many countries, but his primary interest was to establish whether the rock art in Australia was specific to that continent or whether it could have come from elsewhere. This chapter will focus on the methodology of analysing part of Barry's data. In part it will be a replication of work presented on his now defunct website,3 but in addition there is an attempt to quantify the differences and similarities between the three principal Australian regions in his database. Barry obtained his results primarily from correspondence analysis (see Section 3.10.1, p. 84), but we will also examine them in various other ways. Given the orientation of this book, attention will be paid to the visual presentation of the results.

1 Michael Barry is the major inspirational source for this case study.
2 In this chapter the word image will always refer to the shape of a complete person or to anthropomorphic images, not to the way they were constructed. For research into other types of images, see, for example, Taçon, Wilson, and Chippindale (1996), Chippindale and Taçon (1998), and McDonald (2005).
3 http://acl.arts.usyd.edu.au/~barry/ca.htm.


11.2 Research questions: Evaluating Rock Art

Notwithstanding the large number of questions and issues related to rock art, in this case study I aim primarily to show how multivariate analyses have been and can be used to shed light on such issues, without claiming that our analyses solve any of the issues themselves. That is clearly the domain of the rock art specialists and the geologists who research the relative age of the images and their chronology; see, for example, McDonald (2005), Donaldson (2012), Travers and Ross (2016), and their references. Therefore, given our database, we can only deal here with three specific research questions based exclusively on the shapes of the images themselves, without paying attention to how they were made, for instance whether they were painted onto or scratched into the rocks. Of course this limits the generalisability of the results, but it does provide a pure comparison of the precise shapes of the images.

11.2.1 The Kimberley versus Algerian images

The first question is whether the images from the Kimberley and those from the Tassili n'Ajjer in Algeria are so different that they can be clearly separated from each other. Of special interest is the question whether an arbitrary image from one region could reasonably have come from the other region as well. This question will be treated first, because we can approach it with basic statistical methods, whereas the other two need to be tackled with more complex multivariate ones.

11.2.2 The Zimbabwean, Indian, and Algerian images

The second question is whether differences can be demonstrated between rock art from three non-Australian countries, in particular Zimbabwe (Southern Africa), India (Asian subcontinent), and Algeria (Maghreb). The purpose is to assess whether the data are sufficiently rich for the characteristics to discriminate between far-apart regions.

11.2.3 The Kimberley, Arnhem Land, and Pilbara images

The third question is to what extent, and on the basis of what kind of attributes, the rock art from the three Northern Australian regions can be shown to be similar or different. The leading question in this case is whether the size and nature of the differences between the Kimberley, Arnhem Land, and Pilbara images are comparable to the differences between those from three totally different and non-adjacent regions in the world.


Here it is especially relevant that Barry's images are black-and-white profiles, whereas many original pictographs and petroglyphs were coloured; see the quote below from Barry (1997, pp. 7–8).

11.2.4 General considerations

Both the analyses and the interpretations will be carried out at a rather general level, and it will become clear that much more detailed analyses could have been done, and have already been done, though not necessarily in the same manner as presented here. In fact, there are two contrasting questions. Given a region, we can describe what kind of images are typical for that region; conversely, given a specific image, we can ask whether it could be allocated to a particular region with a reasonable amount of certainty. Our interpretational efforts are a mixture of both, but certainly not exhaustive. Inevitably, there will be similarities between individual images from different regions. After all, the human form is standard across the regions: all people have a body, two legs, two arms, and a head, and (most of the time) all wear some type of clothing. Within this context there are many varieties, especially with respect to completeness, stance, clothing, accoutrements, etc. Notwithstanding this, human rock art images have an a priori chance of looking alike across the world, especially when they are rendered in black-and-white.

11.3 Data: Characteristics of Barry's rock art images

The data for this chapter were kindly made available by Michael Barry, Sydney, Australia; he has given permission to use his data for this book (8/10/2019). The database consists of 2230 images (samples) from 21 countries, subdivided into 116 subgroups.4 The images were scored on 79 binary variables, each indicating whether an image possessed the attribute (= 1) or not (= 0); this pertains to characteristics of the persons themselves as well as to their accoutrements. A list of attributes is provided in Table 11.1; the full descriptions are contained in his thesis (Barry, 1997). It is not clear to me whether Barry coded the images from available photos or from redrawn images; his thesis suggests the latter. Barry made it quite clear in his thesis (pp. 7–8) which aspects he did and did not include in his database:

4 Barry collected these images from available publications, and included them in the second part of his thesis; for details of the original publications, see his thesis (Barry, 1997, pp. 195–208). Barry's website no longer exists, but I copied the images from his site at the time (see Figure 11.1). They are a subset of the entire dataset. The complete dataset and Barry's two-part thesis can be obtained from the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS): https://aiatsis.gov.au/.


[It is an issue] whether a distinction should be made between engravings [petroglyphs], carvings, stencils, prints and paintings [pictographs] or any combinations of these. Selection may have been a matter of choice by the artists, or it could have been imposed on the artists by their culture and so should be included. Possibly no choice was involved and the landscape was the deciding factor [..]. No distinction has been made on these grounds in this analysis.

He furthermore did not consider the colour, size, or age of a figure, and used the picture as imaged, in particular as contour drawings, without conjectures about parts having weathered away, etc. There is obvious space for debate about these choices, but that is not for me to consider. I have taken the database as it is, without having seen the original drawings, much like Barry himself, who did not travel across the world to see all 2230 images in situ. This clarification of the nature of the data is essential for what follows, because if in one region only pictographs were made and in another only petroglyphs, viewing the original images would make it abundantly clear where they originated. What remains to be determined is whether the images themselves were different in the two areas, and it is this aspect which is at the core of Barry's investigations and, given his database, will be at the core of this chapter. Similar problems arise for mediæval paintings, some of which were painted on canvas and others on wood panels. In that case it is also an issue whether the paintings are different because of the surface they were painted on (see Chapter 10, p. 211).

We are dealing here with an images × binary variables data matrix. The analysis is approached via an internal structure design (Section 3.7, p. 65), as there are no specific dependent variables. Nevertheless, we will also employ a dependence-design technique (Section 3.6, p. 56), in particular discriminant analysis (Section 3.6.4, p. 60), to see whether the outcomes of a component analysis can predict a two-group response variable.

From his database Barry created a subset, the Sampled dataset, by randomly sampling 25 images each from Zimbabwe, India, and Algeria (see Figure 11.1 for the individual images); these figures are reproduced here from Barry's website. Especially for the analyses in this chapter a second dataset was extracted from the main database, consisting of three Australian regions: Arnhem Land (North-East Northern Territory), the Kimberley (North Western Australia), and Pilbara (North-West Western Australia)—see Figure 11.5 for their locations. This dataset will be referred to as the Northern Australia dataset. To investigate the similarities between the Algerian images and the Kimberley ones, a further dataset was created, the Algeria-Kimberley dataset, because some authors have hypothesised that the Kimberley images were made by people who immigrated from elsewhere, rather than by forebears of the present Aboriginals. We will first turn to the Algeria-Kimberley dataset, then continue with the Sampled dataset, and end with the Northern Australia dataset. The order is largely determined by the levels of analytic complexity.


Fig. 11.1 Sampled dataset: Rock images from Algeria, India and Zimbabwe


Table 11.1 Barry's descriptive terms for shapes, accoutrements, etc.

Front facing, Standing, Infill, S-form, Large disc, Feather
Right facing, Sitting, Outline, No head, Headdress, Fan
Left facing, Lying, Stick figure, Stick head, Apron small, Mirror
Other right, Splits, Enhanced stick, Twisted, Apron large, Hat
Other left, Other, Free form, In proportion, Skirt, Object
Other facing, Active, Contorted, Bow legs, Boots, Bag
Sexed, Swollen, Pointed feet, Waist band, Ring, Bow
Decoration, Axe, Shaped, Pronounced bottom, Tri-tassel, String
Animal, Sword, Realistic, Bird head, Tri-sash, Rope
Leaf, Eye, Rectangular, Animal head, Nose bone, Basket
Fish, Boomerang, Triangular, Trophy head, Wing, Food
Baby, Stick, Hook, Musical Instrument, Knife, Shield
Spear thrower, Club, Vessel, Spear multipoint, Spear barbed, Spear pointed
Beehive headdress

The detailed content descriptions for the 79 attributes can be found in Barry (1997, pp. 37–43). This list is intended to show the amount of detail with which the images were scored, and the work that must have been involved in scoring all 2230 images. In the text the attribute names are shown in small capitals.

11.4 Analysis methods

In this section we will discuss the analytical techniques that were used to analyse Barry's rock art data. Part of the explanation is already contained in Chapter 3, but here we will touch on some issues which are particular to, and specifically relevant for, the data discussed in this chapter. In this case study we will make extensive use of internal structure techniques based upon principal component analysis, and only take recourse to a dependence-design technique, discriminant analysis, in support of the former.

11.4.1 Comparison of proportions

The Algeria-Kimberley dataset was first analysed via comparisons of two proportions. Specifically, for each attribute it was determined whether the proportions for Algeria and the Kimberley were the same. The proportions could also be considered means, because for a binary attribute with the values 0 and 1 the mean is equal to the proportion of occurrence of the attribute; thus, a mean of 0.5 indicates that half the images possess the attribute. If on a particular attribute the proportions for Algeria and the Kimberley are nearly equal, the attribute cannot be used for establishing differences between the regions. However, if an attribute, say a boomerang, occurs in only one of the regions, the attribute is characteristic of that region's images, although not all images need to have it.


To select the distinguishing attributes we may informally carry out chi-squared tests for the difference of two proportions, but we should remain aware that differences, or the lack of them, may also occur due to pure chance, especially when there are a great many attributes and the differences are not large. It should therefore be remarked that in choosing attributes one must take care not to become a victim of cherry picking, i.e. choosing only those items which confirm or negate an a priori opinion.5 In Barry's data this is particularly relevant, because there is probably a large element of random variability. Admittedly, in this case study it is quite possible that I have fallen into such a pit, but fortunately my aim is to show statistical analyses and not to claim earth-shaking new results in the field of rock art. The knowledgeable reader should take this danger of cherry picking into consideration when choosing differentiating attributes to evaluate outcomes.

Another approach to the same question is to perform a cluster analysis appropriate for binary variables (Section 3.9.3, p. 81). In this case, we use a centroid-linkage hierarchical cluster analysis based on the squared Euclidean distance between cluster centres (see also Chapter 4, p. 113). This type of cluster analysis is one of the methods appropriate for binary variables; see Everitt et al. (2011) and Šulc and Řezanková (2019). A similar technique was used by McDonald (2005, pp. 130–134) in her research on Australian archaic faces. Searching for the optimal cluster technique is not a straightforward procedure, and sometimes it is best to test several methods and evaluate their stability for the dataset at hand. However, that would take us too far afield, and a relatively common procedure was chosen. It is of course not a good idea to keep on searching for a cluster analysis whose results best fit your preconceived ideas.
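For a single attribute the informal test looks as follows; the counts are hypothetical and chosen only to show the mechanics.

import numpy as np
from scipy.stats import chi2_contingency

# 2 x 2 table: rows = region, columns = attribute present / absent
table = np.array([[170, 36],      # Algeria (hypothetical counts)
                  [8, 191]])      # Kimberley
X2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"Pearson X2 = {X2:.1f}, df = {dof}, p = {p:.3g}")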

11.4.2 Principal component analyses for binary variables

The Sampled dataset of 75 images was analysed by means of categorical principal component analysis (see Section 3.6.7, p. 64), and the Northern Australia dataset via multiple correspondence analysis (see Section 3.10.2, p. 87). For binary variables these two techniques are essentially the same, but they are rather different for variables with different measurement levels (Section 1.5.2, p. 11). Because the output for the two techniques differs somewhat in places (see Section 3.10.3, p. 88), we will use them for different subsets of the rock art data. For the Sampled dataset a special route was followed, the details of which will be explained together with the analysis. In essence it comes down to executing a series of multiple correspondence analyses in which nondiscriminating attributes are successively deleted from the dataset, so that we end up with one set of attributes which cannot, and another set which can, discriminate between regions. It is for rock art specialists to decide which of the two sets of attributes is relevant for their studies. The Northern Australia dataset was analysed only once, with a categorical principal component analysis (catpca). The aim here was to concentrate on the relationships between the attributes and the images by means of a biplot (see Chapter 4, p. 106).

5 https://en.wikipedia.org/wiki/Cherry_picking.
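For binary attributes the common core of both techniques is a singular value decomposition of a suitably weighted indicator matrix. The sketch below implements that correspondence-analysis computation in plain numpy; it illustrates the underlying algebra only and is not the software actually used for the analyses.

import numpy as np

def ca_coordinates(Z, n_dims=2):
    # Correspondence analysis of a 0/1 indicator matrix Z (images x attributes)
    P = Z / Z.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = U[:, :n_dims] * sv[:n_dims] / np.sqrt(r)[:, None]    # image points
    cols = Vt[:n_dims].T * sv[:n_dims] / np.sqrt(c)[:, None]    # attribute points
    return rows, cols, sv[:n_dims] ** 2                         # principal inertias

Z = np.random.default_rng(4).integers(0, 2, size=(75, 79))      # stand-in data
images, attributes, inertia = ca_coordinates(Z)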

11.5 Rock art: Statistical analysis

Before starting the analysis in earnest, it is worth performing the simplest of checks on the data, i.e. inspecting the ranges of the variables. In the present case, all attribute scores should be either 0 (absent) or 1 (present). Unfortunately, when handling these data I got caught out myself: after umpteen analyses it turned out that there were two errors in the original dataset which would have been picked up by checking the ranges of the variables. There was a score of 10 and a score of 11, clearly coding errors. They were reset to 0 and 1, respectively, based on the scores of similar images.6
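The check itself is a one-liner; below is a sketch with stand-in names and data, in which one error like the real ones is planted deliberately.

import numpy as np

rng = np.random.default_rng(5)
data = rng.integers(0, 2, size=(2230, 79))     # stand-in attribute matrix
data[671, 27] = 11                             # plant a coding error
bad = np.argwhere((data != 0) & (data != 1))
for row, col in bad:
    print(f"case {row}: attribute {col} has illegal value {data[row, col]}")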

11.5.1 Comparing rock art from Algeria and from the Kimberley

One of the hot issues in Australian rock art studies is whether the images were made by forebears of the present Aboriginals or by migrants from other continents. In particular, a case has been made for immigrants of Algerian origin, as some of their rock art examples are rather similar to the Australian ones. A way to use Barry's dataset to acquire information about such a hypothesis is to first find attributes which could shed light on the issue. The starting point was that it is better to look at the dissimilarities than at the similarities; in other words, to use those attributes which show the largest differences between images from the two regions. Only if no serious differences can be found in this manner does the hypothesis become likely. The substantive argument is that it would be slightly bizarre if painters from Algeria changed their ways and dropped their style and habits of painting when arriving in a new country. In particular, dropping the habit of making realistic paintings in favour of stick-like figures seems very strange indeed.

The first step was to compute the attribute proportions for each region and compare them to identify those attributes for which the differences between the regions were the largest. The differences in proportions can be tested by first making a 2 × 2 contingency table for each attribute, with the variables region and attribute present versus absent. As we are only dealing with 2 × 2 tables, all tests based on the Pearson X² have one degree of freedom, so that the X² values are directly comparable.7

case these data will be publicly available elsewhere: the errors were observed for case 672 (Kimberley - MV; In proportion: 11 ⇒ 1) and case 2049 (Zimbabwe - EI; Pronounced bottom: 10 ⇒ 0). 7 The Pearson X 2 test is often called the χ 2 test but the χ 2 distribution is only an approximation to the exact distribution of the Pearson X 2 test statistic. Thus χ 2 is really a misnomer.

236

11 Rock art images

attributes there were ten with a Pearson X 2 value of 50 or larger, of which five were larger than 100.8 These five attributes were further analysed and are presented in Table 11.2. The attribute Boomerang was added, given its clear Australian origin; its Pearson X 2 value was 71, the largest one below 100. Next, a two-cluster solution using the nine most frequent attributes allocated all 206 Algerian images to the same cluster, but 37 of the Kimberley images were also assigned to this cluster. All remaining Kimberley images made up the second cluster, with no Algerian images joining them. Thus, 37 out of 199 Kimberley images (=19%) were ‘incorrectly’ allocated to the Algerian cluster. A further and deeper series of analyses is necessary to sort out the origin of these ’errors’, and to find out which attributes are particularly responsible for the outcome. To investigate the images in more detail we selected the six most frequently occurring attributes. With these attributes we defined a profile for each image; listed in Table 11.2. An attribute was coded as 1 if it was present and a 0 if it was not. Thus a profile of (000000) indicates that an image did not possess any of the six attributes. This was the case for 21 images. The profile of (111111) indicates that an images possesses all six attributes, but this did not occur in this dataset. For further details see the legend to Table 11.2. Inspection of Table11.2 shows that a general characteristic of the Tassili Algerian human images is that they have a realistic shape and are in proportion, with only a few exceptions, whereas in the sample there are no such shapes in the Kimberley. On the other hand the Kimberley images are almost always accompanied by one or more of the attributes: Enhanced stick, Beehive headdress, Waistband, Boomerang. The top three binary profiles in Table 11.2 (100000, 110000, 000000) occur in both regions: in total 50+55+27 = 132 images of the 405 (= 33%). Thus, about a third of the images cannot unequivocally be assigned to either region on the basis of the six attributes. However if we assign them to the region which has the highest frequency for the binary profile, there would be only 28 errors of the 405 images (14 [to K instead of A: 10000];(8+6) [to A instead of K: 110000 and 000000]), or 7%. This result could be interpreted in two opposite ways. (1) Given an image, a third of them could have come from either region; this cannot be accidental; (2) two-thirds of the images can uniquely be assigned to one of the regions, so they cannot have a common ancestry, especially because 171 Algerian images are realistic and/or in proportion (83%), whereas just 8 of the 199 Kimberley images are (4%). Images this different cannot come from the same origin, or at least we need additional information. On the basis of these results it does not seem a very likely proposition to me as a nonexpert, that the Algerian painters are the godfathers of the Australian ones in the Kimberley, but there is clearly space for debate on the interpretation of the outcomes. For the proposition to be true the necessary chronology and a coherent story about how the Algerian painters would have made the journey is necessary 8 If

there is no relationship between a region and the presence of absence of an attribute, i.e. if they are independent, the Pearson X 2 value will generally be about three or less (see Section 4.25, p. 128). The higher the X 2 the more the attribute is present in one region and not in the other.

11.5 Rock art: Statistical analysis

237

Table 11.2 Occurrence of distinguishing attributes in Algeria (A.) and the Kimberley (K.) Profile A. K. Total Description Algeria and Kimberley - profiles present in both regions 36 50 Enhanced stick only 100000 14 110000 47 8 55 Enhanced stick + In prop. 6 27 None of the distinguishing attributes 000000 21 Algeria - common; profiles not present in the Kimberley 0 115 In prop. + Realistic shape 011000 115 Algeria - rare ≤ 5%; profiles not present in the Kimberley 010000 8 0 8 In prop. 0 1 Realistic shape 001000 1 Kimberley - common; profiles not present in Algeria 10 10 Enhanced stick + Boomerang 100001 0 100010 0 17 17 Enhanced stick + Waistband 15 15 Enhanced stick + Beehive h.d. 100100 0 100110 0 38 38 Enhanced stick + Beehive h.d. + Boomerang 24 24 Enhanced stick + Beehive h.d. + Waistband + 100111 0 Boomerang 110110 0 13 13 Enhanced stick + In prop. + Beehive h.d. + Waistband Kimberley - rare ≤ 5%; profiles not present in Algeria 010001 0 1 1 In prop. + Boomerang 6 6 Enhanced stick + Boomerang 100011 0 100101 0 8 8 Enhanced stick + Beehive h.d. + Boomerang 4 4 Enhanced stick + In prop. + Beehive h.d. 110100 0 110001 0 3 3 Enhanced stick + In prop. + Boomerang 3 3 Enhanced stick + In prop. + Waistband 110010 0 110011 0 2 2 Enhanced stick + In prop. + Waistband + Boomerang 2 2 Enhanced stick + In prop. + Beehive h.d. + 110101 0 Boomerang 110111 0 3 3 Enhanced stick + In prop. + Beehive h.d. + Waistband +Boomerang Total 206 199 405 Total number of images per region ∗ Note profile order: Enhanced stick, In prop(portion), Realistic shape, Beehive headdress (h.d.), Waistband, Boomerang ∗ Some other attributes occasionally depicted in the Kimberley but not in Algeria are: Apron small, Tri- tassel, nonbeehive Headdress ∗ Note that there are almost the same number of images in both regions: 206 and 199.

238

11 Rock art images

Table 11.3 Sampled data set (Algeria, Zimbabwe, India): Attribute discrimination measures for an 8-attribute solution (left) and a 13-attribute solution (right).

Discrimination measures ≥ .30 in red, including the 0.17 of P RONOUNCED BOTTOM (left).

as well. In addition, we at least have to have chronological information about the available images; see McDonald (2005) and her references.

11.5.2 Comparing rock art from Zimbabwe, India, and Algeria As indicated above, the Sampled dataset consists of 75 images, 25 each from Algeria, India, and Zimbabwe (see Figure 11.1, p. 232). Of the possible analyses, we decided to first perform a categorical principal component analysis on all 79 attributes. After pruning the nondiscriminatory attributes we selected both the 8- and the 13-attribute solutions listed in Table 11.3 for further analysis. It may seem striking that we had to include Pronounced Bottom to realise an adequate separation of the Zimbabwean images, as only 20% of the variability of Pronounced Bottom was accounted for by the first two components. The effectiveness of this variable is due to the fact that if an image has a pronounced bottom it is Zimbabwean, irrespective of its scores on the other attributes. In our sample, half of the Zimbabwean images had this attribute. Another point worth noting is that with fewer attributes the variation accounted for by the two components increased by 10%, suggesting that several of the attributes not included introduced more error than systematic information, thus blurring distinctiveness. Again we should be careful here not to cherry pick. It is clearly for the expert

11.5 Rock art: Statistical analysis

239

to decide on the proper interpretation, statistics provides the numeric underpinning but not necessarily the interpretational framework. To understand the quality of the separation and how the attributes are specific to the three countries, we should look at the biplot of the attributes and the images (for biplots see Chapter 4, p. 106). To demonstrate the difference between countries using only few attributes compared to the complete set of 79 attributes, two biplots are presented in Figures 11.2 and 11.3. As a graph Figure 11.3 is rather less clear-cut and the locations of images are less accurately presented because the images replace several actual points. When we include all 79 attributes, Figure 11.3 shows several outliers, whereas they are not present in Figure 11.2, based on only eight attributes. This is an aspect of the technique used which pushes images with (fairly) unique attributes, for example, lying horizontally, to the boundary of the plot. Thus, points on the boundary of a plot do not necessarily indicate that they are influential; it is their unique patterns which places them apart. On the other hand, the separation between the countries seems somewhat more evident in Figure 11.3 based on all 79 attributes. The reason for this is that some of the discriminating attributes are very specific to a country, just as Pronounced bottom for the Zimbabwean images. This leads to the consideration that a further detailed selection of attributes may be worthwhile. Note, however, that this might end up as searching for an a priori conclusion at the cost of a more objective one. The two biplots show that a researcher has to make choices about which aspects of an analysis should be emphasised when presenting the results, for instance, with emphasis on statistical details Figure 11.2 or with pictorial impact, Figure 11.3. In a conference the latter may be more entertaining and dramatic. Using the 8-attribute version (Figure 11.2) we seem to get a good separation between the three countries, suggesting that each region has an idiosyncratic style of painting, with its own distinguishing characteristics indicated by the solid lines: Zimbabwe - Enhanced stick- like, Pronounced bottom; India - [Unrealistically] Shape[d], Headdress and Spear barbed; Algeria - Realistic shape, Hat, Right facing. The visually ’nice’ separation is unfortunately not as well-founded as it seems. When we perform a discriminant analysis (see Section 3.6.4, p. 60) with the variable Origin as the response variable and the two components from the component analysis as predictors, it turns out that the Zimbabwean images are all correctly allocated to their group, but about half the Algerian images and half the Indian images are classified as Zimbabwean as well. The reason for the incorrect impression is that what looks like a single point in Figure 11.2 often consists of a large number of images from all three groups, all having exactly the same scores on the variables, and thus having the same coordinates in the plot. To illustrate this, the component scores have been ’jittered’, i.e. for each point the component co-ordinates have been slightly moved by adding a tiny bit of random error sampled from a normal distribution with a mean of zero and a small standard deviation; see also the Glossary. Via this procedure the images with the same scores are spread out somewhat. To illustrate this, the crowded part

240

11 Rock art images

Fig. 11.2 Biplot of the 75 images from Zimbabwe (Z), India (I), and Algeria (A) and the eight most discriminating attributes

Fig. 11.3 Biplot: 79 attributes and 75 images from Zimbabwe, India, and Algeria. Wherever possible, images are shown by pictographs; alternatively they are represented by red squares. The attributes are portrayed in a separate graph; the eight most discriminating ones are drawn in thick red solid lines.

11.5 Rock art: Statistical analysis

241

Fig. 11.4 ’Jittered’ samples from Zimbabwe (Black squares), India (Green diamonds), and Algeria (Red dots); South-West corner of Figure 11.2.

in the bottom left corner in Figure 11.2 is shown after jittering as Figure 11.4. The results of the discriminant analysis are now understandable, as virtually all points are now shown individually. This emphasises that a simplistic visual inspection may be misleading. A picture may be worth a thousand words, but it should be properly constructed and interpreted. Pictures do not always speak for themselves, they often need assistance from someone in the know. An alternative for the 8-attribute solution would be to allow for more attributes, because, for instance, the 13-attribute solution has around 85% correct allocations for Algeria and Zimbabwe, but still only 44% for India. The overall conclusion from the comparison of the three countries is that there are clear differences between the types of rock art images included in the Sampled dataset, but that there are also types of images which are numerically and visually difficult to allocate unequivocally to one country or another. Moreover, we have not included in the analysis the attributes which do not show differences between the countries.

11.5.3 Comparing rock art images from within Australia

The question tackled in this section is how much rock art images differ between three (more or less adjacent) regions in Northern Australia. The relevance of the question is that, if the images of people are similar, they could have a common ancestry or could have influenced each other over time.



Fig. 11.5 Northern Australia: Arnhem Land, the Kimberley, and Pilbara.

For the within-Australia comparisons we have used the Northern Australia dataset, containing only Northern Australian images, i.e. 315 from Arnhem Land, 199 from the Kimberley, and 21 from Pilbara; for the approximate locations of the regions see Figure 11.5. As before, I decided to reduce the 79 attributes to those which really count, so that the nondiscriminating attributes do not add random noise, obscure the separation between regions, and hinder the interpretation. Note that here, too, we are assuming that differences say more than similarities. In Figure 11.6 the discrimination measures for the nine- and six-attribute solutions are shown. Section 3.10.3 (p. 88) presents more details about technical aspects of the discrimination measures. In particular, it is pointed out there that the lengths of the arrows in the plots represent the size of the discrimination measure for each attribute, and thus indicate how much of an attribute's variability is accounted for by the components. A graph with all 79 attributes would be too messy to be of much use. The discrimination measures for the six-attribute solution are listed in Table 11.4.

Table 11.4 Discrimination measures for the six-attribute solution

Attribute                  Dimension 1   Dimension 2
Stick figure                   .91           .02
Infilled figure                .91           .04
Enhanced stick figure          .89           .05
Standing                       .06           .79
Other position                 .10           .67
Other facing                   .01           .54
Region                         .25           .00
% of Variance                48.14         35.32

Region is a supplementary variable (see text).

Table 11.4 also includes the attribute Region as a supplementary variable. Such a variable does not play a role in the analysis itself, but its discrimination measure is calculated afterwards to find out how highly this supplementary attribute is correlated with the components (see also Chapter 4, p. 127). What is striking is that the size of the discrimination measure for Region is low (.25) for the first component, and virtually zero for the second component (≤ .002). Thus, while there are differences between the images with respect to Standing, Other position, and Other facing, these cannot be explained to any substantial degree by the region in which they are located. There is some differentiation between the regions with respect to Stick figure, Infilled figure, and Enhanced stick figure. This is further illustrated by the biplot of attributes and images (Figure 11.7); for more information on biplots see Chapter 4, p. 106. Incidentally, the attribute graph may be superimposed on the images graph such that their origins and axes coincide. The two graphs are presented separately here to make them more easily readable.
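For a binary attribute, the discrimination measure on a dimension is, in essence, the squared correlation between the 0/1 attribute and the component scores (see Section 3.10.3); the sketch below uses that simplification. The arrays are hypothetical stand-ins, not the book's data; a multi-category supplementary variable such as Region would use the correlation ratio instead.

```python
# Sketch of a discrimination measure for a binary attribute: the squared
# correlation between the 0/1 attribute and each component (hypothetical data).
import numpy as np

rng = np.random.default_rng(1)
attribute = rng.integers(0, 2, size=535)   # presence/absence of one attribute
components = rng.normal(size=(535, 2))     # scores on two components

disc = [np.corrcoef(attribute, components[:, d])[0, 1] ** 2 for d in range(2)]
print(disc)  # one value per dimension, comparable to the entries of Table 11.4
```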

Fig. 11.6 Northern Australia dataset. Discrimination measures of the nine-attribute and six-attribute solutions, respectively.

The impression from Figure 11.7 is that of a neatly ordered grid in which not all 64 possible points are present. This impression is correct, because six binary values can form 2^6 = 64 binary profiles of zeroes and ones, such as (111000) and (000111), where a 1 indicates presence and a 0 absence of an attribute for the image (see also Table 11.2, p. 237). Points of the grid are absent because the corresponding profiles did not occur in the dataset. In Chapter 16 (p. 313) on reading preferences another, similar example with binary profiles is provided. It should be emphasised that there are 323 (60%) images in the bottom row of Figure 11.7. Of all images, 69% do not have a profile that occurs exclusively in one region; therefore, most points in the plot represent images from more than one region. Only 31% of the images have an attribute profile unique to one region, so that roughly one third of the images can be unequivocally assigned to a particular region.
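The profile bookkeeping behind this figure is simple to reproduce; the sketch below counts how many of the 64 possible profiles actually occur. The matrix data is a random stand-in for the 535 × 6 attribute matrix, not the Northern Australia dataset itself.

```python
# Sketch of the profile count: six binary attributes allow 2**6 = 64 distinct
# profiles, but only some occur; repeated profiles share one point in Figure 11.7.
from collections import Counter
import numpy as np

rng = np.random.default_rng(2)
data = rng.integers(0, 2, size=(535, 6))   # hypothetical 0/1 attribute matrix

profiles = Counter(tuple(row) for row in data)
print(len(profiles), "of the 64 possible profiles occur")
print(profiles.most_common(3))             # the most crowded grid points
```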



Fig. 11.7 Biplot. 6 attributes and 535 images: Kimberley, Arnhem Land, Pilbara. The two graphs may be superimposed by matching their coordinate axes. The numbers next to the points indicate how many images per region have the same profiles.

Given the extent of overlap we cannot jitter the points as we did above, because the points represent a range from 1 to 153 images with the same profiles; the number of overlapping images is indicated in the image part of the biplot. From the attribute part on the left of Figure 11.7 we see that the horizontal dimension is dominated by the contrast of the attribute Stick figure versus the attributes Infilled figure and Enhanced stick figure, whereas the vertical dimension represents the contrast between Other position and Other facing versus Standing. This was already shown in the left-hand panel of Figure 11.6. Inspection of the binary profiles shows that all 323 images in the bottom row of Figure 11.7 are standing and do not possess the attributes Other facing or Other position, whereas for the 27 images at the top, all but one from the Kimberley, the situation is the reverse. Of Pilbara's 21 images there are only 3 unique ones, and all other patterns also occur in the other two regions. On the basis of the six attributes which explain the most variability in the Northern Australia dataset, it seems that it will not be possible to allocate an arbitrary black-and-white image to only one of the regions with great certainty.

A similar result, details of which are not shown, occurs when we first select the attributes with the largest differences in proportions between the regions, as we did in the Algeria versus Kimberley case, and then perform another multiple correspondence analysis (see Section 3.10.2, p. 87) using only the selected attributes: there is still considerable regional overlap between the images. This is true regardless of whether all three regions are included in the analyses or only Arnhem Land and the Kimberley.

The conclusion for the Australian images is that there are images which occur uniquely in one of the regions, but overall the regional images show considerable overlap in their attributes. A detailed discussion of various explanations for this observation can be found in McDonald (2005). At the same time it should be mentioned that the Pilbara images are primarily petroglyphs, whereas in the other two regions they are pictographs (see also the discussion in Section 11.3, p. 230). As mentioned above, in constructing his database Barry explicitly did not consider the construction of the images, but concentrated on their shapes and attributes.

11.5.4 Further analytical considerations

A final methodological word about this chapter and its topic. Below is a quote from an unpublished web page by Jack Pettigrew.9 It is supplied in full because his untimely death on 6 May 2019 makes it unlikely that the text will appear in print, and it is highly relevant for the evaluation of the results described above. Also of interest is his view on the quantification and statistical analysis of the rock art images, with which not everyone, myself included, will necessarily agree.

An often-quoted study by Michael Barry purports to show a link between rock art in Arnhem Land and Bradshaw rock art (in the Kimberley). I am highly critical of this study [...]. The first problem is that it over-estimates the abilities of computer pattern recognition to do the job, compared to the human visual system. Great advances in computer pattern recognition have been made, but the human visual system is still superior in general tasks like the identification of art styles. There are many studies that support this point, but crucial refutation comes from within the study itself, whose own data can be used to show that the computer algorithm fails to recognise obvious similarities while at the same time making a poorly-supported claim of similarity between Arnhem Land and Kimberley art. One example of this failing is contained in the study's own illustrations showing identical icons of a "three-point sash" in Bradshaw and Tassili (N. African) art that are not recognised by the computer algorithm used. On the other hand, one's own visual system is completely floored by the computer's conclusion of similarity between the obviously dissimilar samples of Arnhem Land and Bradshaw art that are chosen for illustration. The study also fails to recognise or deal with the extremely fine lines (a fraction of a mm across) found in Bradshaw rock art. It is a mystery how long fine lines a few hundred micra across can be produced on that abrasive granular rock, just as comparable "hair's breadth" lines have mystified students of African rock art. There is also another severe problem, one that afflicts many other attempts at generalisation about Bradshaw art. This problem is the dramatic evolution that this art has undergone over many millennia, with a high risk that the earlier, complex ("Classic Realistic" and "Classic Abstract") art will be overlooked because of its rarity. In some formulations, there is later contact, and perhaps even exchange, of the Simple Bradshaw art with Aboriginal culture, emphasising the importance of timing in trying to answer these questions. The timing aspect is completely absent from Barry's work, whose conclusions may be relevant, if at all, only to the latest era of the Bradshaw culture just before its demise. The study has impressive numbers, no doubt because of the labour-saving assistance of the computer. However, the analysis falls down completely in many individual cases which are illustrated. (Pettigrew, 2005)

9 J. D. Pettigrew, Queensland Brain Institute, University of Queensland, What's Wrong with Calling it Gwion Gwion? http://www.uq.edu.au/nuq/jack/Gwion.docx; accessed 15 June 2020.

Further discussion of such issues can be found in the thesis by Rainsbury (2009) and the paper by McDonald (2005). The primary aim of the exercise in this case study was to show how data such as Barry's may be handled. There are several methods which are useful in achieving an overview of what is similar and what is different in the images. In particular, we used basic data inspection to detect any rogue data points, compared proportions of presence or absence of attributes, and assessed their relevance via discrimination measures in the search for informative attributes. Other multivariate analyses included multiple correspondence analysis and categorical principal component analysis for binary variables, used to acquire an overview of the attribute patterns and the associated structure of the images.

11.6 Other approaches to analysing rock art images

The analyses in this chapter leant very much on techniques for binary variables. An alternative could have been to first create a similarity matrix for the attributes and then perform a multidimensional scaling (see Section 3.9.2, p. 80). However, in that case the information about the individual images is lost. The same is true for several cluster analyses, because then the differences due to the attributes are not included. Thus, a 'proper' analysis depends very much on what the questions are, and on what can be assumed about the data and the cases.

11.7 Content summary

Even though it was not my specific intention to contribute substantially to the original discussion, some results seem worth considering in the ongoing statistical debate. First of all, the methods used could provide numeric evidence and help in making general statements about the similarities and differences between regions, such as the observation that Algeria and Kimberley images are in general rather different. With respect to the images from Zimbabwe, India, and Algeria the techniques suggested that some attributes are very much region specific. Finally, the images in the three Australian regions have more similar attributes than those in the other countries in this study, but regional differences occur as well.

Obviously, we have to realise that human images are expected to be similar, as they all depict the human form. In addition, statistics and data deal not only with systematic but also with random variability. Thus, pure randomness may cause differences in the style and execution of images in unknown amounts, so that some images from different regions may be similar to or different from each other by pure chance.

The end of this chapter is marked by Conan Doyle's 'dancing men' from his story The Adventure of the Dancing Men. They are not unlike some rock art human images and carry a specific message. In terms of attributes, they have at least the following: Active, Stick figure, but not Stick head; they are either Front facing, Right facing, or Left facing. Two of them have the attribute Upside down and three of them have the attribute Wave flag. It is unlikely that they are of Australian origin. The code was deciphered by Sherlock Holmes, but can also be found at the Geocaching toolbox.10

10 Geocaching toolbox: https://www.geocachingtoolbox.com/index.php?lang=en&page=codeTables&id=dancingMen; accessed 15 June 2020.

Chapter 12

Questionnaires: Public views on deaccessioning

Abstract Main topics. To what extent do Italian museum visitors appreciate their museums selling part of their stock (deaccessioning)? Of primary interest was whether people think this should be done at all, which objects are eligible, and who should be allowed to buy them. Data. 295 Italians visiting open-air museums in Rome filled out the Italian questionnaire of 22 questions on such topics as which objects could be sold, under which conditions they could be sold, who should be allowed to buy them, and what could be done with the proceeds from the sales. Research questions. Next to an interest in opinions about deaccessioning, a major question was whether the structure of the questionnaire was recognisable in the answers of the respondents. In addition, we would like to know whether within the subdomains the items form a single scale. Statistical techniques. Confirmatory factor analysis, structural equation modelling, scale analysis, Cronbach's alpha. Keywords Deaccessioning · Italian museums · Questionnaire · Scale construction · Confirmatory factor analysis · Structural equation modelling · Cronbach's alpha

12.1 Background

The focus in this chapter1 is on investigating an existing questionnaire designed to understand how the public, or rather museum visitors, feel about deaccessioning. The analysis should be seen as an example of how one may go about evaluating questionnaires and their internal structures in general, rather than an attempt to design the ideal questionnaire to measure attitudes towards deaccessioning.

We have assumed the sample to be representative of the target population. The sample could not be designated as random, even though the questionnaires were handed out randomly to visitors. The fact that only 27% of the visitors were prepared to answer probably destroyed the planned randomness. The primary reason people gave for not being prepared to answer was lack of time. Whether this affected the randomness of the sample is impossible to say. Thus, in this chapter we are not so much concerned with designing a questionnaire, formulating the items2 in the cleverest way possible, determining the optimum number of answer categories, etc. Such topics may be commented on in passing, but it is assumed that a researcher has dealt with such considerations before coming to this chapter.

1 This chapter was co-authored by Prof. Marilena Vecco, CEREN, EA 7477, Burgundy School of Business - Université Bourgogne Franche-Comté; e-mail: [email protected].
2 The word item always refers to an item in the questionnaire; a museum 'item' will always be referred to as an 'object'.

Deaccessioning museum stock

As mentioned above, we will focus on a questionnaire concerned with the public view towards deaccessioning. Deaccessioning may be defined as the 'permanent removal or disposal of an object from the collection of [a] museum by virtue of its sale, exchange, donation, or transfer by any means to any person'.3 The reasons for removing objects from a museum collection can be manifold, such as streamlining the collection, raising funds for new acquisitions, creating storage space for new objects, stopping gaps in the operational costs, etc. As emphasised by museum directors and curators, deaccessioning is a complex and subtle process because many issues are involved, such as who owns the objects, where the artworks came from, who paid for or donated them, who has the right to acquire the deaccessioned objects, and what the museum may do with the sale proceeds. Many such issues are laid down in the management policies of the museums, and in every case the decisions have to be carefully justified; for an in-depth treatment see Vecco and Piazzai (2015), Vecco, Srakar, and Piazzai (2017), and their references; for a further fascinating exposure of the dilemmas and issues see Montias (1995).

3 New York Consolidated Laws, Education Law - EDN §233-a. Property of the state museum; http://codes.findlaw.com/ny/education-law/edn-sect-233-a.html; accessed 15 June 2020.

Deaccessioning: Role of the public

Deaccessioning decisions are initially made by the museums themselves, but pass through many layers of various other institutions. Understandably, however, the general public is rarely consulted. It is also rare for the general public to be asked for their opinions about deaccessioning in general, let alone that these views are heeded. Crucial in such public opinion seeking is what views the people filling out the questionnaire have on the purpose of museums. Crudely stated: (1) Is the purpose of

museums collection-oriented, i.e. to preserve the cultural heritage per se, or (2) public-oriented, i.e. to inform the public about their culture and their past, and to present their collection to the public optimally, or (3) both? Public trust in the proper management of museum collections and proper behaviour with respect to deaccessioning are issues that require consulting the public itself, rather than making assumptions about its views. As stated in Vecco et al. (2017, p. 35), 'The perspective of the public as the primary stakeholder of the museum collections has not received sufficient attention in scholarly and professional literature'. In other words, creating the means to investigate public opinion about deaccessioning is clearly a necessity.

12.2 Research questions: Public views on deaccessioning

As explained above, the major methodological topic here is the analysis of the questionnaire. Naturally, we assumed that the items in the questionnaire were representative of all subdomains of deaccessioning, and we were interested in whether the division into these specific subdomains could be empirically confirmed; put differently, whether the subdivision of the questionnaire into specific subdomains of deaccessioning could be recovered from the respondents' answers. If confirmed, the next question is whether the items within the subdomains form consistent scales. The approach in this chapter is therefore different from that in Chapters 9 and 10: there, no explicit a priori assumptions were made about the structure underlying the questions or rating scales; moreover, in this study we are ignoring individual differences.

On the basis of the literature, Vecco et al. (2017, p. 36) formulated hypotheses about deaccessioning in which, among other things, the following aspects were mentioned: (1) using proceeds from deaccessioning, (2) improving museum infrastructure, (3) using the proceeds for commercial purposes, (4) relocating objects in order to maintain or assure public accessibility. Following the first part of the Vecco et al. paper, our purpose here is to empirically evaluate the design of the questionnaire. In particular, it is our intention to evaluate whether the questions were indeed tapping into the constructs underlying the items (scale construction). Full details about the exact questions (translated from Italian), the hypotheses, and the interpretation of and conclusions from the outcomes can be found in the original paper (Vecco et al., 2017). We will, however, start by supplying some information about the items themselves.



12.3 Data: Public views about deaccessioning

12.3.1 Questionnaire respondents

A random selection of Italian visitors entering the Colosseum and/or the Roman Imperial Fora were approached to fill out the Deaccession Questionnaire (Table 12.1). A total of 310 valid forms were obtained, of which 295 had no missing data. We will restrict ourselves to forms with complete data, and refer for a discussion of missing data to McKnight et al. (2007); see also Section 2.3, p. 36.

12.3.2 Questionnaire structure

The deaccession questionnaire consisted of three parts. Part A contained questions about types of objects that could be sold: 'A sale can be acceptable if ...'; Part B referred to the respondents' attitudes towards aspects of selling objects, such as the actual procedure of a sale, and where the object was going to end up: 'Selling is acceptable only if ...'; Part C referred to the views of the respondents on the purposes to which the proceeds of the sales would be put: 'Proceeds should be used for ...' (Table 12.1).

Validating the structure of the questionnaire

Even though the research itself was directed towards the aspects mentioned above, because the questionnaire was new it was also important to establish whether the way the respondents answered was in line with the a priori grouping of the items. After all, the intention was to measure the latent variables underlying the three parts, so that hypotheses about the public view towards deaccessioning could be investigated. A problem in this kind of research is the wish to simultaneously develop and validate the questionnaire, and establish the validity of the hypotheses under study. Ideally one would like to approach the two goals separately, but it would be a waste of resources not to use the outcomes of the questionnaire to answer the hypotheses as well. Of course, such results need a replication study to confirm the outcomes obtained.

12.3.3 Type of data design

The design of the dataset was such that it corresponds to an internal structure design (see p. 54), although it also has something of a dependence design, because we assume that the answers in the three sections are driven by a common factor hypothesised a priori. Thus, we could consider the three factors as latent predictors for the manifest (observed) variables, i.e. the items in the questionnaire (see Section 3.8.3, p. 73). Our aim in this chapter was to confirm this theoretical factor structure in practice. We also go one step further by investigating whether single consistent scales can be constructed within each subdomain to measure the factors more directly.

The items of the questionnaire were five-point bipolar scales, also referred to as Likert scales, with values 1 (= completely disagree; left end point) to 5 (= completely agree; right end point); 3 is the neutral point. All items were arranged such that all high values express the same sentiment, i.e. all high values indicate favourable opinions towards deaccessioning (for data types see Section 1.5.3, p. 14).

Table 12.1 Italian deaccession questionnaire

Item                                           Abbreviation     Mean  St.Error  St.Dev.
Group A: Sale can be acceptable if...
8. The object is not within theme museum2      Out of theme2    3.01    .08      1.39
4. The object is not within theme museum       Out of theme     2.97    .08      1.35
6. The object was not recently exhibited       Not exhibited    2.94    .08      1.41
2. The object has minor interest               Minor interest   2.89    .08      1.33
3. The object has close substitutes            Substitutes      2.88    .08      1.38
7. The object is of a lesser art               Lesser arts      2.66    .08      1.33
1. The object is not local                     Provenance       2.58    .08      1.33
5. The object has more recent origin           Age of object    2.43    .07      1.27

Group B: Selling is acceptable only if...
15. Future public accessibility guaranteed     Open to public2  4.24    .06      1.10
09. Object goes to a public collection         Open to public   4.20    .07      1.14
11. Sales are transparent to the public        Transparency     4.11    .07      1.24
12. Sales are for a limited number of objects  Set a limit      3.52    .08      1.42
10. Object was not a gift or legacy            Not a gift       3.42    .08      1.36
14. Relevance destination museum is same       Same relevance   3.40    .07      1.27
13. Destination museum is nearby               Same territory   3.08    .08      1.38

Group C: Proceeds should be used for...
19. Restoration of other objects               Restorations     4.14    .06      1.04
22. Education activities                       New education    3.80    .07      1.23
17. Building maintenance                       Maintenance      3.74    .08      1.29
18. Building improvement                       Expansions       3.71    .07      1.26
16. New relevant acquisitions                  Acquisitions     3.65    .07      1.22
21. Lower admission                            Entrance fees    3.63    .08      1.36
20. New museum services                        New services     3.17    .08      1.33

∗ The item numbers correspond to those in the questionnaire.
∗ The questions are ordered according to their means.
∗ Note that all means but one of the A items are below the neutral point 3 of the 5-point scales, and thus unfavourable towards deaccessioning. B and C items are above 3, and thus more favourable towards deaccessioning.



12.4 Analysis methods

Thus, for our investigation we have an internal structure design for continuous variables (see Section 3.8, p. 65), and depending on the precise purpose we could use principal component analysis (Section 3.8.1, p. 66), confirmatory factor analysis (Section 3.8.3, p. 73), and/or structural equation models (see Section 3.8.4, p. 75) for our analysis. These three techniques were also used in the original paper. In this chapter we will only report on confirmatory factor analyses within the context of structural equation models, using the program eqs.4 A good introduction with many practical examples is Byrne (2013).

4 http://www.mvsoft.com/index.htm.

12.5 Public views on deaccessioning: Statistical analysis

Even though the research questions concentrate on the relationships between variables, it is a good idea to first take a good look at more elementary statistics, such as means and variances. The analysis of the means already provides considerable information about the opinions of the respondents with respect to deaccessioning. Moreover, insufficient variation in the scores could prevent a proper analysis of the structure of a questionnaire, because correlations and/or covariances may not be stable; this should be inspected. A lack of variability occurs if all respondents provide (almost) the same answer. It would do wonders for the accuracy of the means, though.

12.5.1 Item distributions

Four items (B09, B11, B15, and C19) had a skewness around −1.5, because a large majority of the respondents agreed with them (scored 4 or 5); these items had a mean of four or more on the five-point scales. This was deemed acceptable for a sample size of around 300. Both numeric (Pearson) correlations and ordinal (Spearman) correlations were computed; they resulted in essentially the same values, so that subsequent analyses were based on the numeric (Pearson) coefficients.
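These checks are easy to carry out in any statistics package; the sketch below shows one way to do so, assuming SciPy. The matrix items is a random stand-in for the 295 × 22 matrix of questionnaire answers, not the actual data.

```python
# Sketch of the data checks described above: skewness per item and a
# comparison of Pearson and Spearman correlations (hypothetical Likert data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
items = rng.integers(1, 6, size=(295, 22)).astype(float)   # 1-5 Likert scores

print(stats.skew(items, axis=0))            # flag items with skewness near -1.5
r_pearson = stats.pearsonr(items[:, 0], items[:, 1])[0]
r_spearman = stats.spearmanr(items[:, 0], items[:, 1])[0]
print(r_pearson, r_spearman)                # compare the two coefficients
```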

12.5.2 Item means

Figure 12.1 shows the means of the 22 items, grouped by the three questionnaire sections but rearranged according to the degree of acceptance of deaccessioning.


Fig. 12.1 Means of the items, grouped by parts of the questionnaire




The intervals around the means in Figure 12.1 are equal to 1.5 × the standard error of the mean, as, for instance, recommended by Basford and Tukey (1999, p. 266). If two intervals around two means do not overlap, then the means are significantly different at an α level of approximately .05. Thus, such a line graph can be used to get an overall view of the reliable differences between the means; see also Chapter 4, p. 115. The means of the questionnaire and conclusions based upon them are shown on p. 253 in Table 12.1 (third column) and in Figure 12.1 below each panel. As can be seen, the variabilities of all items are more or less equal at roughly a little over one scale point, which points to sufficient variability for further analyses.

A caveat. It should be realised that comparable item means do not imply that items are scored similarly by the respondents; we need the item correlations for this. Thus, even though the means in Figure 12.1 roughly seem to fall into four groups, this does not mean that the group structure of the items as such is validated; items within a similar means group may just as well be uncorrelated or negatively correlated. Clearly, such information is particularly necessary if we are to decide whether the items belong to their designated subdomains, and within these subdomains form one or more consistent scales. Because the item distributions are reasonably normal, we can base the analyses of the questionnaire structure on the item covariances or correlations. For more information on normal distributions see Section 2.2.1 (p. 24).
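A plot of this kind is straightforward to produce; the sketch below uses the Group A means and standard errors from Table 12.1 with matplotlib (our choice of tool, not the book's software).

```python
# Sketch of a means plot with 1.5 x standard-error intervals (cf. Figure 12.1),
# using the Group A values of Table 12.1; non-overlapping intervals indicate
# reliably different means.
import numpy as np
import matplotlib.pyplot as plt

means = np.array([3.01, 2.97, 2.94, 2.89, 2.88, 2.66, 2.58, 2.43])
sems = np.array([.08, .08, .08, .08, .08, .08, .08, .07])

plt.errorbar(range(len(means)), means, yerr=1.5 * sems, fmt="o", capsize=3)
plt.axhline(3, linestyle=":", label="neutral point of the 5-point scale")
plt.ylabel("Mean item score")
plt.legend()
plt.show()
```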

12.5.3 Item correlations

A measurement model is a model concerned with the relationship between observed items and one or more assumed underlying latent variables. In particular, the question is whether 'scores' on latent variables are responsible for the observed scores on the items, and whether correlations between the items are the result of scores on a common latent variable. In the present case, this question takes the shape of wondering whether the three groups of scales measure three latent variables, each of which represents a particular type of opinion about deaccessioning.

Before going into the confirmatory factor analysis it is useful to inspect the correlations to get a feel for the relationships between the variables themselves, without assuming beforehand that there are underlying factors. We will show this for each part of the questionnaire separately. To interpret the correlations in Table 12.2 we can use the fact that with N = 295 a confidence interval around a correlation of .40 runs roughly from .28 to .52. In the table the correlations above .52 are in bold, those below .28 in italics. The correlations between items of different groups are as follows:
∗ Group A with Group B: average item correlation .11; range −.09 to .28
∗ Group A with Group C: average item correlation .15; range −.04 to .27
∗ Group B with Group C: average item correlation .14; range −.03 to .25
Thus, the between-group item correlations are consistently lower than the within-group correlations, so that the existence of at least three subdomains is a reasonable assumption, each subdomain represented by a latent factor; the latent factors may be correlated among themselves.



Table 12.2 Correlations per item group

The correlations within a dashed-line box form a group of comparatively high correlations, i.e. the items within the box were answered in the same manner. Note that the information in the table is enhanced by rearranging the order of the items judiciously. Using the order in the questionnaire is less effective.

What remains an open question is whether we can speak of an overarching domain of deaccessioning. We may and should hypothesise that the three subdomains measure three different latent variables, or different aspects of a central latent variable (here, deaccessioning). This hypothesis will be tackled by comparing different confirmatory models.

From the confidence interval (.28 – .52) around the mean correlation (.40) of the Group A items given above, we may informally consider all correlations within this interval as more or less equal to .40, so that such values indicate a considerable relation between the items. On the basis of this idea, there is a clear consistency among the Group A items. Something similar can be said for five of the seven items in Group B, although the correlations are consistently somewhat lower in Group B than in Group A. The two items referring to the destination museum's characterisation (B13 and B14) are not really part of this. Finally, the Group C items largely form a single group if 'Acquiring new objects' (C16) is not included. However, note that there is some distinction between C17–C19 and C20–C22, indicated by the size of their between-group and within-group correlations.

Furthermore, we see, not surprisingly, that the original and control questions in Groups A and B (marked with an added 2) are highly correlated, in particular r(A04,A08) = .74 (Sale can be acceptable: Out of theme) and r(B09,B15) = .76 (Selling is acceptable only if: Open to the public). The fact that these correlations are not closer to 1.00 indicates slight inconsistencies in respondents' answers. What is surprising is the even higher correlation (.83) between 'Proceeds should be used for building maintenance' (C17) and 'Proceeds should be used for expansion' (C18). In other words, these items were seen as two sides of the same coin: using the proceeds for institutional benefits.
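A common way to obtain such a confidence interval is the Fisher z transformation; the sketch below shows the computation for r = .40 and N = 295. It yields slightly narrower bounds than the rounded (.28, .52) quoted above, which may rest on a different rule of thumb.

```python
# Sketch of a Fisher z confidence interval for a correlation (r = .40, N = 295).
import numpy as np

r, n = 0.40, 295
z = np.arctanh(r)                    # Fisher z transform of r
half = 1.96 / np.sqrt(n - 3)         # half-width on the z scale
lo, hi = np.tanh(z - half), np.tanh(z + half)
print(round(lo, 2), round(hi, 2))    # approximately .30 and .49
```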

12.5.4 Measurement models: Preliminaries

In Section 3.13 (p. 91) a distinction with respect to model selection was made between selecting a class of models appropriate for the research question, and model selection within a model class. From our discussion in the previous paragraphs it has become clear that we are interested in a confirmatory model with latent predictors. This leads us towards confirmatory factor analysis, or possibly towards a slightly more complex structural equation model.

Before we start the search for an appropriate confirmatory factor analysis, there is a problem that should be solved, primarily at the substantive level. Was it really the intention of the researchers to weight the content of the items with control items (i.e. items 4 and 8, and items 09 and 15; see Table 12.1) as double that of the content of the other items? In this chapter we will assume that this is not the case, and therefore the control items are not included in the further analysis. Another question is whether it makes sense to include C16, as it has no large correlations with any of the other Group C items. It obviously appeals in a different way to the respondents; it is for the experts to investigate this further. One may argue that the analysis would show that it does not fit well, and would therefore degrade the model fit. Consequently, it was decided not to include this item either.



Fig. 12.2 Confirmatory factor models: Prototypes. The factors are the latent variables; the items the observed variables. Arrows indicate the direction of the influence; the straight dashed line between two factors represents their correlation. The es and ds represent the random errors or residuals, i.e. the part of the data that is not fitted by the factor model.

Another point which should be mentioned is that confirmatory factor analyses are customarily carried out on covariance rather than correlation matrices. The reasons for this are largely technical, but some discussion can be found in Section 3.8.4 (p. 75). For interpretation, correlations are much more effective, because they have a fixed range from −1 to +1, whereas covariances depend on the sizes of the variances, which can be, and often are, in different measurement units. Therefore, even though our analyses are performed on the covariance matrices, when reporting the outcomes it will provide more insight if we interpret the results in terms of correlations.

Even though researchers often have a clear idea of the structure of their questionnaire, and of which items are indicators for which latent variables, it is vital that these ideas are tested in practice, i.e. against the opinions of the people for whom the questions are meant. After all, academic experts may have very different ideas from those of people from other walks of life when they are filling in the questionnaire. Thus, for a particular questionnaire a number of confirmatory factor models should be formulated to check whether the researcher's theoretical ideas about its structure concur with those of the general public. Such structural models may differ in aspects such as the number of (first-order) factors, which factors predict which items, whether the factors are correlated, whether the factors in turn may be predicted by other factors (within confirmatory factor analysis called second-order factors), and whether some of the residuals remaining after the regression on the items are themselves correlated or not. Incidentally, within this context residuals are generally referred to as errors. Figure 12.2 offers a graphical representation of some prototype confirmatory factor models.

Given our discussion of Table 12.2, we have to become more specific as to which types of confirmatory models should be involved in our search. It seems sensible to check whether each of the three groups can be reasonably modelled by a single factor, or whether further factors may be necessary to get a well-fitting model. A further question is whether it makes sense to include a second-order factor representing Deaccessioning in the model. The existence of such a factor would mean that the respondents' answers to all items contained an aspect of their more general attitude towards deaccessioning. A similar situation exists in intelligence research, where a general intelligence second-order factor is often modelled next to first-order factors such as verbal intelligence and numeric intelligence.

12.5.5 Measurement models: Confirmatory factor analysis

An acceptable procedure is to first work out what kind of factor analysis models will answer the questions about structure. Such a model consists of a number of regression-type equations, one for each item (the response variable), with one or more factors as the latent predictors plus a residual. Of course, not all the variability in an item will be exactly predicted by the factors, so that the predicted scores for the items based on the regression equations are not exactly the same as the original ones. Once the predicted scores have been estimated, we can use them to calculate a covariance matrix based only on the predicted or factor scores. Such a matrix is called the model-based or implied covariance matrix (see Chapter 4, p. 125). If the factor model is a good one, the implied and the observed covariance matrices should be very similar. How similar can be expressed by several fit measures. Statisticians have devised a large number of these, each sensitive to another aspect of the fit. For an overview, see Chapter 4, p. 125; see also Byrne (2013). Some of these fit measures can be formally tested. The desire is that these model tests will not lead to rejection of substantively plausible models. In contrast, null hypothesis tests or tests for specific parameters are generally carried out in order to reject them, i.e. to conclude that the means are different, the correlations are most likely not equal to zero, the inclusion of the parameter is warranted, etc.
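In standard factor-analysis notation (our compact summary, not a formula quoted from this book), these regression-type equations and the implied covariance matrix can be written as:

```latex
% x_i: observed item score; F_1,...,F_m: latent factors; lambda: loadings;
% e_i: error term; Sigma(theta): covariance matrix implied by the model.
x_i = \lambda_{i1} F_1 + \dots + \lambda_{im} F_m + e_i ,
\qquad
\Sigma(\theta) = \Lambda \Phi \Lambda^{\top} + \Theta
```

Here Λ collects the loadings, Φ the factor (co)variances, and Θ the (possibly correlated) error (co)variances; the fit measures discussed below compare this implied Σ(θ) with the observed covariance matrix.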



12.5.6 Measurement models: Deaccessioning data

To verify a plausible model via confirmatory factor analysis, we need to investigate how different an implied covariance matrix constructed using this model is from the observed covariance matrix. As discussed above, several measures of fit are used for this purpose (see Chapter 4, p. 125: structural equation models, and Section 2.5.2, p. 43). Here we will use some of the more common fit measures. These are provided for the models considered in Table 12.3; below the table their recommended minimum or maximum values for a good fit are indicated in square brackets. See also Schermelleh-Engel et al. (2003, p. 52).

Model comparison: Fit measures

Table 12.3 presents an overview of fit measures for the following models: (I) a one-factor model; (II) a two-factor model; (III) a three-factor model with one factor for each group and correlations between the factors (this model is technically equivalent to a second-order factor model); (IV) a four-factor model (two factors for Group C) with a second-order factor; (V) a four-factor model with correlations between the factors; (VI) a variant of model (IV), and (VII) a variant of model (V), both with an extra correlation between the errors of item B13 (New collection is nearby) and B14 (New collection is relevant). The models with correlated factors, (II), (III), (V), and (VII), are like the right-hand diagram in Figure 12.2, which incidentally also has two correlated errors. The models with a second-order factor, (IV) and (VI), are similar to the bottom left-hand diagram in Figure 12.2, except for the item (item 3) influenced by both first-order factors. For a brief introduction to structural equation models see Chapter 4, p. 125, and for confirmatory factor models Chapter 3, p. 73.

Even without a detailed explanation of the various fit measures, comparison of the obtained fit measures (with their commonly accepted values for good fit indicated between square brackets) shows that the four-factor models are a great improvement over the models with fewer factors. Moreover, adding a correlation between the errors of the B13 and B14 items is an effective means to get more attractive four-factor models. The choice that remains is whether to include correlations between the factors or to postulate a second-order factor influencing all first-order factors, i.e. Model (VII) or (VI). This final decision is partly one of interpretability but also one of personal preference. In our opinion, the better fit of the correlated-factor model does not outweigh the gain in interpretability offered by the second-order factor, which moreover conforms to the a priori design considerations that include a single deaccessioning factor. However, the support for a general second-order deaccessioning factor is not as strong as we would like, and further experience with the questionnaire or some additional or rephrased questions will be useful to settle the matter.



Table 12.3 Confirmatory factor analysis: fit measures for the models considered

Model (factors)                 Code       χ²     df    AIC   SRMR   CFI   RMSEA
(I)   1f                        1f        1087   152    783   .09    .48   .145
(II)  2f corr.                  2f         670   151    670   .08    .71   .108
(III) 3f corr. = 3f + 2e        3f         463   149    165   .08    .83   .085
(IV)  4f + 2e                   4f-2e      337   148     41   .07    .90   .066
(V)   4f corr.                  4f-cor     337   146     39   .07    .90   .066
(VI)  4f + 2e + e(B13,B14)      4f-2e-E    285   147     −9   .06    .92   .056
(VII) 4f + corr + e(B13,B14)    4f-corE    227   145    −13   .06    .93   .056

∗ Description of models: #f = number of factors; 2e = second-order factor; corr. = correlated factors; e(B13,B14) = correlated error term for items B13 and B14.
∗ Values for not rejecting a model are given between brackets below, where the single square brackets give the lenient threshold and the double brackets the strict one.
∗ χ² = measure of fit of the implied covariance matrix to the observed one; df = degrees of freedom [χ²/df ≤ 3] [[≤ 2]]; AIC = Akaike information criterion [[smallest]]; SRMR = Standardised Root Mean-square Residual [≤ .08] [[≤ .05]]; CFI = Comparative Fit Index [≥ .95] [[≥ .97]]; RMSEA = Root Mean-Square Error of Approximation [≤ .08] [[≤ .05]].

Model comparison: Deviance plot

To assist model selection, we will inspect a deviance plot, in which the lack of fit or deviance is plotted against the degrees of freedom (df), where df = num − parm = 'total number of independent covariances' minus 'number of parameters in the model'. Models with many degrees of freedom (= few parameters) are cheap, simple models that are mostly easier to interpret (see Kroonenberg, 2017, for details). If we equate good fit with high quality, and degrees of freedom with cheapness (low costs), a good-fitting model with many degrees of freedom is equivalent to a high-quality bargain. The deviance plot is a graph which shows the quality of models against their price. Figure 12.3 shows on the right-hand side cheap models, which have many degrees of freedom but are of low quality. To the left are the high-quality models, but they are expensive. To make model comparisons easier, we have connected the models in the graph via a convex hull running from the cheapest, worst-fitting model on the right (Model 1f) to the most expensive, best-fitting model on the left (Model 4f-corE). The red X in Figure 12.3 shows roughly the location of what would be a low-cost, high-quality model with few parameters and many degrees of freedom, but such bargains rarely exist. Nearly always a compromise has to be struck between quality and cost.

In this particular case, moving from the right-hand to the left-hand side of Figure 12.3, we see the curve connecting the models getting flatter and flatter, which means that there is less and less gain in fit for investing more and more degrees of freedom, i.e. for having more and more parameters in the model.



Fig. 12.3 Deviance plot for selecting a confirmatory factor model. Preferred models lie on the dashed line. They have a favourable fit/degree-of-freedom ratio. 4f-2e-E is our preferred model. X is the approximate location of an optimal model, if it existed.

The final decision is naturally subjective, but the deviance plot confirms our earlier choice: Model (VI) with four factors, a second-order factor, and a correlated error term (Model 4f-2e-E).
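Such a plot is easy to produce from the values in Table 12.3; the sketch below (matplotlib is our choice of tool, not the book's) plots chi-square against degrees of freedom for the seven models.

```python
# Sketch of a deviance plot: chi-square (lack of fit) against degrees of
# freedom for the seven models of Table 12.3 (cf. Figure 12.3).
import matplotlib.pyplot as plt

models = ["1f", "2f", "3f", "4f-2e", "4f-cor", "4f-2e-E", "4f-corE"]
chi2 = [1087, 670, 463, 337, 337, 285, 227]
df = [152, 151, 149, 148, 146, 147, 145]

plt.plot(df, chi2, "o")
for name, x, y in zip(models, df, chi2):
    plt.annotate(name, (x, y))
plt.xlabel("Degrees of freedom (cheapness)")
plt.ylabel("Chi-square (deviance)")
plt.show()
```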

12.5.7 Item loadings

The regression coefficients or loadings for predicting the items by the factors are provided in Table 12.4 (p. 264). With only one factor predicting an item, the regression coefficient is at the same time the correlation between the item and the factor, and may be interpreted as such. The correlations of the second-order factor with the first-order factors are shown in the table as well. From the table it becomes clear that the Group A items correlate more or less equally with the first factor (mean correlation ≈ .64). This is less so for the Group B items, as the correlations of B13 and B14 are much lower than those of the other Group B items; thus it is primarily B09, B10, B11, and B12 that determine the interpretation of the factor, whereas the judgements on B13 and B14 have something specific in common, as shown by the correlation between their error terms. The items of the two benefit factors of Group C are each substantially influenced by their factors, as can be deduced from their high correlations with these factors. Finally, the general or second-order factor is a strong predictor of the first-order factors (mean correlation ≈ .62).



Table 12.4 Confirmatory factor analysis: loadings of the preferred model

Item                  Predictor   Loading      Standard   Error   R²    Error         Error
                      (Factor)    r(Item,F)    Error      Term          Coefficient   Correlation

Factor 1 (F1): Deaccession acceptance depends on characteristics of the object
A01 Provenance        F1          .69          X          e1      .47   .73
A02 Minor interest    F1          .68          .10        e2      .46   .74
A05 Age of object     F1          .67          .09        e5      .45   .74
A04 Out of theme      F1          .66          .10        e4      .44   .75
A07 Lesser art        F1          .61          .10        e7      .37   .80
A03 Substitutes       F1          .58          .10        e3      .34   .81
A06 Not exhibited     F1          .58          .10        e6      .33   .82

Factor 2 (F2): Deaccession acceptance depends on characteristics of the transaction
B11 Transparency      F2          .74          .16        e11     .55   .67
B09 Open to public    F2          .62          X          e09     .38   .79
B10 Not a gift        F2          .60          .15        e10     .36   .80
B12 Set a limit       F2          .51          .15        e12     .26   .86
B14 Equal relevance   F2          .35          .13        e14     .12   .94           .42
B13 Equal territory   F2          .23          .14        e13     .06   .97           .42

Factor 3 (F3): Proceeds used for institutional benefit
C17 Maintenance       F3          .93          X          e17     .86   .37
C18 Expansions        F3          .89          .05        e18     .78   .47
C19 Restorations      F3          .66          .05        e19     .43   .75

Factor 4 (F4): Proceeds used for public benefit
C20 New services      F4          .73          X          e20     .53   .68
C22 Entrance fees     F4          .70          .10        e22     .49   .72
C21 New education     F4          .65          .10        e21     .43   .76

Factor 5 (F5): General acceptability of deaccessioning
F4                    F5          .74          .09        d4      .54   .68
F3                    F5          .68          .09        d3      .47   .73
F2                    F5          .57          .07        d2      .33   .82
F1                    F5          .50          .08        d1      .25   .87

∗ An X in the standard error column indicates that for technical reasons this value is not available, due to fixing of the relevant parameter.
∗ The items have been sorted per factor on the basis of their correlations with that factor.
∗ r(Item,F) = correlation between an item and the first-order factor.
∗ R² = fit of an item; communality.

12.5.8 Interpretation

What is the interpretation of the preferred model?
∗ The first factor concerns only the Group A items, and so can be characterised as Deaccession acceptance depends on characteristics of the object.

12.5.8 Interpretation What is the interpretation of the preferred model? ∗ The first factor concerns only the Group A items, and so can be characterised as Deaccession acceptance depends on characteristics of the object.

12.5 Public views on deaccessioning: Statistical analysis

265

∗ The second factor concerns the Group B items and could be named Deaccession acceptance depends on characteristics of the transaction. The extra correlation between the error terms of B13 and B14 indicates technically that the two items are more highly correlated (.46) than could be modelled by the B factor alone; the estimated error correlation of .42 (Table 12.4) absorbs the difference. At least two explanations are feasible: (1) item formulation, i.e. the items were formulated in a similar way, inducing a higher correlation unrelated to their content, or (2) regional integrity, i.e. both items are the only ones referring to the regional or territorial importance of Italian museum collections, which should not be downgraded through deaccessioning.
∗ The Group C items split into two groups, each influenced by its own factor. The third and fourth factors may be interpreted as Proceeds used for institutional benefit and Proceeds used for public benefit, respectively.
∗ The second-order factor has correlations of .50 to .74 with the first-order factors, and can thus safely be assumed to represent General acceptability of deaccessioning, with high values indicating a favourable attitude. The fact that all coefficients for predicting the first-order factors from the second-order factor are positive indicates that high scores on all factors represent a favourable attitude towards deaccessioning.

A final question we should deal with is whether each of the factors forms a single scale, so that we can simply add the items and use the sum scores over the items of a factor as scores on the latent variables. If so, we can use these latent variables with impunity in further research for populations similar to the one from which the respondents of this study were drawn. For assessing the size of the consistency we have included the three items that were put aside for the confirmatory factor analysis, to investigate whether they harm or contribute to the reliabilities of the scales. The consistency of each factor as a scale was investigated using Cronbach's alpha; see Cronbach (1951), Taber (2018), and the footnote on p. 205. The alphas of the deaccession factors were:
∗ F1, Deaccession acceptance depends on characteristics of the object: alpha = .85
∗ F2, Deaccession acceptance depends on characteristics of the transaction: alpha = .53
∗ F3, Proceeds used for institutional benefit: alpha = .86 (not including C16 Acquisitions)
∗ F4, Proceeds used for public benefit: alpha = .73
∗ F5 (second-order factor), General acceptability of deaccessioning: alpha = .84 (inclusion or exclusion of C16 is immaterial)
Thus, the overall conclusion is that Deaccession acceptance depends on characteristics of the transaction is the weakest scale, and actually below par. However, if we eliminate the weak items B13 and B14, alpha increases to .76, providing a much more consistent scale. For another application of scale construction using Cronbach's alpha see Section 9.5.7 (p. 205).
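Cronbach's alpha is simple to compute from the item variances and the variance of the sum score; the function below is a minimal implementation (the data shown are random stand-ins, not the questionnaire scores).

```python
# Minimal implementation of Cronbach's alpha:
# alpha = k/(k-1) * (1 - sum of item variances / variance of the sum score).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: an N x k matrix holding the k item scores of one scale."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

rng = np.random.default_rng(4)
scale_items = rng.integers(1, 6, size=(295, 7)).astype(float)  # hypothetical data
print(cronbach_alpha(scale_items))
```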



12.6 Other approaches in deaccessioning studies

In this chapter we have discussed the procedure often followed within the social and behavioural sciences for evaluating questionnaires. Clearly, the basis for carrying out such an analysis is a solid understanding of the subject matter, so that a particular a priori factor structure has been incorporated in the design of the questionnaire. The empirical study was designed to validate this construction. Such validation is particularly necessary when there is no complete agreement among researchers about the precise extent of the central construct. A case study such as this may also point to weaknesses in the questionnaire, either due to wording or to different opinions among the researchers themselves and between the researchers and their respondents.

An issue not treated in this chapter is the sample size necessary to carry out the analyses presented here, and what to do when the distributions of the items are very skewed, i.e. strongly concentrated on one side of the scale (see, for example, Wolf, Harrington, Clark, & Miller, 2013). In fact, there was sufficient skewness in the present items to worry about this, so we have run parallel confirmatory factor analyses using the robust methods contained in the eqs6.2 computer program.5 Fortunately, the results were sufficiently similar to make it unnecessary to report the details of these analyses here as well. In fact, this similarity was the basis of our earlier remark that the distributions were sufficiently normal for our analysis.

5 http://www.mvsoft.com/.

12.7 Content summary

To gain insight into the public's attitude towards museums deaccessioning cultural objects, such as paintings, drawings, and statues, about 300 Italians were asked to fill in a questionnaire. The items in the questionnaire were grouped as (1) Deaccession acceptance depends on characteristics of the object, (2) Deaccession acceptance depends on characteristics of the transaction, (3) Proceeds used for institutional benefit, and (4) Proceeds used for public benefit. The emphasis is on the technical quality of the questionnaire, rather than on the content quality; the latter is left for the experts in the field.

A number of conclusions could already be drawn from the responses. Based on the means of the relevant items we concluded that deaccessioning was not really popular with the Italian respondents. The answers to all acceptability questions specifically concerned with selling artefacts were largely unfavourable, independently of the reason cited. If objects were going to be sold anyway, respondents felt, the sale should take place in all openness and be transparent. Moreover, selling was only acceptable under certain conditions. If objects were sold, there was some preference for spending the money on the restoration of existing objects, but most people thought the proceeds should be used for the benefit of both the public and the museum. The way respondents thought about using the proceeds of deaccessioning seemed to be driven by a specific attitude towards deaccessioning in general.

In the study that forms the basis of this chapter the respondents were a sample of visitors to (open-air) museums in Rome. An interesting follow-up would be to test the questionnaire on visitors to regular museum buildings and from other countries, and to compare the results with those from this case study. This is particularly interesting because there is another possible reason to extend the questionnaire's validity. The resistance to deaccessioning is typical of Southern European museums, which belong to a distinctively conservative museological tradition.6 In fact, there exists a divide in Europe between the Anglo-Saxon and the Southern European, or Napoleonic, museum models. The Anglo-Saxon model is generally visitor-oriented, with a focus on exhibition and display, whereas the Southern European model is more oriented towards care and conservation (Vecco and Piazzai, 2015). It is an open question whether the differences between these two models as seen by administrators and legislators are reflected in public opinion. With the present questionnaire, assuming its validity is secure, it might be possible to gauge public opinion on this issue without explicitly referring to the official points of view.

6 This was demonstrated when a top-rated Italian museum refused to cooperate with the research project, because it did not want to be associated with the practice of deaccessioning.

Chapter 13

Stylometry: The Royal Book of Oz - Baum or Thompson?

Abstract Main topics. Linguistic stylometry is concerned with the quantitative analysis of language and style. Authors have idiosyncratic styles in the way they build their sentences, which often makes it possible to determine with some probability who has written a particular text. In this chapter the focus is on one particular book from the Wizard of Oz series. Data. Occurrences of 50 frequently used function words were recorded for 21 Wizard of Oz books. The first 14 were written by L. Frank Baum, the last 6 by Ruth Plumly Thompson, and the authorship of one (the 15th in the series) was contested at the time. Research questions. The central question is which of the two authors wrote the 15th book, The Royal Book of Oz, which appeared two years after Baum's death. Statistical techniques. Principal component analysis, cluster analysis. Keywords Wizard of Oz · Contested authorship · Function words · Stylometry · Principal component analysis · Cluster analysis · Centroid-linkage clustering

13.1 Background Linguists1 have long struggled with authorship questions, basing themselves on historical records and stylistic studies. A famous example is that of Shakespeare's plays and poems.2 However, in this chapter we will only deal with stylometric and statistical arguments. Binongo and Smith (1996, p. 452) define stylometry as 'the quantitative analysis of the recurrence of particular features of [writing] style for the purpose of deducing authorship and/or the chronology of texts'.

1 This chapter was inspired by the work of José Binongo on the authorship question of The Royal Book of Oz; see Binongo (2003) and Binongo and Mays III (2005). With his permission several sections are based on these papers.
2 https://en.wikipedia.org/wiki/Shakespeare_authorship_question; accessed 15 June 2020.

For an introduction to the statistical approach used here, see their paper and Holmes and Kardos (2003). Stylometry can be considered a part of computational linguistics, i.e. 'a field of science that deals with computational processing of a natural language' (Hajič, 2004, p. 81). For an informal story discussing the question whether J. K. Rowling (author of the Harry Potter books) is also the author of The Cuckoo's Calling, written under the pseudonym Robert Galbraith, see Juola (2013).3 Juola (2015) also suggests a procedure for standardising analytic protocols for authorship questions. An interesting comment on 'adversarial' stylometry can be found in The Chronicle of Higher Education,4 where Greenstadt is quoted on the limitations of stylometry: 'Stylometry works really well if nobody's trying to fool it, [b]ut if they are, it can be fooled quite effectively, and by people who are not particularly trained'; see also Brennan, Afroz, and Greenstadt (2012). The argument suggests that authors can first use stylometry to recognise idiosyncrasies in their writing and subsequently avoid them, so that their work is no longer identifiable. Juola's stylometry program could thus be 'used to give cover to authors who have good reason to remain anonymous, like whistle-blowers and political dissidents'. The substantive part of this chapter deals with the authorship question of one of the 'Oz books', The Royal Book of Oz, in particular the question whether it was written by L. Frank Baum or Ruth Plumly Thompson.5 In this case we are clearly dealing with a closed set of possible authors, as there are only two candidates. Some considerations about disputed authorship and the role of stylometry in arbitrating between competing attributions are also discussed in the chapter on the authorship of the Pauline Epistles (Chapter 6) and will not be repeated here. The major difference between these two chapters is that in the case of the Pauline Epistles there is an open set of possible authors, whereas in the Oz case there is a closed set of contenders. The data for this case study are counts of the most common function words, whereas those for the Pauline Epistles study consist of the numbers of times specific function words are contained in a sentence. In the two case studies, we have chosen to use different multivariate techniques to analyse the data.

13.2 Research questions: Competitive authorship L. Frank Baum had written 14 books on the land of Oz, starting with The Wonderful Wizard of Oz, when he died in 1919. However, in 1921 the 15th Oz book, The Royal Book of Oz, was published under his name. At the publisher's request Ruth Plumly Thompson continued the Oz chronicles, and naturally the question arose whether she had had a hand in the transitional 15th Oz book.6 For further details on its history see, for instance, Binongo and Mays III (2005, p. 295). In this chapter we will largely follow Binongo's approach towards answering this question, but using a somewhat differently constituted dataset, because his original one was unavailable.

3 The author of the headline in Scientific American did not quite deliver: 'How a computer program helped show J.K. Rowling write A cuckoo's calling'. Apart from the slightly odd grammar, the book is called The Cuckoo's Calling, as is correctly cited in the article itself.
4 https://www.chronicle.com/article/The-Professor-Who-Declared/140595.
5 Despite inspecting various sources I have not been able to determine unequivocally whether Plumly is a first name or part of the family name of Ruth Plumly Thompson. I have therefore mainly used 'Ruth Plumly Thompson' and 'Thompson'.

13.3 Data: Occurrence of function words Function words are high-frequency words that express a grammatical or structural relationship with other words in a text. They include pronouns, auxiliary verbs, prepositions, conjunctions, determiners, and degree adverbs. Such words have a more grammatical than a lexical function and have no bearing on the topic of a text. The core of the argument for using function words to establish authorship is that authors do not consciously employ such words in their work, so that a great similarity in their usage can be an indication of unconscious personal style. For instance, Grieve (2007, p. 260) says that ‘[a] common assumption of authorship attribution thus appears to be true; function words are better indicators of authorship than content words’; cited in Oakes (2014, p. 9); see also Binongo (2003, p. 11, Figure 1) for a list of reasons for using function words to decide authorship issues.

13.3.1 Preprocessing Departing from the approach I followed in most other chapters of this book, I think it is appropriate to spend a few words on the database used in the analyses. To compare books a database (or corpus) must be created. This is not trivial: the original printed books have to be digitised and preprocessed. 'Examples of preprocessing include tokenization (i.e. splitting a stream of text into words, phrases, etc.), stemming (i.e. only retaining the root or base form of a word), tagging (i.e. replacing words with their grammatical type), removing nonalphabetic characters and spaces, and converting uppercase letters to lowercase' (Neal et al., 2017, p. 6). To make such a database for the Oz books I needed to follow a similar process. The original texts, as available from the Gutenberg Project,7 were downloaded in plain text UTF-8 and stripped of the Gutenberg description and copyright text. The total number of words was counted in MS Word, and the individual word frequencies were counted via the WriteWords website.8 Only the frequencies of the 50 function words selected by Binongo (2003) were included in this study (see Table 13.1). No checks were carried out as to the accuracy and sensitivity of the two programs. A complication in these processes is the spoken word and its idiosyncrasies. Both L. Frank Baum's and Ruth Plumly Thompson's books contain extensive direct speech, which is sometimes written in dialect as well. Binongo and Mays III (2005, pp. 297-298) describe these difficulties, and how they handled them, in some detail. Especially the many contractions in the spoken text, such as don't (do not) and I'd (I had or I would), required much manual intervention. For my dataset I followed a similar procedure, but as the preparation of the texts for counting was done by hand, full accuracy and reproducibility can unfortunately not be completely assured. Given the methodological aim of this book, I considered some imprecision acceptable; moreover, full accuracy would have required an inordinate amount of time.

6 For a list of Oz books written by Baum and other authors, see https://en.wikipedia.org/wiki/List_of_Oz_books; accessed 15 June 2020.
7 Baum: http://www.gutenberg.org/ebooks/author/42; accessed 15 June 2020. Thompson: http://www.gutenberg.org/ebooks/author/34661; accessed 15 June 2020.
8 http://www.writewords.org.uk/word_count.asp; accessed 15 June 2020.

Table 13.1 Binongo's fifty function words and their average occurrence percentages in the Oz books (here ordered by decreasing frequency)

the (6.7%)      and (3.7%)        to (2.6%)      a/an (2.3%)       of (2.1%)
in (1.3%)       that/those (1.0%) it (1.0%)      not (0.9%)        with (0.7%)
but (0.7%)      for (0.7%)        as (0.7%)      at (0.6%)         this/these (0.5%)
so (0.5%)       all (0.5%)        on (0.5%)      from (0.4%)       up (0.3%)
no (0.3%)       out (0.3%)        what (0.3%)    then (0.3%)       if (0.3%)
there (0.3%)    to (0.3%)         who (0.3%)     one/ones (0.3%)   into (0.2%)
just (0.2%)     now (0.2%)        very (0.2%)    down (0.2%)       where (0.2%)
over (0.2%)     before (0.2%)     back (0.2%)    or (0.2%)         well (0.2%)
when (0.2%)     which (0.2%)      how (0.2%)     here (0.2%)       upon (0.1%)
about (0.1%)    after (0.1%)      more (0.1%)    why (0.1%)        some (0.1%)
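To give an impression of the counting step described above, here is a minimal sketch in Python. The short text and the shortened word list are stand-ins; in the real analysis the input is a full, manually prepared Gutenberg text, and combined categories such as a/an and this/these are summed over their members.

```python
import re
from collections import Counter

# In the real analysis the input would be a Gutenberg text, downloaded and
# stripped of its header/footer, with contractions expanded by hand.
text = ("The Scarecrow walked up the road and the Lion ran down to the gate, "
        "but Dorothy went back into the house.").lower()

# A few of Binongo's fifty function words, for illustration only.
function_words = ["the", "and", "to", "of", "in", "but", "up", "down", "back", "into"]

tokens = re.findall(r"[a-z]+", text)   # crude tokenisation: runs of letters
counts = Counter(tokens)
total = len(tokens)

for w in function_words:
    print(f"{w:>5}: {counts[w]:2d} ({100 * counts[w] / total:.1f}%)")
```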

13.3.2 Dataset The dataset has an internal structure design of 21 books as rows (the 'subjects') and 50 function words as columns (the 'variables'), referred to as the books × words matrix. For several standard statistical analysis techniques it is a somewhat uncomfortable situation that there are more variables than subjects, as such techniques expect the reverse, i.e. (many) more subjects than variables. This is especially true for techniques based on correlation or covariance matrices. For comparisons among books it was useful to have the transposed data matrix available for investigation as well, i.e. the words × books matrix with words as 'subjects' (rows) and books as 'variables' (columns). Each Oz book has a different number of words, so that raw word counts cannot be compared. Therefore, the word counts in each book were divided by the total number of words in that book, which means that the basic data to be analysed are proportions. The per-book correlations of the proportions of the occurrence of function words with Binongo's mean percentages were on average 0.995 (range: 0.991-0.997) for the Baum books, 0.992 (range: 0.988-0.997) for the Thompson books, and 0.990 for the disputed book. These correlations are somewhat biased because of the extreme skewness of the word counts of the function words. However, after elimination of the 6 highest frequencies the lowest correlation is around 0.90, and after elimination of the 17 highest frequencies the lowest correlation is around 0.70. These correlations lend credibility to the validity of the counts used in this study.
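The normalisation to proportions and the per-book correlations with a reference profile can be expressed in a few lines; the counts matrix below is a random stand-in for the real books × words data, and the reference profile stands in for Binongo's mean percentages.

```python
import numpy as np

# Stand-in data: counts of 50 function words in 21 books (books x words).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=200, size=(21, 50)).astype(float)
book_lengths = counts.sum(axis=1, keepdims=True) * 10  # pretend totals per book

proportions = counts / book_lengths           # the basic data to be analysed
reference = proportions.mean(axis=0)          # stand-in for Binongo's percentages

# Per-book Pearson correlation with the reference profile
for i, row in enumerate(proportions, start=1):
    r = np.corrcoef(row, reference)[0, 1]
    print(f"book {i:2d}: r = {r:.3f}")
```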

13.4 Analysis methods Which methods should we employ to try and settle the authorship of The Royal Book of Oz? Is the author L. Frank Baum, Ruth Plumly Thompson, or have they both contributed? Does the created dataset with books as 'subjects' and words as 'variables' contain sufficient information for this purpose?

13.4.1 Significance testing One option could be to first carry out a multivariate (Hotelling's) T² test with the words as variables, and test whether there are mean differences in word usage between the two authors. Next, we would check whether, on the discriminating variables, the means for The Royal Book are closer to the means of one author or the other. However, a problem is that the significance test requires normal distributions for the words and a proper covariance matrix. The test is almost impossible to carry out because of the limited number of 'subjects' (= books) compared to the number of variables (= words).

13.4.2 Distributions and graphics Rather than relying on tests, stylometricians have from early on approached the problem via graphical displays. Tests often require distributional assumptions which may not necessarily hold. On the other hand, many multivariate techniques were developed on the basis of multivariate normal distributions, but in practice they have proved very useful in their descriptive powers, provided the distributions do not have extreme outliers and are reasonably symmetrical. To inspect these distributions in a fairly uncomplicated way I have constructed two figures with boxplots (Section 2.2.4, p. 35): one with the high-frequency words (Figure 13.1; upright boxes), and one with the mid-frequency words (Figure 13.2; sideways boxes).
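A minimal sketch of such a boxplot display (Python, matplotlib), with a random stand-in for the proportions and a handful of word labels, purely to indicate how Figure 13.1-type plots can be produced:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Stand-in proportions: 21 books x 10 high-frequency words.
props = rng.beta(2, 40, size=(21, 10))
words = ["the", "and", "to", "a/an", "of", "in", "that", "it", "not", "as"]

fig, ax = plt.subplots(figsize=(8, 4))
ax.boxplot(props)                                  # one upright box per word
ax.set_xticks(range(1, len(words) + 1), labels=words)
ax.set_ylabel("proportion of all words in a book")
plt.show()
```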


Fig. 13.1 Boxplots of the proportions of the high-frequency words. Outliers are labelled with their book numbers. Baum: Books 1-14; Thompson: Books 16-21; contested: Book 15.

For 11 of the 51 words eight of the books have from one to three outliers, but none of them seems severe. On this basis I have decided to ignore them. The graphs also show that the proportions of the function words vary enormously, but that is only to be expected.

Fig. 13.2 Boxplots of the proportions of mid-frequency words. Labels for words are not indicated due to space constraints. Outliers are labelled with their book numbers. Baum: Books 1-14; Thompson: Books 16-21; contested: Book 15.

13.4.3 Principal component analysis and graphics Two variables can be displayed in a two-dimensional graph, and three variables in a three-dimensional space. Similarly, the 51 function words (variables) can in principle be displayed in a 51-dimensional space. That number of dimensions is, of course, impossible to visualise and interpret. In practice, in many datasets we need only two or three dimensions to get a good impression of the mutual relationships of the variables, provided they are highly correlated. This is the case here for the function words. Two or three dimensions should be sufficient, and they can be created by constructing linear combinations of the variables. These linear combinations (or components) are constructed in such a way that the maximal amount of systematic variability is represented by the first two or three components. Of all the variability in the high-dimensional space, about 50% can be accounted for by two components: 39% and 10% of the variability, respectively. The remaining components account for the other half of the variability, most of which is probably random error. The technique to find the appropriate components is principal component analysis (pca); see Section 3.8.1 (p. 66). The function words and the books can be displayed in separate two-dimensional graphs, but they can also be shown in the same plot, a so-called biplot. In this chapter we will use two separate graphs rather than a single biplot (Figures 13.4 and 13.5); they may be superimposed by placing the origins and the coordinate axes of both graphs on top of each other. Biplots are ubiquitous in this book; see, for instance, Chapter 6 for an example, and for more technical details Chapter 4, p. 106.
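A minimal pca sketch (Python, scikit-learn) on a random stand-in for the books × words proportion matrix. Standardising the columns first makes this a pca of the correlation matrix, so that the loadings computed below are the word-component correlations plotted in Figure 13.4; the row scores correspond to the book coordinates of Figure 13.5.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.random((21, 50))                           # stand-in books x words proportions

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardise the columns
pca = PCA(n_components=2)
scores = pca.fit_transform(Z)                      # book coordinates (cf. Figure 13.5)

# Loadings: correlations between the words and the two components (Figure 13.4)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

print("variance accounted for:", pca.explained_variance_ratio_)
print(loadings[:5])                                # loadings of the first five words
```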

13.4.4 Cluster analysis The problem of obtaining insight into which book was written by whom can be addressed by discriminant analysis (see Section 3.6.4, p. 60), provided the possible authors are known beforehand; thus it requires a closed set of authors. If the number of authors is unknown (an open set), discriminant analysis cannot be used. An example of the latter situation is the disputed authorship of the 'Neapolitan novels' by the author using the pseudonym Elena Ferrante (see Tuzzi & Cortelazzo, 2018). If the number of authors is unknown a viable option is to group the works via a cluster analysis; see Section 3.9.3, p. 81. After the cluster analysis each group of works can hopefully be associated with a single author, but additional information must be available to make such an attribution: it will help if the authorship of some of the works under consideration is already known. In our case, we will for demonstration purposes use a clustering procedure to determine the authorship of The Royal Book of Oz, even though we have a closed set consisting only of Baum and Thompson. Ideally a group of books found by cluster analysis should have been written by a single author. If there is no consistency on the part of the authors in their use of function words, no meaningful clustering may be found. There is a plethora of techniques for performing clustering, but we will only use a relatively simple form in this case study to illustrate the principal ideas, as sketched below. In particular we will use centroid linkage, but it should be noted that according to Everitt et al. (2011, p. 83) '[..] no [hierarchical clustering] method can be recommended above all others'.
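A minimal sketch of the centroid-linkage computation (Python, scipy), again on a random stand-in for the data; with the real proportion matrix the two-cluster cut would ideally reproduce the Baum and Thompson branches of Figure 13.6.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(4)
X = rng.random((21, 50))                 # stand-in books x words proportions

# Centroid linkage; scipy requires Euclidean distances for this method.
Z = linkage(X, method="centroid", metric="euclidean")

# Cut the tree into two clusters; with the real data this separates the authors.
two_clusters = fcluster(Z, t=2, criterion="maxclust")
print(two_clusters)
# scipy.cluster.hierarchy.dendrogram(Z, orientation="right") draws a sideways tree.
```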

13.5 Wizard of Oz: Statistical analyses To answer the question of the authorship of The Royal Book of Oz I will use function word data from all Oz books by Baum and Thompson. The authorship was already sorted out by Binongo and Mays III (2005), but here I will sometimes proceed slightly differently. The main tools will be principal component and cluster analysis.

13.5.1 Principal component analysis The first thing to look at in a principal component analysis is the number of components we need to capture a reasonable amount of variability. As indicated earlier, the first two derived components account for about half of the variance in the data. Given the type of data, and probably much variability between the books due to different stories, this seems a reasonable number. Even though it is assumed that function words are employed more or less irrespective of content, some content variability seems unavoidable, especially as there is so much direct speech in the Oz books reflecting idiosyncrasies of the characters.

Variance of the words accounted for
Most principal component analysis programs supply information on how well the variance of the variables is accounted for. This quantity is usually referred to as the communality, which is in fact nothing but the squared multiple correlation of the regression of a word on the fitted components. Thus, we can use the communalities to assess how well the variabilities of the individual function words are represented in the analysis. A confusing aspect is that function words may show a large variability due to their different use in books, but also because there could be multiple authors of a single book. For instance, it could be that there was an incomplete manuscript of The Royal Book of Oz by Baum, which was finalised by Thompson. This aspect seems worthy of further investigation.

Fig. 13.3 Communalities of function words in the Oz books. Variances of function words accounted for by the first two principal components. Note: that/tho = that/those.

Figure 13.3 shows the communalities for the twenty words that were fitted for 60% or more. The lowest explained variance among the remaining 31 words was just 2%, and of the three most common words, the (mean occurrence = 6.6%; fit = 19%), and (mean = 3.7%; fit = 9%), and to (mean = 2.5%; fit = 82%), only to is well fitted, indicating a systematically differential use in the Oz books.

Differences among function words
Our next step is to inspect the relationships among the function words. The linear combinations (components) in principal component analysis are customarily scaled such that their values on the components indicate the correlations between the variables (here: function words) and the components; these are commonly called loadings; see also Section 3.8.1, p. 66. Thus, a high loading indicates that the component and the variable have much in common (are highly correlated). Positive values indicate that high values on the variable correspond with high values on the component, and negative values indicate the reverse, i.e. high values on the variable correspond to low values on the component and vice versa. These correlations are often the basis for equating the components with a (theoretical) concept or name (reification). However, this does not seem to work for the function words, as there is no obvious underlying concept for reification. It is, for instance, difficult to envisage which concept underlies the function words so, that, those, which, and but in a positive sense and up, down, on, over, and back in a negative sense. These function words have high positive and negative correlations, respectively, with the first component, and can be found at its opposite ends.

Fig. 13.4 Word space. Co-ordinates are correlations between function words and each of the first two components.

Figure 13.4 displays the correlations between the words and the components. When a word is located at the edge of the graph, there are large differences in its usage, i.e. in some books the word is used often, in other books hardly at all. Words in the middle, close to the origin, are used more or less equally in all books. The common usage of words in texts determines where they are located in the graph; we have to look at the graph for the books (Figure 13.5) to find out which words are used in which texts.

Fig. 13.5 Book space. Co-ordinates are standardised scores of the books on the first two components.

Differences between the books
Figure 13.5 shows that the first component is determined by the contrast between Baum and Thompson. All Baum's books are on the right and those by Ruth Plumly Thompson on the left, with a clear distance between them. Our central question seems to be answered: The Royal Book of Oz lies firmly in the Thompson part of the graph. It is tempting to argue that, because The Royal Book of Oz is the Thompson book closest to the Baum area, she tried in this book to emulate Baum more than in her later books. Without further information this thought should, however, be labelled as speculation. A better argument in favour of Thompson is that her Oz book Kabumpo in Oz, which appeared one year after The Royal Book, is located at exactly the same spot as the earlier book, suggesting a similar use of function words. The second component shows that The Marvellous Land of Oz and Rinkitink in Oz have considerable negative values on the second component. Thus, there must be some words which are not often used in the other Baum books, especially not in The Patchwork Girl of Oz, which has the highest positive score on the second component.

Words that make the difference between the books
To understand what makes the Oz books different from each other, and in particular which words differentiate between the Baum and Thompson books, we have to look at Figures 13.4 and 13.5 together. In one sense this would have been easier in a biplot, but it would have been a very crowded one (see Chapter 4, p. 106). Without going into too much detail we can conclude that the words on the right-hand side of Figure 13.4 are typical Baum words, such as but, so, when, which, that, and those, while up, down, over, on, and back are typical Thompson words. By, of, and from are typical Baum words which characterise The Marvellous Land of Oz and Rinkitink in Oz, but not The Patchwork Girl of Oz, etc. There is clearly space for further research, if only by reading the books themselves to obtain full information.

13.5.2 Cluster analysis Centroid linkage is a hierarchical agglomerative clustering procedure. Step by step a tree is built: at each step the two objects most similar in word usage are joined, whether an object is a single book or a cluster of books.9 Once an object is part of a cluster it cannot 'escape' any more. Figure 13.6 shows the results from centroid-linkage clustering for the Wizard of Oz data. The dendrogram indicates how the hierarchical tree was created. Several books merge very early on, for example, 16, 18, 20, 21 by Thompson and 3, 12, 13 by Baum. Clearly, some of Baum's initial books do not cluster closely (for example, 1, 2, 5, 6). This might indicate that initially his style was still under development. A two-cluster solution provides a perfect separation between Baum and Thompson, with The Royal Book of Oz clearly going to the Thompson branch, albeit as its last member. Inspecting the three-cluster solution shows that the separation between the authors is still perfect. The third cluster consists of nos. 10 (Rinkitink in Oz) and 14 (Glinda of Oz), which did not join the Baum branch until later. Note that we would expect no. 2 (The Marvellous Land of Oz) to join these two, but it does not; it actually only merges with the Baum books at almost the last stage, but after 10 and 14. There is no indication of a separate third cluster to suggest the presence of a third author. In theory the cluster analysis could have shown this, because no a priori number of authors has to be specified. It would be possible to make a really detailed evaluation of the reasons for the different positioning of the books, but that would take us too far afield. Binongo's papers are a better guide for this. A truly detailed analysis in a different domain, based on a series of cluster analyses, can be found in Burrows (2004), who examines similarities and dissimilarities, or 'affinities' and 'disaffinities' as he calls them, to classify a large number of poems on the basis of common words.

9 Strictly speaking, the tree in the graph does not have its roots at the bottom and a crown on top. The tree is presented sideways to make the book titles more easily readable.

Fig. 13.6 Centroid-linkage clustering. Agglomeration tree of the Oz books. Numbers next to the vertical axis indicate the order of publication.

13.6 Other approaches in authorship studies One of the strong points of the original studies by Binongo (2003) and Binongo and Mays III (2005) is that the discriminatory power of their analyses was confirmed by the fact that several non-Oz books by both authors could be located in their proper realms. In addition they showed how very different the single Oz book written by Martin Gardner (1998) was from the two 'official' Oz chroniclers. As mentioned above, another technique eminently suitable for answering the original research question is discriminant analysis (see Section 3.6.4, p. 60), but logistic regression is also suitable (see Section 3.6.5, p. 61). Whereas principal component analysis attempts to portray as much of the variability as possible with a few components, discriminant analysis attempts to find those linear combinations of the variables (called discriminant functions) that maximise the mean differences between the two a priori groups of subjects (here: Oz books) on those functions.10 Moreover, the technique may be used to sort out which words are particularly discriminating, and which words can be deleted from the analyses without harming the quality of the separation. A minimal sketch of such an analysis is given at the end of this section. A chapter on authorship attribution is not complete without at least listing some of the more famous cases in which statistics has been used successfully to examine authorship issues. Particularly interesting are: Ledger (1995), Pauline Epistles; Mosteller and Wallace (1984), The Federalist Papers; Craig (2004) and Craig and Kinney (2009), Shakespeare; Tuzzi and Cortelazzo (2018) and Savoy (2018), Elena Ferrante; see also Chapter 6, The Pauline Epistles. The paper by Savoy (2018) also provides a detailed discussion of statistical techniques to establish authorship, with appropriate references. What is a bit surprising from the perspective of the present book is that Savoy does not represent his results graphically. All conclusions are presented as lists of distances between the works of possible authors and the contested novels by Ferrante. Another extensive study of methods for authorship attribution can be found in Grieve (2007), but he does not consider graphical representations of the results from the multivariate analyses either. A fascinating study on the authorship of Beatles songs was published by Glickman, Brown, and Song (2019), but they concentrated more on musical features than on texts (see also Chapter 18).
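As a minimal sketch of what such a discriminant analysis could look like (Python, scikit-learn, random stand-in data; note that with only 20 books of known authorship and 50 words the within-group covariance matrix is singular, so some form of regularisation, here shrinkage, is unavoidable):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
X = rng.random((21, 50))                 # stand-in books x words proportions
author = np.array(["Baum"] * 14 + ["?"] + ["Thompson"] * 6)

known = author != "?"
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
lda.fit(X[known], author[known])

print(lda.predict(X[~known]))            # attribution of the contested book 15
print(lda.predict_proba(X[~known]))      # and the posterior probabilities
```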

13.7 Content summary The central question in this case study was: 'Who has written The Royal Book of Oz, Baum or Thompson?' The basic data for the analyses consisted of a data matrix of books (subjects) × function words (variables). The data showed that the two authors rather differed in their use of function words. The application of two multivariate techniques, principal component analysis and cluster analysis, pointed convincingly to Thompson as the author. The data clearly suggested that if Thompson worked from a preliminary manuscript by Baum, not many traces of this were visible. The numeric analysis was supported by the fact that the book in question appeared over time with three different covers; see Figure 13.7.11

10 Note the two different meanings of function: there is a statistical and a linguistic meaning of the word.
11 https://en.wikipedia.org/wiki/The_Royal_Book_of_Oz; accessed 12/5/2019.

Fig. 13.7 The Royal Book of Oz: Three different authors. (a) Baum; (b) Baum and Thompson; (c) Thompson.

The first version (a) listed Baum as the author, the second (b) Thompson and Baum, and the third (c) just Thompson. Apparently the Del Rey edition (c) has the following copyright notice: 'The Royal Book of Oz was originally published in 1921 under L. Frank Baum's name, although it was entirely the work of Ruth Plumly Thompson. This is the first time the book appears with Thompson as sole author'.12 Using these data made it possible not only to identify the most likely author, but also to obtain an impression of the characteristics of function word usage by the two authors. As such words have primarily a grammatical function in texts, it is Baum's and Thompson's styles, rather than the content, that make the difference in the way they tell their stories. This was explored in this study, but the data are clearly a basis for a more detailed study of this aspect. In musical stylometry, genre and content play a much more important role; see Chapter 18 on the musical works of Bach, Beethoven, and Haydn. Several subordinate questions have arisen from the analyses. A clear distinction in function word usage emerged between Baum's The Marvellous Land of Oz and Rinkitink in Oz versus his The Patchwork Girl of Oz. An in-depth study of these books may point out why this is so, and whether it is related to their content; in other words, is it truly the case that genre, or book content, plays no role in defining a writer's style?

12 https://www.tor.com/2010/02/04/reimagining-fairyland-the-royal-book-of-oz; Comment 2.

Chapter 14

Linguistics: Accentual prose rhythm in mediæval Latin

Abstract Main topics. In this chapter, the focus is on accentual rhythms at the end of prose sentences. In mediæval Latin, prose rhythm is a style element used differently by different authors in different countries and periods. Data. Cursus frequencies from various mediæval authors writing in Latin. Research questions. What kinds of accentual patterns were used by and popular with various authors? Are these patterns the same through time and place? Statistical techniques. Contingency table analysis, ordinal principal component analysis. Keywords Mediæval Latin · Cursus · Sentence endings · Clausulae · Prose rhythm · Ecclesiastic prose · Contingency tables · Ordinal principal component analysis

14.1 Background

Accentual prose rhythm in mediæval Latin
Tore Janson1 investigated accentual prose rhythm of sentence endings in mediæval Latin (often referred to as clausulae),2 from the 9th century to the beginning of the 13th century. Prose rhythms are formed by the emphasis given to particular syllables within words. These patterns are commonly referred to as cursus (Janson 1975, p. 7).3 In his book, Janson examines data on the use of cursus in great detail within the context of historical information about various authors and their works. For the arguments why he chose this particular period and these specific authors and their work, see Janson (1975, p. 8). An alternative historical overview of the use of cursus can also be found in Denholm-Young's contribution to the New Catholic Encyclopedia.4 Janson's Appendix contains his counts of the occurrence of particular forms of cursus in a cross-section of authors, some of whom are represented with several datasets; on the other hand, the authors of some of the works are unknown. One of the characteristics of Janson's analyses is that they are primarily bivariate: authors and works are mainly compared one against the other in pairs. In this chapter, I have attempted to bring Janson's data within a single framework via multivariate analysis, with the intent of providing a backdrop to the different use of cursus across space (Western Europe) and time (four centuries), and to show how multivariate analysis may contribute to the cursus study. It should be remarked that Latin was no longer a common language in the Middle Ages, but primarily used within the church and universities. Therefore, it had to be taught rather than acquired naturally, as had the use of rhythmic Latin prose. To understand the similarities between authors' use of prose rhythm it is therefore important to trace the origin of their education. My information is largely based on Janson's book and Wikipedia, but unfortunately not all such information was available, which at times hampers full interpretation. A review of Janson's book and the methods described therein can be found in Winterbottom (1976), and a discussion of his contribution to cursus research is presented in Stancliffe (1997).

Characteristics of accentual prose rhythm
The core of Janson's work is the rhythmic patterns of sentence endings. The cadences mentioned in this chapter are determined by the number of unaccented syllables between the accented syllables of the two last words, which may have different numbers of syllables. In order to classify a cadence, a notation is necessary to indicate precisely where the accents of the two words lie. Sometimes, two (or even three) words have to be counted as one accented graphical unit, which then consists of a word with accent preceded by one or two unaccented (proclitic) words.5 The cursus designation is explicitly based on the number of unaccented syllables between the two accented ones. Table 14.1 gives the names and accent patterns of the four most common forms of cursus.

1 My thanks go to Tore Janson for his generous permission to use his data and for his valuable comments on the text of an earlier version of this chapter, and to Adriaan Rademaker and Casper de Jonge, who also provided extremely useful comments.
2 Wikipedia: In Roman rhetoric, a clausula was a rhythmic figure used to add finesse and finality to the end of a sentence or phrase. https://en.wikipedia.org/wiki/Clausula_(rhetoric); accessed 15 June 2020.
3 https://www.academia.edu/7049867/Prose_Rhythm_in_Medieval_Latin_from_the_9th_to_the_13th_Century; accessed 15 June 2020. The word cursus is of the fourth declension in Latin, and therefore its plural is cursus as well.
4 New Catholic Encyclopedia: 'cursus'. https://www.encyclopedia.com/religion/encyclopedias-almanacs-transcripts-and-maps/cursus; accessed 15 June 2020.
5 Proclitic: a word pronounced with so little emphasis that it is shortened and forms part of the following word, e.g. at in at home or 't for it in 'twas.

Table 14.1 Types of cursus

Cursus          Translation      Notation   Proclitic             Example
planus          plain            p 3p       p (1 2)               íllum dedúcit
tardus          slow             p 4pp      p (1 3pp)             resilíre tentáverit
velox           fast             pp 4p      pp (1 3p)             hóminem recepístis
trispondiacus   three spondees   p 4p       p (2 2), p (1 1 2)    ágnos admittátis

1 = monosyllabic, oxytone (accent on the only or last syllable); oxy- = sharp; oxytone = sharp tone = accented.
p = paroxytone (accent on the penultimate (last but one) syllable).
pp = proparoxytone (accent on the antepenultimate (last but two) syllable).
Numbers in the notation indicate the number of syllables in the word.
Cursus tardus is sometimes called cursus ecclesiasticus on account of its preponderance in church texts.
∗ This table is a summary of Janson's pp. 12–15.

It has been argued that rhythmic prose was propagated especially among literary circles to promote elegant prose texts; see, for example, Janson (1975, p. 53). Actually, Cicero already considered clausulae rhythmically the most important part of sentences (see Boneva, 1971, p. 173), although he based himself on the number of syllables rather than on emphasis. Janson (1975, p. 15) classifies at least Cicero's letters to Atticus as unrhythmical texts in the sense used here. Janson distinguishes 25 possible two-word sentence endings, ignoring proclitic words. He moreover combines all patterns containing last words of six or more syllables. Janson's data tables in his Appendix contain a few more cursus options, intended to facilitate differently organised cursus research by others. In this chapter, I use his condensed forms as specified in Appendix 2; moreover, I ignore the patterns with six or more syllables. On p. 117, Janson demonstrates how he combined certain rare cadences with more common ones to create his tables.

14.2 Research questions: Accentual prose rhythm in mediæval Latin As indicated above, we will focus on the question whether Janson's entire database can be analysed with multivariate methods, so that the whole collection of texts can be examined jointly. Such analyses will naturally not be as detailed as Janson's, but the results can provide a backdrop for further study of rhythmic prose, and show what the added value of multivariate analysis can be for this and similarly designed studies. Moreover, such an analysis could be used to generate new questions and provide confirmation for conjectures, for instance on the question where certain authors learned their use of cursus. However, in this chapter I will concentrate on presenting the overall results of the techniques used on these data rather than enter into substantive detailed explanations of the similarities shown.


14.3 Data: Janson's data tables In his Appendix 1, Janson (1975, pp. 109–115) presents the cursus tables of 58 authors for 29 cadences. Janson merged 5 cadences with others, so that 24 are left. In addition I ignore the first two rows, containing word patterns with six or more syllables, as very few such words exist in Latin (Janson, 1975, p. 14), as well as the cadence indicated by 1, so that 21 cadences remain. These can be seen to consist of combinations of three patterns for penultimate words (1, p, pp) and seven for ultimate words (5p, 5pp, 4p, 4pp, 3p, 3pp, 2). The tabulated authors originate mainly from Italy, France, and Germany (in that order), and many of them are directly related to popes or are popes themselves. The works analysed by Janson were primarily letters and biographies of saints or other holy persons (hagiographies), but other texts were included whenever relevant (Janson, 1975, p. 37).

14.4 Analysis methods A cursus may be defined as the combination of the stress patterns of the penultimate and ultimate word in a sentence. In writing sentence endings, an author can have preferences for stress patterns in each of the words, and a preference for particular combinations of the two. A preference for a specific cursus should become evident in the use of a particular combination of the stress patterns of the two last words. To establish whether such a preference indeed exists, we may compare the observed cursus usage with the frequencies that would be observed if the author had randomly combined the stress patterns of the last two words. This comparison is done by examining differences between observed and random combinations of the stress patterns. These differences are referred to as residuals. Such residuals are commonly standardised: standardised residuals. If the residuals are small, we may conclude that the author was not particularly interested in cursus. The random combinations are in statistical terminology called expected values, i.e. expected if the choice of stress patterns for the two words is independent. If the standardised residuals are positive, the frequencies of the cursus are higher than expected under independence. If the standardised residuals are negative, the frequencies of the cursus are lower than expected under independence, which means that these cursus are used less by the author. To compare the cursus use of authors, we might also try to formulate external criteria to compare them with, but it is more logical to use an internal criterion, i.e. comparing an author's actual cursus use with the expected values under independence. Note, by the way, that a lack of interest in cursus on the part of an author does not mean that there are no differences in the stress patterns for each of the last two words separately.


14.4.1 Contingency tables In his Appendix 2, Janson (1975, pp. 116–117) outlines how he calculated his comparisons between authors regarding their usage of cursus. The method is not described explicitly in this way, but consists of constructing a contingency table (see Chapter 4, p. 108) and applying a Pearson X² test of independence to this table (see Chapter 4, p. 128). In statistics, the test is also referred to as the χ² test, as the χ² distribution is customarily used to calculate the probability, or p value, under independence of rows (penultimate word cadences) and columns (ultimate word cadences). If the p value is small, the hypothesis of random combination of the cadences of the last two words is rejected, and it is assumed that the author has a preference for particular cursus. Given a significant overall test, we may investigate the standardised residuals. If in a cell of the table the value of the standardised residual is substantially above 2, there is a preference for the cursus defined by that cell; if the value is seriously below -2, the author is assumed to have avoided the cursus defined by that cell. Janson argues against using cells with very small expected values, due to the instability of such small numbers. The statistical argument is correct, but in practice the statistical emphasis is mostly on the unreliability of the p value if small expected values are present. However, statistics has moved on since Janson's work, and at present we can carry out a so-called exact test of independence that takes heed of small expected numbers (see Verbeek & Kroonenberg, 1985, for a review of the calculations involved and the relevant literature). In the following, I will not explicitly show the exact p values, as for Janson's data they turned out to be virtually identical to those based on the χ² distribution.
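The X² test and the standardised residuals are easily computed; the following sketch (Python, scipy) uses a small hypothetical penultimate × ultimate table, as the real tables are in Janson's Appendix 1.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 3 x 7 table: penultimate-word cadences (1, p, pp) by
# ultimate-word cadences; the counts here are made up for illustration.
observed = np.array([[10,  5,  8, 12,  9,  4,  6],
                     [20, 15, 30, 18, 82, 11, 14],
                     [ 9,  7, 45,  6,  8,  5,  7]])

chi2, p, dof, expected = chi2_contingency(observed)
std_residuals = (observed - expected) / np.sqrt(expected)

print(f"X2 = {chi2:.1f}; df = {dof}; p = {p:.4f}")
print(np.round(std_residuals, 1))   # values beyond about +/-2 mark preferences
```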

14.4.2 Ordinal principal component analysis The total dataset forms a matrix of 21 cursus × 58 authors. A crucial question is whether to consider the cursus as subjects and the authors as variables, or vice versa. We might be inclined to automatically think that authors are people (subjects or participants) and the cursus are variables, in particular because the authors might be considered a target sample from all possible authors and the cursus are an exhaustive collection. The problem here is that the research question is not conducive to such a point of view. We are interested in how authors use cursus and we want to compare the authors on their usage; we seek similarities between authors, not similarities between cursus. In visual representations of similarities, we typically use correlations between variables and assess per individual what their relative position is on the variables. This conceptualisation suggests that in this particular case the authors are the variables and the cursus the individuals.

290

14 Linguistics: Accentual prose rhythm in mediæval Latin

The next question to be answered before we can decide on the proper analysis method is that of the measurement level of the variables (for a discussion on measurement levels, see Section 1.5.2, p. 11). In this case: how do we treat the frequencies with which the cursus are used? Considering the large differences between authors' usages, it seems best not to rely too much on the counts having a ratio scale, as is common for frequencies, but instead to limit ourselves by assuming that in this study the frequencies are at most ordinal: higher numbers indicate that particular cursus are used more often. We can let the analysis sort out what is a good way to handle the ordinal numbers.

Characteristics of the Ordinal pca
For the analysis of our matrix of accentual patterns × authors, an acceptable approach would be to carry out a principal component analysis (pca), assuming that the variables are ordinal. The technique is sometimes called, although rather inappropriately, factor analysis (see Section 3.8.1, p. 66). The basic aim of the technique is to summarise the similarities between the authors via an analysis of their correlation matrix. Without going into details, I can state here that ordinal pca can be performed by using categorical pca; see Section 3.8.2, p. 71. A particularly useful graph by which to examine the results is the biplot, which shows the cadences in the same graph as the authors, so that it is immediately clear which authors are particularly fond of which type of cursus. For a general explanation of biplots, see Chapter 4 (p. 106), and also Greenacre (2010). I will here present the general outcomes of such an analysis and point out how to approach the core parts of the output of an ordinal pca. Its technical, computational, and usage aspects are treated in Gifi (1990), and the program used here is part of the Categories package in spss (Meulman & Heiser, 2012); a detailed, fully interpreted example can be found in Linting et al. (2007). Here, I will only discuss that part of the outcomes that is relevant for the substantive interpretation of the similarities. Even though we are not dealing with numeric or ratio measurement variables but with ordinally scaled variables, and analyse the data by an ordinal pca, this does not alter the interpretation. The motor under the bonnet is different but the car reaches the same destination.
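There is no direct CatPCA equivalent in the common Python libraries. As a rough stand-in only, one can rank-transform the columns and apply an ordinary pca, which amounts to a pca of Spearman rather than Pearson correlations. This is not the Gifi/spss algorithm used in this chapter, which optimises the ordinal quantifications themselves, but it conveys the idea of letting the analysis depend only on the order of the frequencies.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
freq = rng.poisson(lam=30, size=(21, 58)).astype(float)  # stand-in: cursus x authors

# Replace each author's frequencies by their ranks: only the order is kept.
ranks = np.apply_along_axis(rankdata, 0, freq)
Z = (ranks - ranks.mean(axis=0)) / ranks.std(axis=0, ddof=1)

pca = PCA(n_components=2).fit(Z)
print("variance accounted for:", pca.explained_variance_ratio_)
```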

14.5 Accentual prose rhythm: Statistical analysis My two analyses of Janson's data are based on two different ways of looking at them. The first focusses on a per-author perspective, in particular the way individual authors formed their sentence endings by combining the accents of the penultimate (first) word with those of the ultimate (final, second) word. The second analysis forgets about the internal structure of a cadence and treats each cadence as a single entity. In that analysis, the frequencies of the cursus are analysed as scores on a set of authors, which function as variables. The intention is to find out the similarity between authors based on their cursus use.

14.5.1 Internal structure of individual authors' cadences A brief word about Janson's choice of authors. He indicates that his major aim was to show the differences between various 'schools' in various countries across four centuries, and to indicate what rhythmic prose exactly signifies. To do this, he also needed authors who did not seem to have an interest in rhythmic prose, i.e. no interest in combining particular penultimate words with ultimate words. This does not necessarily mean that such an author had no preference at all for a cadence in either word. He6 might prefer a particular emphasis for his penultimate word, but have no intention of coupling it with a particular cadence of an ultimate word pattern. This lack of coupling implies independence of the two-word emphases. Insight into cursus use by the authors can be obtained by first constructing contingency tables, with categories for penultimate words as rows and categories for the ultimate words as columns.7 I will demonstrate this with the data from five authors: Hincmar (806; France; treated in detail in Janson, pp. 37–39), Peter Damiani (1007; Italy; Janson, pp. 43–45), Pope Urbanus II (1035; Italy; pp. 60–61), Meinhard von Bamberg (1058; Germany; pp. 24–25), and Pope Lucius III (1100; Italy; pp. 51, 90).8 They cover the entire period of Janson's selection, as well as the three major nations. The emphasis is on how these authors use the combinations of the penultimate and ultimate words, and which type of standard cursus they favour. They have also been selected on the basis of the results from Section 14.5.2 as having different cursus patterns. Showing different patterns will hopefully make it easier to understand the biplots in Section 14.5.2.

Meinhard von Bamberg; 1058, Germany
We start the investigation with the cursus table of the German Meinhard von Bamberg (Table 14.2) and its marginal totals: column totals in the bottom row, and row totals in the last column. The sizes of the row totals (34, 126, 44) show that he had a clear preference for paroxytone (p) penultimate words, i.e. 126/204 × 100% = 62%, compared to 17% for the monosyllabic words (1) and 22% for the proparoxytone words (pp). Roughly the same percentages can be observed within each of the columns, which is indicative of independence between the penultimate and ultimate words (see Chapter 4, p. 108). Standardised residuals can point out possible discrepancies, but they are all 2 or less, which we have taken as an approximate threshold.

6 All authors in Janson's database are men.
7 See for a brief discussion of contingency tables in general, and specifically the X² test for independence of the row and column variables: Chapter 4, p. 108 and p. 128.
8 Generally, the birth years are approximate, or roughly derived from the dates of their publications by deducting 40 years.

Table 14.2 Meinhard von Bamberg; Use of prose rhythm

The last row is a margin of the table; it contains the sums of the frequencies of each of the columns (the marginal column totals). The last column is the other margin of the table; it contains the sums of the frequencies of each of the rows (the marginal row totals).

Table 14.3 Cursus—Meinhard von Bamberg

Cursus          Notation   Frequency   St. res   % of total
planus          p 3p       30          1         14%
tardus          p 4pp      18          0         9%
velox           pp 4p      9           1         4%
trispondiacus   p 4p       25          1         12%

X² = 19; df = 12; p(χ²) = 0.10

Turning our attention to the column totals, we see limited differences; the smallest column totals are 11 and 15 (5% and 7%), and the others are more or less equal, around 36 (18%). The same percentages can also be seen within each of the three rows. This concordance of the row/column patterns within the table with the patterns in their respective margins is a tell-tale indication of independence of the row and column variables. The independence is further confirmed by the small standardised residuals throughout the table. The sum of the squared values of the standardised residuals is the test statistic X². The value of this statistic is X² = 18.5, comparable to its 12 degrees of freedom, and therefore indicative of independence. Furthermore, the p value of the statistic is 0.10, which also points towards independence (all these properties are related); see Chapter 4, p. 128. In contrast with the authors discussed below, Meinhard did not particularly shun the use of the trispondiacus; it seems for him just another combination of the penultimate and ultimate words (Table 14.3). These observations show that Meinhard von Bamberg was not an outspoken cursus user. In fact, Janson chose him explicitly as an example of a nonrhythmic author (p. 24).


Peter Damiani; 1007, Italy A very different table is that for the Italian Peter Damiani (Table 14.4). There are striking differences with Meinhard in the marginal totals for the cadences of both the ultimate and penultimate words. For the ultimate words, Damiani favours the 4p and 3p patterns, and for the penultimate words he has a preference for proparoxytone (pp) and paroxytone (p) patterns. Some standardised residuals have large values, such as -7, -8, 7, and 8. Thus, the observed frequencies are far away from the frequencies expected under the model of independence of the row (penultimate-word accent) and column (ultimate-word accent) variables. Moreover, there is no indication at all of any proportionality of the cadence patterns in the table with those of the margins.

Table 14.4 Peter Damiani; Use of prose rhythm

The last row is a margin of the table; it contains the total frequencies of the columns (the marginal column totals). The last column is the other margin of the table; it contains the total frequencies of the rows (the marginal row totals).

With respect to the four standard cursus, Damiani shows a clear preference for the cursus velox (pp 4p; st. res. = 8; use = 43%), the cursus planus (p 3p; st. res. = 7; use = 24%), and perhaps the cursus tardus (p 4pp; st. res. = 6; use = only 8%). However, he had no interest at all in the cursus trispondiacus (p 4p; st. res. = -7; use = < 1%) or in the pp 3p cadence, which he never used at all (st. res. = -8) (Table 14.5).

Table 14.5 Cursus—Peter Damiani

Cursus          Notation   Frequency   St. res   % of total
planus          p 3p       82          7         24%
tardus          p 4pp      26          6         8%
velox           pp 4p      145         8         43%
trispondiacus   p 4p       1           -7        < 1%

X² = 319; df = 12; p(χ²) < 0.001