Statistical Techniques for the Study of Language and Language Behaviour
Toni Rietveld Roeland van Hout
Mouton de Gruyter Berlin · New York
1993
Mouton de Gruyter (formerly Mouton, The Hague) is a division of Walter de Gruyter & Co., Berlin.
Printed on acid-free paper which falls within the guidelines of the ANSI to ensure permanence and durability.
Library of Congress Cataloging-in-Publication Data

Rietveld, Toni, 1949–
   Statistical techniques for the study of language and language behaviour / Toni Rietveld, Roeland van Hout.
      p. cm.
   Includes bibliographical references and index.
   ISBN 3-11-013663-5 (alk. paper)
   1. Linguistics — Statistical methods. I. van Hout, R., 1952–. II. Title.
   P138.5.R54 1992 — dc20   92-35677 CIP
Die Deutsche Bibliothek — Cataloging-in-Publication Data

Rietveld, Toni:
   Statistical techniques for the study of language and language behaviour / Toni Rietveld ; Roeland van Hout. — Berlin ; New York : Mouton de Gruyter, 1993
   ISBN 3-11-013663-5
   NE: Hout, Roeland van:
© Copyright 1993 by Walter de Gruyter & Co., D-1000 Berlin 30. All rights reserved, including those of translation into foreign languages. No part of this book may be reproduced in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. Printing: Werner Hildebrand, Berlin. — Binding: Lüderitz & Bauer, Berlin. — Printed in Germany.
Acknowledgements We would like to thank our colleagues and friends for the help they have given us in the course of the production of this book. We would especially like to mention Ton de Haan, Jan van Leeuwen, Han Oud, Steven Ralston and Lieselotte Toelle. We are indebted to several generations of students for their comments on earlier drafts of the book. Needless to say, we alone are responsible for any remaining errors or infelicities. Special thanks are due to Yvonne Detmers of the Audiovisual Department of the University of Nijmegen for producing the art work. Finally, we want to thank our departments for their support: the Department of Language and Speech, University of Nijmegen, and the Department of Language and Minorities, University of Tilburg.
Permission of reproduction Permission to reproduce copyright material was obtained from the following: BMDP Statistical Ltd, Ireland, for output from BMDP procedures and corresponding command lines. SAS Institute Inc., Cary, NC 27511-8000, for output from SAS procedures and corresponding command lines. SPSS Inc., Chicago, Illinois 60611, for output from SPSS procedures and corresponding command lines. The statistical tables in Appendices B and F are reproduced with permission of the copyright holders. The sources are mentioned next to the corresponding tables.
Contents

Acknowledgements  v
Permission of reproduction  v
1  Statistics and the study of language behaviour  1
   1.1 The structure of the book  1
   1.2 Basic statistical concepts  3
   1.3 A few preliminary remarks  6
   1.4 Other techniques  7
   1.5 Language statistics  10
2  Analysis of variance  13
   2.1 Introduction  13
   2.2 A simple example  14
   2.3 One-way analysis of variance  16
   2.4 Testing effects: the F distribution  22
   2.5 Multi-factorial designs and interaction  26
   2.6 Random and fixed factors  30
   2.7 Testing effects in a two-factor design  33
   2.8 The interpretation of interactions  35
   2.9 Summary of the procedure  40
   2.10 Other design types  41
   2.11 A hierarchical three-factor design  47
   2.12 Simple main effects  49
   2.13 Post hoc comparisons and contrasts  51
   2.14 Strength of association  58
   2.15 Strange F ratios: testing hypotheses made difficult  60
   2.16 Pooling and unequal cell frequencies  63
   2.17 Overview of the steps in an analysis of variance  64
   Basic terms and concepts  66
   Exercises  67
   References  69
3  Multiple regression  71
   3.1 Introduction  71
   3.2 Simple regression  73
   3.3 Tests of significance  79
   3.4 Two independent variables  81
   3.5 Selecting a model  90
   3.6 Multicollinearity  98
   3.7 Coding nominal independent variables  101
   3.8 Comparing regression lines  107
   3.9 Measurement and specification errors  113
   Basic terms and concepts  114
   Exercises  115
   References  117
4  More ANOVA and MR: assumptions and alternatives  119
   4.1 Introduction  119
   4.2 Assumptions in analysis of variance  120
   4.3 Homoscedasticity  122
   4.4 The impact of transformations  124
   4.5 Unbalanced designs: unequal cell frequencies  128
   4.6 Repeated measures and MANOVA  133
   4.7 A nonparametric alternative: randomization tests  142
   4.8 Bootstrapping  149
   4.9 Assumptions in linear multiple regression  157
   4.10 Plots and diagnostics  159
   4.11 Outliers and influential observations  163
   4.12 Linearizing transformations  168
   4.13 Time series analysis  174
   4.14 Alternative regression techniques  177
   Basic terms and concepts  179
   Exercises  180
   References  182
5  Reliability and agreement among raters  187
   5.1 Introduction  187
   5.2 Reliability: true scores and the error component  190
   5.3 Interrater reliability  193
      5.3.1 Reliability and score models  193
      5.3.2 Testing reliability  201
      5.3.3 Testing differences between reliability coefficients  204
      5.3.4 An example from recent research  205
      5.3.5 Low reliability coefficients  207
   5.4 Interrater agreement  210
      5.4.1 Scale type  210
      5.4.2 Ordinal scales  213
      5.4.3 Nominal scales  217
   5.5 Intrarater reliability and agreement  224
   Basic terms and concepts  225
   Exercises  226
   References  227
6  Introductory matrix algebra  229
   6.1 Introduction  229
   6.2 Matrices and vectors  229
   6.3 Matrix operations  232
   6.4 Three applications  234
   6.5 Some special matrices  239
   6.6 Some key concepts in matrix algebra  240
      6.6.1 The transpose  240
      6.6.2 The determinant and the rank of matrices  241
      6.6.3 The inverse of a matrix  242
      6.6.4 The eigenvalues of a matrix  242
   6.7 Some matrix operations for statistical data  243
   Basic terms and concepts  246
   Exercises  247
   References  248
7  Factor analysis  251
   7.1 Introduction  251
   7.2 Dimensionality reduction  252
   7.3 Axis rotation and linear transformation  256
   7.4 Criteria for dimensionality reduction  258
   7.5 The role of eigenvalues in principal component analysis  259
   7.6 Loadings  263
   7.7 A PC analysis with SPSS: the two-variable example  265
   7.8 Principal component analysis or factor analysis?  267
   7.9 How many factors/components are to be retained?  272
   7.10 Which factor loadings are significant?  274
   7.11 Data problems  275
   7.12 Rotations revisited: rotating factors  278
   7.13 Oblique factors  280
   7.14 Options  282
   7.15 An example of factor analysis  284
   7.16 Factor scores  289
   7.17 Overview of the steps in a factor analysis  291
   Basic terms and concepts  292
   Exercises  292
   References  293
8
A n a l y s i s of f r e q u e n c i e s 8.1 Introduction 8.2 Log-linear analysis of a 2 χ 2 table 8.3 Multi-dimensional contingency tables 8.3.1 A three-dimensional example 8.3.2 Carrying out a log-linear analysis with SPSS 8.3.3 A less successful model 8.3.4 Another example 8.3.5 Low frequencies and structural zeros 8.4 Logit analysis 8.5 Overview of the steps in a log-linear analysis Basic terms and concepts Exercises References
297 297 298 307 307 310 315 315 317 317 323 324 324 325
9  Logistic regression  327
   9.1 Introduction  327
   9.2 Regression and the logistic function  328
   9.3 Evaluating a model  334
   9.4 Polytomous independent variables  339
   9.5 Stepwise selection  343
   9.6 The model matrix approach and GSK  350
   9.7 Comparing different techniques  357
   Basic terms and concepts  358
   Exercises  360
   References  361
Appendices  363
   A: Logarithms  365
   B: Characteristic root test  367
   C: Eigenvalues of random data correlation matrices  369
   D: Command lines statistical programs  371
   E: Key to the exercises  379
   F: Statistical tables  387
      A. Normal Distribution  388
      B. Critical values of F  389
      C. Critical values of χ²  393
Index  395
Chapter 1
Statistics and the study of language behaviour

1.1 Introduction

This book is different from most textbooks on statistical techniques. One difference is the wide range of techniques and subjects covered. The subjects in this book vary from classic techniques like analysis of variance and multiple regression to reliability and agreement analysis, matrix algebra, factor analysis, loglinear modelling and logistic regression. Another difference is that we refrained from focussing in detail on mathematical aspects. We tried to emphasize how statistical techniques are related to the kind of conceptual problems a language researcher has to face in empirical research. Many researchers in the field of language and speech behaviour have only attended an introductory course in statistics, perhaps based on specific textbooks for language research like Butler (1985) and Woods, Fletcher & Hughes (1986). In practice, though, they will face a great variety of techniques when reading research reports or doing empirical work themselves. This book is not meant to bridge the gap between basic and advanced statistics completely and conclusively. A selection had to be made from a great and ever increasing number of statistical techniques. Moreover, the choice of presenting a wide range of statistical techniques prohibits an exhaustive treatment of individual techniques. For instance, the modest number of pages spent explaining analysis of variance in this book can in no way compete with the more than 900 pages in standard textbooks like Winer (1971) and Kirk (1982). We tried to avoid making too many concessions to superficiality or oversimplification. The result may be that readers will sometimes feel that we set quite a high pace in explaining technical topics.
But we felt it was inappropriate to curtail the treatment of the various statistical techniques, given our belief that there is a significant and positive correlation between understanding the mathematical and the conceptual aspects of a statistical technique. The eight subsequent chapters are intended to elucidate the main principles that underlie the various techniques, and to give enough information to
handle standard situations and to understand the output of relevant computer programs. The consequence is, for example, that in Chapter 2, the chapter on analysis of variance, no formulas will be found with double or triple summation signs to calculate the sums of squares of different sources of variation in more complex designs. The interested reader can find these in the comprehensive handbook of Kirk (1982). We merely give the numerical details of a one-way analysis of variance, in order to demonstrate the main principles. Most calculations in this book have been carried out with one of the following three software packages: BMDP, SAS, and SPSS. Relevant parts of the output of specific statistical procedures are reproduced in the subsequent chapters. Important command lines for running specific procedures are given in Appendix D. This appendix may serve as an introductory guide to the use of the relevant statistical routines. It is self-evident that analysis of variance is included in this book. This is probably the most frequently used statistical technique in empirical research. Basic aspects of this technique are discussed in Chapter 2; Chapter 3 follows with a treatment of multiple regression. The final part of Chapter 3 emphasizes the relationship between multiple regression and analysis of variance. In fact, both techniques are based on the same principles of linear modelling and variance decomposition. A separate chapter, Chapter 4, is devoted to assumptions underlying analysis of variance and multiple regression. This chapter adds crucial information to gain an understanding of the possibilities and drawbacks of these two techniques. Furthermore, Chapter 4 gave us the opportunity to discuss the use of multivariate analysis of variance in repeated measurement designs, and to briefly go into two recent developments in statistical analysis, viz. randomization tests and bootstrap procedures.
Although both approaches deserve to be discussed in separate chapters, we opted for a context where they offer solutions to situations in which assumptions are not met, or where estimates of confidence intervals are difficult to obtain. A relatively long chapter, Chapter 5, is dedicated to interobserver reliability and agreement. There are excellent introductions to these topics, but most of them remain in the context of constructing and evaluating tests. Many researchers of speech behaviour carry out experiments in which panels of judges have to use scales for their ratings of persons, speech fragments etc. The classic concepts of 'reliability' and 'agreement' are treated in the context of that kind of experiment. We tried to distinguish clearly between the two concepts and to indicate in which respects they are related. Chapter 6 is an 'interlude', in the sense that no specific statistical technique is discussed. The main objective of this chapter is to provide the
reader with some basic knowledge on matrices. Matrices and related concepts play an important role, not only in multivariate analysis but also in conceptualizing sociolinguistic or psycholinguistic processes. In fact, an understanding of matrix algebra is required to achieve a deeper understanding of modern statistical analysis. We did not begin this book with a chapter on matrix algebra because we were afraid that this would have stopped many readers from continuing. Chapter 7 goes into factor analysis and principal component analysis. This chapter presupposes a basic knowledge of matrix algebra. Chapter 8 discusses rather recent techniques for the analysis of multiway contingency tables by means of loglinear and logit analysis. These techniques have proved extremely useful for the analysis of frequencies. Applications are first and foremost found in dialectology and sociolinguistics, but in phonetics there is a growing interest in these techniques too. In Chapter 9, logistic regression is discussed, a special regression technique which can be used when the dependent variable is dichotomous. This technique is related to logit analysis but it is more powerful. The structure of most of the chapters in this book can be outlined in the following way:
• a short introduction based on a simple example;
• more on the statistical/numerical details used in the technique being discussed;
• a more elaborate example, sometimes borrowed from published data sets;
• further details on assumptions, specific cases, etc.;
• an occasional box containing some basic questions.
If relevant, a flow diagram is given of the successive steps necessary to handle the technique. Each chapter concludes with the following three sections: (1) an overview of basic terms and concepts; (2) exercises (the answers are given in Appendix E); (3) references to relevant literature.
1.2 Basic statistical concepts

The reader is assumed to be familiar with basic statistical concepts and procedures used in empirical research. The most important basic concepts are summarized in the following alphabetical list:¹

¹ The symbol Σ will be used in this list without the index or limit of summation.
Alpha level (= α): The alpha level is the probability of rejecting the null hypothesis (= H₀) when this hypothesis is in fact true. The corresponding error is called the Type I error. See significance level.

Beta level (= β): The beta level is the probability of accepting the null hypothesis when this hypothesis is in fact false. The corresponding error is called the Type II error. See power of a test.

Confidence interval: A confidence interval gives the boundary values of a parameter (e.g. the mean). The parameter in question has a high probability of lying between these boundaries. The interval is based on the observations in the random sample drawn. The probability value applied can be defined as (1 − α). The smaller α is, the larger the confidence interval becomes. The size of the confidence interval also depends on the size of the sample. See standard error.

Critical value: The critical value of a test statistic is the value that the test statistic in question has to exceed to reject the null hypothesis.

Degrees of freedom (= df): The number of observations that are free to vary when restrictions on the data are taken into account. The estimate of the variance in the population, for instance, is obtained by dividing the sum of squared deviations from the mean by its number of degrees of freedom, which is N − 1: Σ(Xᵢ − X̄)²/(N − 1). One degree of freedom is lost by using X̄ as an estimator of the population mean μ. Given X̄ and the observations X₁ to X_(N−1), the value of X_N is determined. Only N − 1 observations are free to vary.

Directional testing: If there are grounds to hypothesize that a difference has a specified direction, one may include the alternative direction in the null hypothesis. This procedure will result in a lower critical value of the test statistic.

Estimate: An estimate or estimator is a function based on the observations in a random sample, which is used to estimate a population parameter.
For instance, the population mean μ is estimated by X̄ = ΣXᵢ/N, and the variance σ² by Σ(Xᵢ − X̄)²/(N − 1).

Least squares estimate: If an observation Y is supposed to be a function of a number of parameters θᵢ: Y = f(θ₁, θ₂, …, θ_k), then the θ̂ᵢ are least squares estimates of the θᵢ if they result in a minimum value of the following squared sum: Σ(Y − f(θ̂₁, θ̂₂, …, θ̂_k))². An obvious example is the sample mean. When the observations X₁, X₂, …, X_N constitute a random sample, we can write Xᵢ as μ + εᵢ, with μ as population mean and εᵢ as random variation with a mean of zero. The sample mean X̄ is a least squares estimate of μ, as Σ(Xᵢ − θ)² is a minimum if θ is X̄ (= ΣXᵢ/N).

Maximum likelihood estimate: A maximum likelihood estimate θ̂ of a parameter θ is an estimate that maximizes the likelihood of getting the sample that is actually observed. For instance, when 7 out of 10 subjects have a 'uvular' /r/ (instead of an 'apical' /r/), the maximum likelihood estimate of the probability p of observing 'uvular' /r/ is p̂ = .70. This estimate yields the highest probability of getting the observed data. If p = .60 is the estimate, the probability of getting the observed percentage is, by using the binomial theorem: C(10, 7) × .60⁷ × .40³ = .215, where C(10, 7) is the binomial coefficient. By setting p = .70, this probability increases to .267. With the value p = .80 it decreases to .201. Thus, the maximum likelihood estimate in this simple example is p̂ = n₁/N.
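The likelihood values in the maximum likelihood example can be checked directly. The following Python sketch is ours, not the book's, and uses only the standard library:

```python
from math import comb

def binom_prob(n, k, p):
    """Probability of observing k successes in n trials when the
    success probability is p (binomial theorem)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# 7 out of 10 subjects show a 'uvular' /r/; compare candidate values of p.
for p in (0.60, 0.70, 0.80):
    print(p, round(binom_prob(10, 7, p), 3))  # .215, .267, .201
# The likelihood peaks at p = 7/10 = .70, the maximum likelihood estimate.
```

Scanning a grid of candidate values for p confirms that the likelihood is largest at the observed proportion.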
Measurement level: In statistical analysis four measurement levels are distinguished: nominal, ordinal, interval, and ratio. The level of measurement is directly related to the kind of algebraic manipulations that make sense. Statistical tests often presuppose a specific level of measurement.

Null hypothesis: The null hypothesis (H₀) is the hypothesis actually tested in a statistical testing procedure. A null hypothesis is formulated in such a way that we can calculate the probability that a test statistic is equal to or exceeds a critical value under the assumption that H₀ is true. Consequently, the null hypothesis contains, for instance, the statement that a parameter of the population distribution has a specific value (for instance zero), or that parameters of the different population distributions compared are equal. If, on the basis of the testing procedure, H₀ is rejected, the alternative hypothesis (H₁) is automatically accepted. The H₀ and H₁ are mutually exclusive and exhaustive: if the one is true, the other must be false.

Power of a test: The power of a test is defined as 1 − β, where β equals the probability of a Type II error. We can increase the power of a test by increasing the number of observations and by increasing the alpha level.

Significance level: The significance level of a test is the maximum probability with which one is willing to risk a Type I error.

Standard error (= SE): The standard error is the standard deviation of the distribution of a sample statistic. For instance, the standard deviation of the distribution of the means X̄ᵢ is the SE of X̄: s_X̄ (which is equal to s/√N).
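Several of the entries above (confidence interval, degrees of freedom, standard error) can be illustrated with one small numerical sketch. The Python code below is ours, with invented data; the interval uses the normal critical value 1.96, whereas with N this small the exact interval would use the t distribution:

```python
import math
import statistics

# Hypothetical sample of N = 10 reaction times in ms (invented data).
x = [512, 487, 530, 499, 541, 476, 503, 520, 495, 511]
n = len(x)
mean = sum(x) / n

# Variance estimate: squared deviations from the mean, divided by the
# degrees of freedom N - 1 (one df is lost by estimating the mean).
var = sum((xi - mean) ** 2 for xi in x) / (n - 1)
assert math.isclose(var, statistics.variance(x))  # stdlib divides by N - 1 too

# Standard error of the mean: s / sqrt(N).
se = math.sqrt(var) / math.sqrt(n)

# Approximate 95% confidence interval for the population mean.
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"mean = {mean:.1f}, s2 = {var:.1f}, SE = {se:.2f}, "
      f"95% CI = ({lower:.1f}, {upper:.1f})")
```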
Statistic: A statistic is a measure computed from observations in a sample, for instance, X̄ (the mean), s² (the variance), and r (the correlation between two variables).

Test statistic: A test statistic is a measure used to test hypotheses about population parameters. Test statistics like z, t, χ², F have known distributions, which are used to evaluate the probability of getting the observed test statistic when the null hypothesis (H₀) is true.

In addition, we assume the reader is familiar with:
• the procedures for basic statistical tests; these include the t test (testing differences between means), and the χ² test (testing goodness of fit and dependency in frequency tables);
• the properties of the statistical distributions z, t and χ²;
• the definition of important statistics like standard deviation, variance, covariance and correlation.

1.3 A few preliminary remarks

In the chapters to follow, most examples will be based on relatively small samples. Using such small sample sizes may suggest that we do not care about the power of statistical tests and the sample size needed to achieve acceptable power. Quite the contrary! We believe that the question whether the chance of committing Type II errors is sufficiently small, or whether estimates of statistics are sufficiently stable given the planned sample size, needs careful consideration before the actual experiment or survey is carried out. Establishing the sample size necessary should not be guided by considerations of 'round numbers', like 5, 10, 20 or 100. In general, one should base sample sizes on statistical considerations, such as: (a) what is the standard deviation of the scores we may expect in a particular study? (b) what is the magnitude of the effect that is considered to be relevant and that should be detected if present? (c) does a particular technique demand relatively large sample sizes (see the distinction, for instance, between maximum likelihood estimators and weighted least squares estimators, Chapter 8)? We refer to Cohen (1977) for a comprehensive study of statistical power, and to Henry (1990) for an introductory discussion on the role of sample sizes. Unfortunately, we cannot deal with methodological questions.
For instance, aspects of causality in relation to a particular research design or the meaning of correlations in relation to causality will not be discussed. Neither do we discuss, for instance, how interactions in an analysis of variance or a loglinear analysis should be interpreted in relation to the theoretical framework of a study and what their role could be in specific designs. There is a somewhat biased emphasis on the statistical techniques as such, which are placed in a linguistic context as much as possible to give the reader an impression of the merits of the technique in question. For methodological questions we refer to standard textbooks like Cook & Campbell (1979).
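The power and sample-size considerations raised in section 1.3 can be made concrete with a small computation. The sketch below is ours, not the authors': it assumes a one-sided z test with a known population standard deviation (a simplification; real planning would usually involve the t distribution), and the effect size, sigma, and sample sizes are invented:

```python
import math
from statistics import NormalDist

def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided z test for a mean difference
    `effect`, assuming a known population standard deviation `sigma`."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    noncentrality = effect / (sigma / math.sqrt(n))
    return NormalDist().cdf(noncentrality - z_crit)

# How many observations are needed to detect an effect of 5 points
# (sigma = 15) with reasonable power? Power grows with sample size.
for n in (20, 50, 100, 150):
    print(n, round(power_one_sided_z(5, 15, n), 2))
```

Running the loop shows why 'round numbers' are a poor guide: with these assumptions a sample of 20 leaves the Type II error risk very high, while around 100 observations the power approaches .95.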
Most techniques treated in this book, like analysis of variance, multiple regression, factor analysis and reliability analysis are said to assume an interval level of measurement. We know that many measurement scales in fact do not meet the requirements of an interval level of measurement in the strictest sense. That is to say, we are not always sure whether the difference between scores like 3 and 4 represents exactly the same difference in the underlying process or state as the difference between, say, 1 and 2. Frequently, these differences will not be exactly the same. A number of nonparametric techniques are available for scores on a lower level than the interval level (the ordinal and nominal ones; see Siegel & Castellan 1988). The claim of a strict relationship between the level of measurement and the kind of statistical procedures allowed goes back to Stevens (1946). There is a general feeling, however (cf. Gaito 1980), that Stevens' strict recommendations need some revision. What actually counts is the set of assumptions underlying a technique rather than the measurement level. We will not and cannot participate in this debate, but we would like to quote Lord (1953) who made the essential point that "the numbers do not know where they came from" (Lord 1953:751). Despite this quotation, the reader may blame us for relating statistical techniques to the level of measurement. We have done so in accordance with the tradition in quantitative empirical research.

1.4 Other techniques

There are many valuable statistical techniques which are not discussed in this book. This does not imply that they are not relevant in the context of the study of language and language behaviour, on the contrary! We will restrict ourselves here to short descriptions of some of these additional techniques. Statistical techniques not discussed here include, in alphabetical order:

Canonical analysis: This technique is an extension of multiple regression analysis.
Whereas multiple regression analysis takes only a single dependent criterion variable, canonical analysis can handle a set of criterion variables. The aim of canonical analysis is to find a weighted sum of criterion variables which has a maximum correlation with a weighted sum of predictor variables. Canonical analysis is in fact quite a general approach which subsumes not only multiple regression but also discriminant analysis and (multivariate) analysis of variance ((M)ANOVA). Cluster analysis: This type of analysis covers a wide range of techniques which are used to classify cases (persons, objects) into groups in which
the distances or dissimilarities between the cases (members) within a group (cluster) are smaller than between cases that belong to different groups (clusters). Cluster analysis is often used as an exploratory tool to discover patterns of similarity.

Covariance analysis: This technique makes it possible to incorporate a continuous variable (the covariate) in an analysis of variance. The scores of the dependent variable are first 'corrected' for the effect(s) of the covariate(s). A standard analysis of variance is applied to the corrected scores. This implies that covariates are in fact not part of the ANOVA design, but are known or supposed to affect the dependent variable. A candidate covariate in achievement tests could be 'age'.

Discriminant analysis: This multivariate technique can be applied to cases (persons, objects) that are classified as belonging to different groups, of which scores on a set of variables are available. The aim of this technique is to find out whether the groups can be distinguished ('discriminated') on the basis of a weighted sum of the original variables. See canonical analysis.

Linear structural relations (LISREL): General techniques for testing causal models have been developed quite recently. These approaches are often subsumed under the label of structural equation models. The approach most popular nowadays in the behavioural sciences was developed by Jöreskog under the name of LISREL. Two components can be distinguished: (1) the structural equation model and (2) the measurement model. The structural equation model specifies the relationships between the so-called exogenous and endogenous variables. An exogenous variable is a variable whose variability is determined by factors outside the hypothesized causal model. The variability of an endogenous variable is determined by exogenous and/or other endogenous variables within the causal model. Normally, the variables in the structural equation model are constructed, i.e.
they are latent or unobserved. The measurement model specifies the relationships between unobserved and observed variables (latent and manifest variables, respectively). Path analysis can be seen as a special and (much) more restricted case of LISREL.

Meta-analysis: A statistical technique used to combine and integrate (statistical) results from different studies on the same subject. A simple example is the z statistic obtained by combining the outcomes of a number of t tests:

z = Σtᵢ / √(Σ dfᵢ/(dfᵢ − 2))    (1)
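Formula (1) is easy to apply in practice. The following Python sketch is ours; the t values and degrees of freedom are invented:

```python
import math

def combined_z(results):
    """Combine t statistics from separate studies into one z statistic,
    following formula (1); `results` is a list of (t, df) pairs."""
    numerator = sum(t for t, df in results)
    denominator = math.sqrt(sum(df / (df - 2) for t, df in results))
    return numerator / denominator

# Three hypothetical studies of the same effect: (t value, df).
studies = [(2.10, 18), (1.40, 25), (2.60, 30)]
print(round(combined_z(studies), 2))  # a single z for the whole collection
```

Note that each df/(df − 2) term is close to 1 for large samples, so the combined z is roughly the sum of the t values divided by the square root of the number of studies.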
Here tᵢ represents the i different t values obtained in i separate studies; dfᵢ represents the corresponding number of degrees of freedom. The statistic z can be used for assessing the statistical significance of the results of the whole collection of studies, even when some of the t tests contradict the general trend, be it by lack of significance or even by showing significant differences with opposite signs.

Multidimensional scaling (MDSCAL): In multidimensional scaling objects are represented in a (reduced) metric space, in such a way that the distances in that space still correspond closely to the empirically obtained (dis)similarities between those objects. This technique is used, for instance, in order to discover structural relationships between objects that have been rated on the extent to which they are (dis)similar.

Path analysis: This technique aims at decomposing relationships between variables in terms of causal models. Causal analysis and testing causal models is especially relevant in nonexperimental research, where the researcher cannot manipulate or randomize variables. The pattern of causal relationships among a set of variables can be displayed graphically by a path diagram in which arrows symbolize the causal flow. The basic statistical technique applied in path analysis is multiple regression analysis. On the basis of the hypothesized causal model a series of multiple regression equations can be formulated which reflect the direct and indirect relationships between the variables in the model. The next step is to test whether the outcomes of the hypothesized model are consistent with the observed correlations between the variables. See LISREL.

Rasch analysis: In an achievement test, the achievement level of a subject is related to his performance on a test item. A subject will perform well on items which scale under his achievement level.
If there is a deterministic relationship between the level of achievement of a subject and the required achievement level of an item, an implicational scale (also called a Guttman scale) may apply. This type of scale is often used in language acquisition research and sociolinguistics (including pidgin and Creole linguistics). One can also apply a probabilistic model. A fairly recent technique used in achievement testing is the so-called Rasch analysis. In this model the ability parameter θᵥ of subject v is related to the difficulty δᵢ of item i and the probability of v giving a correct response to that item. The formula is:

p(correct response) = e^(θᵥ − δᵢ) / (1 + e^(θᵥ − δᵢ))    (2)
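Formula (2) can be written out directly. A small Python sketch of ours, with invented ability and difficulty values:

```python
import math

def p_correct(theta, delta):
    """Rasch model, formula (2): probability that a subject with
    ability theta answers an item with difficulty delta correctly."""
    return math.exp(theta - delta) / (1 + math.exp(theta - delta))

# When ability equals difficulty the probability is exactly .5;
# it rises as ability exceeds difficulty and falls below .5 otherwise.
print(p_correct(0.0, 0.0))             # 0.5
print(round(p_correct(2.0, 1.0), 3))   # one logit above the difficulty
print(round(p_correct(1.0, 2.0), 3))   # one logit below the difficulty
```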
This model is used in estimation procedures to assess t h e change in t h e ability θυ at different points in time. 1.5 Language statistics T h e statistical techniques discussed in this book are not especially related to language research; they are typical of quantitative empirical analysis in general. Nevertheless, it should be clear t h a t we think they are relevant for t h e study of language and language behaviour too, as we have tried to demonstrate through t h e examples given in this book. Does this mean t h a t there is no special relationship between language and statistics? T h e existence of textbooks on the subject of language statistics (e.g. Guiraud 1959) seems to suggest t h e contrary. For another counterargument one may refer to the famous relationship between t h e frequency distribution of words and their rank order as discovered by Zipf (see Zipf 1949), a relationship which is nowadays known as Zipf's Law. Note t h a t it is no coincidence t h a t the subject of language statistics is especially associated with the lexicon (lexico-statistics), the most arbitrary part of t h e linguistic system. Nevertheless, t h e statistics applied in lexico-statistics are not uniquely related to language phenomena either, as becomes clear, for instance, in Baayen (1989). He gives an excellent review of the kind of statistical distributions which may underlie word frequencies. We avoided dealing with these kinds of problems, because a full t r e a t m e n t would require a basic explanation of a series of complex statistical distributions which would take us beyond the scope of this book. References Baayen, R.H. 1989 A Corpus-Based. Approach to Morphological Productivity. Statistical Analyis and Interpretation. PhD dissertation, University of Amsterdam. Butler, C. 1985 Statistics in Linguistics. Oxford: Basil Blackwell. Cohen, J. 1977 Statistical Power Analysis for the Behavioral Sciences. New York: Academic Press. Cook, Th.D. & Campbell, D.T. 
1979 Quasi-Experimentation. Design and Analysis Issues for Field Settings. Chicago: Rand McNally College Publishing Company.
Gaito, J. 1980 Measurement Scales and Statistics: Resurgence of an Old Conception. Psychological Bulletin 87, 564-567.
Guiraud, P. 1959 Problèmes et Méthodes de la Statistique Linguistique. Dordrecht: Reidel.
Henry, G.T. 1990 Practical Sampling. Newbury Park, CA: Sage.
Kirk, R.E. 1982 Experimental Designs: Procedures for the Behavioral Sciences (2nd ed.). Belmont, CA: Brooks/Cole.
Lord, F.M. 1953 On the Statistical Treatment of Football Numbers. American Psychologist 8, 750-751.
Siegel, S. & Castellan, N.J. Jr. 1988 Nonparametric Statistics for the Behavioral Sciences (2nd ed.). New York: McGraw-Hill.
Stevens, S.S. 1946 On the Theory of Scales of Measurement. Science 103, 677-680.
Winer, B. 1971 Statistical Principles in Experimental Design. New York: McGraw-Hill.
Woods, A., Fletcher, P. & Hughes, A. 1986 Statistics in Language Studies. Cambridge: Cambridge University Press.
Zipf, G.K. 1949 Human Behavior and the Principle of Least Effort. An Introduction to Human Ecology. New York: Hafner.
Chapter 2
Analysis of variance
2.1 Introduction

Analysis of variance is a statistical technique which is often applied in the analysis of language data. Its frequent use does not imply that it can be regarded as a simple technique. Nevertheless, the basic concepts are quite straightforward. Analysis of variance can be applied if two or more groups of informants or subjects are compared on the basis of a so-called dependent variable. This may be the reaction time in a word recognition experiment, the performance on a language proficiency test, or the acceptability scores of different types of computer speech. Two characteristics are essential. First, a subject or an informant belongs to one of the groups defined. Second, the performance of the informants or subjects, the dependent variable, is measured or determined by an index or score at the interval level. However, the application of analysis of variance often becomes more complicated, through, for instance, the introduction of more than one factor along which groups of subjects or informants are defined, the occurrence of repeated measures on the same subjects, the introduction of 'random' factors instead of 'fixed' ones, and the use of 'nested' factors. In addition, the researcher may need information to compare groups which is more detailed than the global information provided by the analysis of variance. These questions will be discussed in this chapter. We will restrict ourselves, however, to a presentation of the main components of analysis of variance. We do not endeavour to give a full account of this powerful technique. The reader who is interested in a complete overview is referred to the standard textbooks of Winer (1971) and Kirk (1982). Analysis of variance is based on several fundamental assumptions with respect to the distributional characteristics of the data. These assumptions will be mentioned in this chapter. In Chapter 4 we discuss how these assumptions can be tested and what one should do if they are not met.
Chapter 4 is therefore an indispensable supplement to the matters discussed here.
2.2 A simple example

The basic ideas in analysis of variance can be illustrated by a simple example. Let us assume that we are interested in the effects of three methods of vocabulary learning in a second language. Suppose that three subjects, having no knowledge of the second language in question, are randomly assigned to each of the methods (3 x 3 = 9 subjects). After the training period a vocabulary test is carried out. Table 2.1 contains two hypothetical data sets with test scores.

Table 2.1 Two hypothetical data sets of test scores, both for three groups of three subjects
        data set 1              data set 2
group:   I   II  III     group:  I   II  III
         9   10    1             3    7    1
         1    2    5             4    6    2
         2    6    0             5    5    3
X̄:       4    6    2     X̄:      4    6    2
Both data sets in Table 2.1 have the same mean scores for the three groups. Given the outcomes in both data sets, it is tempting to conclude that the subjects belonging to group II performed better and, consequently, that the method used for that group is the method to be preferred in vocabulary learning. Given the outcomes of the first data set only, however, such a conclusion is hardly warranted. In particular, the large variation in scores between the subjects within the groups, compared to the variation in the mean scores between the groups, casts serious doubt on the value of the differences between the mean scores. It is not unlikely that the variation in mean scores is brought about by the variation between the individual subjects. In that case, the variation between the mean scores only reflects individual differences, not differences between the three methods used. The second data set offers a sounder basis for the conclusion that the methods used produce different results, because the variation within the groups is limited. Because of the restricted amount of variation within the groups in comparison with the differences or variation between the groups, it is more acceptable to reject the hypothesis that no differences occur between the groups. This line of reasoning brings us to the essence of analysis of variance: it involves the comparison of sources of variation. In our example two possible
sources of variation can be distinguished: variation brought about by effects induced by the so-called independent variable (the method), and variation which is not influenced by the independent variable and which is brought about by individual differences between subjects: the error. In the following section (2.3) the elements are presented which are necessary to draw a meaningful comparison between the two types of variation mentioned here. Obviously, it is not enough to just compare the magnitudes of variation measures (whatever these may be). In statistics we are not satisfied with statements like: 'variation A exceeds variation B'. Instead, we look for a statistic that enables us to say with a specified (low) degree of uncertainty that the variation between group means is not based on chance, but on a genuine difference between two or more groups. However, before giving more details of analysis of variance, we first have to answer an obvious question: why should analysis of variance be used when it is also possible to carry out a series of t tests to detect differences between groups? The difference in mean scores of two groups can be statistically evaluated by means of a t test. If the t value calculated passes a specified probability level (normally .05), the so-called null hypothesis, which states that no difference exists between the two mean scores observed, is rejected and the so-called alternative hypothesis that a real difference exists is accepted instead. Indeed, a t test can be applied to all pairs of groups if more than two groups are examined. If at least one of the t tests proves to be significant, the null hypothesis can be rejected. However sound this strategy may seem at first sight, something goes wrong with the so-called type I error. In applying a statistical test, a t test for instance, we run the risk of rejecting the null hypothesis while the null hypothesis is in fact true.
The probability of a type I error is equal to the significance or α level (for instance .05). What happens to the α level if the test in question is repeated? The α level will not be .05, but much higher. This effect can be illustrated with an example based on elementary principles of probability theory. Suppose that the probability of the first letter of a word being an 'e' is .10. What is the probability that at least one out of five words randomly selected from a text begins with the letter 'e'? It is:

1 - p(not 'e')⁵ = 1 - (1 - .10)⁵ = 1 - .90⁵ = .41    (1)
An analogous mechanism is present in the repeated use of a t test on the same data set. Suppose we have gathered the scores of three groups of individuals and that t tests are applied at the .05 level (= α). The .05 level or 5% level is the probability of incorrectly rejecting the null hypothesis.
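This compounding of error probabilities over repeated tests can be expressed as a one-line function; a minimal sketch in Python (the function name is ours, not part of the original text):

```python
# Familywise type I error over k independent tests at level alpha:
# P(at least one incorrect rejection) = 1 - (1 - alpha)^k
def familywise_alpha(alpha, k):
    return 1 - (1 - alpha) ** k

print(round(familywise_alpha(0.10, 5), 2))  # the 'e'-word example: 0.41
print(round(familywise_alpha(0.05, 3), 2))  # three t tests at the .05 level: 0.14
```

The same function reproduces both the word-initial 'e' example and the inflation over three t tests discussed below.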
If there are three groups we have to apply the t test three times. The probability of incorrectly rejecting at least one null hypothesis in k t tests is:

1 - p(not rejecting the null hypothesis)^k = 1 - (1 - α)^k    (2)

If α is .05 and k = 3, that probability is 1 - (.95)³ = 1 - .86 = .14. This value is well above the originally planned level of .05. In this last example it is assumed that the t tests applied use independent information. This is clearly not the case if the same samples are used in different tests: t tests are carried out on the data of samples I and II, I and III, and II and III. That implies that the real α level cannot be calculated precisely, and that is the reason why a different approach should be chosen: analysis of variance.

2.3 One-way analysis of variance

In analysis of variance several terms are used to indicate the independent variable (representing the different groups compared) and its values. When we say that treatment j (representing one of the values of the independent variable) influences the magnitude of the scores, it is said to constitute an effect. If that treatment is one of a series of treatments, the series constitutes a factor, and the treatments are called the levels of that factor. Factors or independent variables are generally referred to by Greek letters: α, β, etc.; the levels of the factors are indicated by subscripts. The first level or value of factor α is referred to by α_1; α_2 represents the second level of that factor, and so on. The symbols are used to formulate models for the data or scores observed. In fact, formulating and testing models for the observed scores constitutes the underlying idea of analysis of variance, although the reports in which this technique is used often suggest otherwise. What is a plausible model of the data in Table 2.2?
The scores can be symbolized by X_ij, in which i refers to the score of individual i and j to the level or treatment of the factor or independent variable investigated. So, every score has a unique identity. We can assume a model in which every score X_ij equals an overall mean called μ. In such a model all data or scores observed are equal and the variation found is supposed to be caused by irrelevant individual differences between the subjects. Clearly, the model X_ij = μ does not suffice for the data of Table 2.2, as the three subpopulation means differ from each other fairly systematically. A term or effect has to be included which represents the value added to or subtracted from the overall average, according to the
Table 2.2 Hypothetical data set for three groups of subjects or treatments

data set 3
     group I     group II    group III
     X_11 = 4    X_12 = 6    X_13 = 2
     X_21 = 4    X_22 = 6    X_23 = 2
     X_31 = 4    X_32 = 6    X_33 = 2
X̄:         4           6           2
specific treatment: α_j = μ_j - μ. This yields the following extended model:

X_ij = μ + α_j    (3)

Given the scores in Table 2.2, we can assign the α's the values 0, +2 and -2; a perfect match results between the scores predicted by the model and the scores observed. In real life, however, scores do exhibit variation within a group of subjects. The model needs to be extended further by a supplementary term representing an error component: e_ij. It indicates the amount of random variation for subject i in group (or treatment) j. The full model of the scores takes on the following form:

X_ij = μ + α_j + e_ij    (4)
The error can take any value, whereas the treatment effects are assumed to have a constant value in each group. This last model can be applied to the two data sets of Table 2.1, using α_1 = 0, α_2 = +2 and α_3 = -2 as estimated values. These values are obtained by X̄_j - X̄. The results are given in Table 2.3.

Table 2.3 The full model (= MODEL 2) specified for two hypothetical data sets of test scores (see Table 2.1), assuming that α_1 = 0, α_2 = +2 and α_3 = -2
data set 1
I             II             III
4+0+5 = 9     4+2+4 = 10     4-2-1 = 1
4+0-3 = 1     4+2-4 =  2     4-2+3 = 5
4+0-2 = 2     4+2+0 =  6     4-2-2 = 0
X̄:      4              6             2

data set 2
I             II             III
4+0-1 = 3     4+2+1 = 7      4-2-1 = 1
4+0+0 = 4     4+2+0 = 6      4-2+0 = 2
4+0+1 = 5     4+2-1 = 5      4-2+1 = 3
X̄:      4             6              2
Table 2.3 shows how all individual scores are partitioned or decomposed. A score is considered to be the sum of separate components. In this table
we assumed specific values for μ and α_j. In empirical research, however, we do not know these values. They can only be estimated on the basis of the patterns present in the data. If the research design contains one factor or one series of treatments only, as in the examples given, the crucial question is whether an effect α_j exists or not. It has to be decided which model underlies the observed data, model 1 or model 2:

X_ij = μ + e_ij          MODEL 1    (5)

X_ij = μ + α_j + e_ij    MODEL 2    (6)
What can be observed in the data is (1) the variation between the subjects or informants within the groups and (2) the variation in mean scores or values between the groups. If there is no effect α_j, the variation observed between the mean scores of the groups (the between group variation) is brought about by the variation in the error component (the within group variation). If an effect α_j is present, the variation between the groups increases because of the differences between the groups. The variation between the groups now consists of error variation plus variation caused by the effect α_j. How can the variation between the groups be compared with the variation within the groups? The common measure for variation is the so-called variance, which in the case of a random sample of scores is defined as follows:

s² = Σ(X - X̄)² / (N - 1) = Σx² / (N - 1)    (7)

where Σx² is the sum of the squared deviations of X from the mean (x = X - X̄) and where N is the sample size. Normally, the term Sum of Squares (= SS) is used to refer to the deviation sum of squares. The sum of squares for the within group variation in the first group of data set 1 (see Table 2.1) is Σ_i (X_i1 - X̄_1)² = (9 - 4)² + (1 - 4)² + (2 - 4)² = 38; the SS for the second group is 32, and 14 for the third group. The SS between the groups is Σ_j (X̄_j - X̄)² = (4 - 4)² + (6 - 4)² + (2 - 4)² = 8. In this example it is obvious that the sum of squares for the between group variation is smaller than the sum of squares for the within group variation. But how can both sums of squares be compared exactly, and how should the number of observations on which the sums of squares are based be taken into account? We have to find a statistic which applies to our data and whose distribution is known if the null hypothesis is true (= model 1: there is no effect α_j and the between group variation is equal to the within group variation).
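The sums of squares just computed can be verified mechanically; a minimal sketch in Python (the data are those of data set 1 in Table 2.1, the variable names are ours):

```python
# Deviation sums of squares for data set 1 (Table 2.1): three groups of three.
groups = [[9, 1, 2], [10, 2, 6], [1, 5, 0]]

def mean(xs):
    return sum(xs) / len(xs)

grand = mean([x for g in groups for x in g])              # grand mean: 4.0
ss_within = [sum((x - mean(g)) ** 2 for x in g) for g in groups]
ss_between = sum((mean(g) - grand) ** 2 for g in groups)  # unweighted, as above

print(ss_within)   # [38.0, 32.0, 14.0]
print(ss_between)  # 8.0
```

The printed values match the 38, 32, 14 and 8 obtained by hand above.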
In order to arrive at this statistic, we have to decompose the observed (deviation) scores into parts. This can be done in the following way:

X_ij - X̄ = (X_ij - X̄_j) + (X̄_j - X̄) = (X̄_j - X̄) + (X_ij - X̄_j)    (8)
X̄ is the general or 'grand' mean of all the observations in the different groups. The above equation is in fact a simple identity, but the interesting point is that the deviation score (on the left) is partitioned into a part which contains the difference between the group mean and the grand mean (= X̄_j - X̄), plus a part which contains the difference between the individual score and its group mean (= X_ij - X̄_j). The first part is the between group component, the second part is the within group component. Squaring and summing this identity over the n_j informants in the jth group results in the following:
Σ_i (X_ij - X̄)² = Σ_i (X̄_j - X̄)² + Σ_i (X_ij - X̄_j)² + 2 Σ_i (X̄_j - X̄)(X_ij - X̄_j)    (9)

(all sums run over i = 1, ..., n_j)
The right-hand part of the equation is obtained by using simple algebra: (a + b)² = a² + b² + 2ab, where a refers to the first term in the right-hand part of equation (8), and b to the second term. The last term on the right of formula (9) equals zero, as can be demonstrated in the following way: Σ_i (X̄_j - X̄)(X_ij - X̄_j) = (X̄_j - X̄) Σ_i (X_ij - X̄_j); by definition Σ_i (X_ij - X̄_j) = 0 and, consequently, the whole last term is 0. The first term of the right-hand part can be rewritten as n_j (X̄_j - X̄)². The whole equation can then be rewritten as:

Σ_i (X_ij - X̄)² = n_j (X̄_j - X̄)² + Σ_i (X_ij - X̄_j)²    (10)
By summing over k groups we obtain:
Σ_j Σ_i (X_ij - X̄)² = Σ_j n_j (X̄_j - X̄)² + Σ_j Σ_i (X_ij - X̄_j)²    (11)

SS_total = SS_between + SS_within
The term in the left-hand part of the equation is the total sum of squares: the sum of squares of all the observations with respect to the grand mean X̄. The right-hand part of the equation contains two terms: the first term represents the sum of squares between groups, the second one represents the
sum of squares within groups. The total sum of squares can be partitioned into two additive and independent components: the between groups sum of squares (SS_between) and the within groups sum of squares (SS_within). Each sum of squares has to be divided by its associated degrees of freedom to obtain a variance estimate. The result is a so-called mean square (MS). The number of degrees of freedom for the total sum of squares is the total number of observations minus 1: N - 1. One degree is lost because of the presence of the grand mean. The number of degrees of freedom for the within groups sum of squares is the number of observations within a group minus 1. Given k groups, the number of degrees of freedom is N - k; k degrees are lost because of the presence of k group means. The number of degrees of freedom associated with the between groups sum of squares is the number of groups minus 1: k - 1. One degree is lost here because of the presence of the grand mean. The degrees of freedom are additive too: N - 1 = (k - 1) + (N - k). The total number of degrees of freedom is equal to the number of degrees of freedom associated with the between groups sum of squares plus the number of degrees of freedom associated with the within groups sum of squares. The variance estimates for the between and within groups parts are obtained by dividing the respective sums of squares by their degrees of freedom:

s_b² = MS_between = Σ_j n_j (X̄_j - X̄)² / (k - 1)    (12)

s_w² = MS_within = Σ_j Σ_i (X_ij - X̄_j)² / (N - k)    (13)
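These variance estimates can be computed directly for the first data set of Table 2.1; a minimal sketch in Python (variable names are ours):

```python
# Mean squares for data set 1 (Table 2.1), following the definitions above.
groups = [[9, 1, 2], [10, 2, 6], [1, 5, 0]]
N = sum(len(g) for g in groups)     # 9 observations in total
k = len(groups)                     # 3 groups

def mean(xs):
    return sum(xs) / len(xs)

grand = mean([x for g in groups for x in g])
ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)     # 24.0
ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)   # 84.0
ms_between = ss_between / (k - 1)   # 12.0
ms_within = ss_within / (N - k)     # 14.0
print(ms_between, ms_within, round(ms_between / ms_within, 3))  # 12.0 14.0 0.857
```

Note that the between groups sum of squares is now weighted by the group sizes n_j, so it equals 3 x 8 = 24.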
For the first data set of Table 2.3, MS_between and MS_within can be calculated in this way. If H0 is true, the ratio MS_between/MS_within has an expected value of 1. If, however, H1 is true, the expected value is higher than 1. In order to clarify the important concept of the expected value of mean squares, we will consider the actual sampling process a little longer. To that end we assume a population from which three samples are drawn; each sample consists of n subjects. The population is characterized by a μ of 10, and an error variance σ_e² of 1.2. Let us assume that a very great number of experiments are carried out, of the type discussed above, with three treatments applied to three samples drawn from the population (μ = 10, σ_e² = 1.2). We assume that the treatments do not have any effect on the scores (H0: α_1 = α_2 = α_3 = 0). That is why we only deal with a single population. The situation can be described as follows:
population: μ = 10, σ_e² = 1.2

exp. 1:  X̄_1 X̄_2 X̄_3   MS_between = 1.0   MS_within = 0.9
exp. 2:  X̄_1 X̄_2 X̄_3   MS_between = 0.8   MS_within = 1.1
exp. 3:  X̄_1 X̄_2 X̄_3   MS_between = 1.3   MS_within = 1.2
...
exp. N   (N = very large)

mean MS_between = 1.2     mean MS_within = 1.2
When the assumption that there are no effects holds, the means of both MS_between and MS_within over a very large number of samples (experiments) are equal to the error variance in the population, which is 1.2. In other words: E(MS_between) = E(MS_within) = σ_e². If the assumption does not hold, however, the mean of MS_between will exceed the mean of MS_within (given that N is large). In that case the expected value of MS_between is σ_e² + nσ_α². In actual experiments the means of the MSs are not available, but we do know the distribution of the ratio of both mean squares under discussion if H0 is true, i.e. the F distribution. The F distribution permits us to determine to what extent the observed ratio of MSs is a probable result under H0. If that result is not 'probable enough', H0 is rejected, and the alternative hypothesis is accepted.

- Decompose the deviation score X_ij - X̄ into two components.
- What is the expected value of the random variable X, when it takes the values 1, 2, 3 and 4, and when p_1 = p_2 = p_3 = p_4 = .25?
- Give the model of the score X_ij in the presence of one effect and error.
- Is MS_total = MS_between + MS_within?
- Assume t tests are carried out on the scores of all pairs of four different samples; an α level of .01 is adopted. What is the probability of incorrectly rejecting at least one null hypothesis?

2.4 Testing effects: the F distribution

The statistical distribution of the quotient of two variances is known as the F distribution. This distribution is in fact a family of distributions. The form of the F distribution is a function of two elements: the degrees of freedom associated with the mean square in the numerator and the degrees of freedom associated with the mean square in the denominator, df_1 and df_2 respectively. The distribution can be computed if both mean squares
or variance estimates have the same expected value. This is the case in our one-way analysis of variance if H0 is true; both variances have the same expected value and the expected F value is 1. In Figure 2.1, two F distributions are depicted which are associated with two pairs of degrees of freedom: (a) df_1 = 4, df_2 = 8, written as F(4,8), and (b) df_1 = 4, df_2 = 30, written as F(4,30).
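The sampling illustration above (a population with μ = 10 and σ_e² = 1.2, three samples of three subjects, no treatment effects) can be imitated with a small Monte Carlo simulation; the sketch below (Python, variable names ours) checks that the long-run means of both mean squares approach the error variance:

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

random.seed(1)                    # fixed seed for reproducibility
mu, var_e = 10.0, 1.2             # population values from the illustration
k, n, experiments = 3, 3, 20000   # groups, subjects per group, replications

ms_between, ms_within = [], []
for _ in range(experiments):
    # under H0 all three samples are drawn from the same population
    groups = [[random.gauss(mu, var_e ** 0.5) for _ in range(n)]
              for _ in range(k)]
    grand = mean([x for g in groups for x in g])
    ms_between.append(sum(n * (mean(g) - grand) ** 2 for g in groups) / (k - 1))
    ms_within.append(sum(sum((x - mean(g)) ** 2 for x in g)
                         for g in groups) / (k * (n - 1)))

print(round(mean(ms_between), 2), round(mean(ms_within), 2))  # both near 1.2
```

Both averages come out close to 1.2, as the expected-value argument predicts.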
Figure 2.1 Plots of the probability density function of F(4,8) (= solid line) and F(4,30) (= dotted line); the corresponding critical values of F for α = 0.05 are 3.84 and 2.69

Figure 2.1 illustrates that the form of the so-called probability density function of F depends on the associated degrees of freedom. In the example, the lower number of df_2 in F(4,8) results in a right tail that approximates the abscissa (x axis) more slowly than the right tail of F(4,30). When the α level is set at .05, the critical value of F(4,8) is 3.84, whereas it is much lower for F(4,30): 2.69. There are two entries for tables containing the critical values of the F distribution: df_1 and df_2. These two determine the value of the associated critical value. A copy of such a table is given in Table B of the statistical tables, where the critical values for two significance levels are reproduced: the 5% and the 1% level. Most computer programs, however, do not give the critical values, but present the probability of observing an
F value equal to or greater than the observed F ratio when H0 is true, i.e., when both mean squares have the same expected value. The derivation of the expected value of MS_between is rather complex, especially if more than one factor is studied or if complications arise, for instance because subjects have been measured more than once (repeated measures). The interested reader is referred to the standard introductions on analysis of variance (Winer 1971, Kirk 1982). What is of interest here is that the between variance is increased by an additional component if H0 is not true, and that this addition is caused by a group effect α_j. The decision strategy with respect to H0 and H1 is based on the value of the quotient of the between and within groups variance (MS_between/MS_within). If the value exceeds a critical value, H0 is rejected (= model 1) and H1 is accepted instead (= model 2). This decision strategy can be illustrated by the outcomes for the two data sets given in Table 2.1. In Table 2.4 the relevant part of the SPSS output of the ANOVA procedure (= ANalysis of VAriance) for data set 1 is reproduced. Table 2.5 contains the same information for data set 2.

Table 2.4 (Part of the) SPSS output from an analysis of variance (ANOVA) applied to data set 1 in Table 2.1

ANALYSIS OF VARIANCE
SOURCE OF VARIATION   SUM OF SQUARES   DF   MEAN SQUARE       F   SIGNIF OF F
GROUP                         24.000    2        12.000   0.857         0.471
RESIDUAL                      84.000    6        14.000
TOTAL                        108.000    8        13.500
In Tables 2.4 and 2.5 three sources of variation are mentioned. The 'group' part relates to the effect α; the residual part represents the error (the 'within variance'). Adding the sums of squares of both parts together gives the total sum of squares; the same holds for the number of degrees of freedom. The group effect is tested by dividing the mean square of the group effect α by the mean square of the residual part. The resulting F ratio in Table 2.4 is not significant (the ratio is even less than 1); H0 (= model 1) has to be accepted. The resulting F ratio in Table 2.5, however, is significant; H1 (= model 2) is accepted. When an F ratio turns out to be significant, we have reason to reject the H0 that both mean squares have the same expected value. For instance, if the F value resulting from MS_between/MS_within
Table 2.5 (Part of the) SPSS output from an analysis of variance (ANOVA) applied to data set 2 in Table 2.1

ANALYSIS OF VARIANCE
SOURCE OF VARIATION   SUM OF SQUARES   DF   MEAN SQUARE        F   SIGNIF OF F
GROUP                         24.000    2        12.000   12.000         0.008
RESIDUAL                       6.000    6         1.000
TOTAL                         30.000    8         3.750
is significant, and if the expected values are E(MS_between) = σ_e² + nσ_α² and E(MS_within) = σ_e², we may assume that σ_α² ≠ 0. This assumption means that there is a difference between the levels or treatments of a factor somewhere. It does not mean that all levels or treatments of a factor have different effects! In the case of more than two levels, it is quite possible that some of them do not differ from each other. When one wants to assess the difference between specific levels, so-called post hoc tests (also called a-posteriori or unplanned tests) have to be used. This procedure will be discussed in Section 2.12. We have seen that E(MS_between) = σ_e² + nσ_α² and E(MS_within) = σ_e². These expressions enable us to determine the estimator of σ_α², that is, the variance of the effects. When we subtract the latter value from the former, we get the following result for Table 2.5: 12 - 1 = 11. This figure estimates nσ_α². As n = 3, σ_α² is estimated by 11/3 = 3.66. The application of analysis of variance requires two crucial assumptions which have to be met by the data:

1. In each treatment population, the distribution of the error is normal (assumption of normal distribution); it is often stated that departures from normality do not seriously affect inferences about the means of the subpopulations.
2. The variance of the error within each group or subpopulation is equal (assumption of homogeneity of variance).

Many textbooks advise the reader not to worry about violations of these assumptions. Recent research, however (cf. Wilcox 1987), has shown that violations should be taken quite seriously. In Chapter 4 we will discuss in more detail the problems which arise when these assumptions are violated.
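The F ratio of Table 2.5 and the variance-component estimate just discussed can be reproduced from the raw scores of data set 2; a sketch in Python (variable names ours):

```python
# Data set 2 (Table 2.1): reproducing the F ratio of Table 2.5 and the
# estimate of the effect variance sigma_alpha^2.
groups = [[3, 4, 5], [7, 6, 5], [1, 2, 3]]
n, k = 3, len(groups)
N = n * k

def mean(xs):
    return sum(xs) / len(xs)

grand = mean([x for g in groups for x in g])
ms_between = sum(n * (mean(g) - grand) ** 2 for g in groups) / (k - 1)        # 12.0
ms_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups) / (N - k) # 1.0
f_ratio = ms_between / ms_within           # 12.0, as in Table 2.5
var_alpha = (ms_between - ms_within) / n   # (12 - 1) / 3 = 11/3
print(f_ratio, var_alpha)
```

The variance-component line is the subtraction described in the text, divided by the number of subjects per group.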
2.5 Multi-factorial designs and interaction

Thus far only a single-factor design (also called a one-way design) has been discussed. The power of analysis of variance, however, lies in the ability to investigate the effects of more than one independent variable on the dependent variable simultaneously. Suppose that the data given in Table 2.6 were sampled in a single-factor experiment with three treatments.

Table 2.6 Single-factor design with three treatments

treatment:   I   II   III
             1    3     9
             2    4    10
             4    8    14
             5    9    15
X̄:           3    6    12

There seems to be a 'break' in the scores within the groups in Table 2.6. The first two subjects achieved lower scores than the last two. By inspecting their personal profiles, it might turn out that the first two members of each group were high school graduates, whereas the last two graduated from university. Consequently, part of the variation within the groups might be due to an 'education' factor we did not take into account in the original model. The scores, originally displayed in a single- or one-factor design, can be rearranged in a two-factor design as shown in Table 2.7.

Table 2.7 Two-factor design for the data of Table 2.6
                     treatment
education        I    II    III
high school      1     3      9
                 2     4     10
university       4     8     14
                 5     9     15
The rearranged data in Table 2.7 clearly suggest that the original one-factor design has to be replaced by a two-factor design. The scores seem sensitive to the effects of two factors: treatment and educational level. The model tested should be expanded by including an effect β_j for education:

X_ijk = μ + α_i + β_j + e_ijk    (18)
Apart from the question how such a model with two effects can be tested, an additional complication may arise when two factors are involved. An important concept, not only in the context of analysis of variance, but also in that of other techniques (see the chapters on regression analysis and loglinear models), is that of interaction. To illustrate this concept, the cell means of four data sets are given in Table 2.8, followed in Figure 2.2 by a graphical representation. In Table 2.8, the variable A is the row variable and variable B the column variable; the row variable receives the subscript i, the column variable the subscript j.

Table 2.8 Cell means of four data sets, showing interaction ((c) and (d)) and no interaction ((a) and (b))

data set (a): no interaction      data set (b): no interaction
      B1   B2   B3                      B1   B2   B3
A1     8   10   15                A1     8   13   18
A2    13   15   20                A2    11   16   21

data set (c): interaction         data set (d): interaction
      B1   B2   B3                      B1   B2   B3
A1     7   10   13                A1     8   11   14
A2     9   14   25                A2    16   13    9
All four data sets of Table 2.8 have two factors. Factor A has two levels, factor B three. The means of the six cells are displayed in Figure 2.2; the ordinate (y axis) relates to the cell values, the abscissa (x axis) to the levels of factor B. What do the data and their associated graphs suggest? Interaction is said to occur in those cases where the differences between the levels of one factor are not equal for all levels of another factor. In data set (b), for instance, the difference between A1 and A2 is 3 for all levels of B: 11 - 8 = 3, 16 - 13 = 3, 21 - 18 = 3. In data set (c), on the other hand, the differences between A1 and A2 are not the same for all levels of factor B: the difference at level B1 is 9 - 7 = 2, at B2 14 - 10 = 4, and at B3 25 - 13 = 12. In that case the two factors are said to interact. Likewise, no interaction effect occurs in data set (a), whereas data set (d) shows a clear interactional pattern. Interactional patterns cannot be reproduced by an additive score model in which only the two main effects α_i and β_j are included. A supplementary term has to be included in addition to the terms α_i and β_j.
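The definition of interaction (unequal differences between the levels of one factor across the levels of the other) can be checked mechanically for the cell means of data sets (b) and (c); a minimal sketch in Python:

```python
# Interaction check: compute the A2 - A1 differences per level of B.
# If all differences are equal, the two factors do not interact.
def row_differences(a1, a2):
    return [y - x for x, y in zip(a1, a2)]

diffs_b = row_differences([8, 13, 18], [11, 16, 21])  # data set (b)
diffs_c = row_differences([7, 10, 13], [9, 14, 25])   # data set (c)
print(diffs_b, len(set(diffs_b)) == 1)  # [3, 3, 3] True  -> no interaction
print(diffs_c, len(set(diffs_c)) == 1)  # [2, 4, 12] False -> interaction
```

The constant differences in (b) and the growing differences in (c) mirror the parallel and diverging lines of Figure 2.2.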
Figure 2.2 Graphical representation of the mean cell values of the four data sets of Table 2.8; the ordinates represent the dependent variable; panels (c) and (d) display interaction
The following model accounts for the mean cell values in the data sets (a) and (b):

X̄ij· = μ + αi + βj    (19)

A triple subscript (Xijk) is used. The first subscript indicates the row, the second the column, and the third the measurement within the cell. A dot notation is used to identify means obtained by summing over the corresponding index (in the above equation, the mean over the measurements within the cells). For data set (b) the estimates of αi and βj are: α1 = −1.5, α2 = 1.5, and β1 = −5, β2 = 0, β3 = 5.¹ Thus, for instance, the scores X̄12· and X̄23· can be partitioned as follows:

X̄12· = 14.5 − 1.5 + 0 = 13

X̄23· = 14.5 + 1.5 + 5 = 21
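The additive model (19) can be checked against all six cell means of data set (b) at once. The parameter values below are the estimates quoted in the text; the variable names are our own.

```python
# Sketch of the additive model (19): X̄ij = μ + αi + βj, for data set (b).
mu = 14.5
alpha = [-1.5, 1.5]
beta = [-5.0, 0.0, 5.0]

predicted = [[mu + a + b for b in beta] for a in alpha]
observed = [[8, 13, 18], [11, 16, 21]]   # cell means of data set (b)
print(predicted)  # reproduces the observed cell means exactly
```

Because the model reproduces every cell mean without error, no interaction term is needed for this data set.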
It is not possible to construct such an additive model for data sets (c) and (d). A systematic difference emerges between the scores predicted by the model and the scores observed. An interaction term (αβ)ij has to be included to obtain a perfect fit:

X̄ij· = μ + αi + βj + (αβ)ij    (20)

For data set (c), α1 = −3 and α2 = 3; β1 = −5, β2 = −1, and β3 = +6. Given these values and the mean cell scores, the values of the interaction term can be determined. For instance,

X̄12· = 13 − 3 − 1 + 1 = 10

X̄23· = 13 + 3 + 6 + 3 = 25
The terms +1 and +3 are specific to the cells (1,2) and (2,3) respectively. More precisely, they belong to specific combinations of levels i and j of the factors A and B. The interaction effect can be calculated as follows. The effects αi and βj denote the deviations of the subpopulation means from the overall population mean: αi = μi· − μ, βj = μ·j − μ. The interaction term (αβ)ij represents the deviation of the cell mean μij from the overall mean minus the effects
¹ In fact, estimates should be written as α̂i, β̂j or ai, bj; we did not use this notation because there is no danger of ambiguity between population parameters and their estimates.
of αi and βj. This gives the following outcomes for the interaction effect:

(αβ)ij = μij − (μ + αi + βj)
       = μij − μ − αi − βj
       = μij − μ − (μi· − μ) − (μ·j − μ)
       = μij − μi· − μ·j + μ    (21)
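Equation (21) can be applied directly to the cell means of data set (c), recovering the +1 and +3 terms mentioned above. The function below is a minimal sketch; its name is our own.

```python
# Sketch of (21): (αβ)ij = μij − μi· − μ·j + μ, from a matrix of cell means.

def interaction_effects(cells):
    p, q = len(cells), len(cells[0])
    grand = sum(sum(row) for row in cells) / (p * q)
    row_means = [sum(row) / q for row in cells]
    col_means = [sum(cells[i][j] for i in range(p)) / p for j in range(q)]
    return [[cells[i][j] - row_means[i] - col_means[j] + grand
             for j in range(q)] for i in range(p)]

ab = interaction_effects([[7, 10, 13], [9, 14, 25]])  # data set (c)
print(ab[0][1], ab[1][2])  # 1.0 3.0 — the terms of cells (1,2) and (2,3)
```

Note that the interaction effects sum to zero within each row and each column, just as the αi and βj sum to zero.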
In a two-factor design, the number of measurements is npq = N. It is assumed that we have an equal number n of measurements in each cell; p is the number of rows (or number of levels of factor A); q is the number of columns (or number of levels of factor B). If p = 2, q = 3, and n = 3, the data can be represented as shown in Table 2.9.

Table 2.9 Two-factor design with p = 2, q = 3, and n = 3

                 1        2        3        Row mean
1                X111     X121     X131     X̄1··
                 X112     X122     X132
                 X113     X123     X133
2                X211     X221     X231     X̄2··
                 X212     X222     X232
                 X213     X223     X233
Column mean      X̄·1·     X̄·2·     X̄·3·
The mean square (MS) for the interaction effect can be calculated by dividing the interaction sum of squares (SS) by the appropriate number of degrees of freedom. Given the notational system of Table 2.9 and the equation found for the interaction effect, the SS can be calculated by the formula:

SSinteraction = n Σi Σj (X̄ij· − X̄i·· − X̄·j· + X̄···)²,  i = 1, …, p; j = 1, …, q    (22)
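Formula (22) and its degrees of freedom can be sketched on a toy data set with the layout of Table 2.9 (p = 2, q = 3, n = 3). The measurements below are hypothetical, chosen only to make the arithmetic traceable.

```python
# Sketch of (22): SS_interaction = n ΣΣ (X̄ij· − X̄i·· − X̄·j· + X̄···)²,
# with df = (p − 1)(q − 1). All data values are made up for illustration.

data = [  # data[i][j] holds the n measurements of cell (i, j)
    [[7, 8, 9], [9, 10, 11], [12, 13, 14]],
    [[8, 9, 10], [13, 14, 15], [24, 25, 26]],
]
p, q, n = len(data), len(data[0]), len(data[0][0])

cell = [[sum(c) / n for c in row] for row in data]          # X̄ij·
grand = sum(sum(row) for row in cell) / (p * q)             # X̄···
row_m = [sum(row) / q for row in cell]                      # X̄i··
col_m = [sum(cell[i][j] for i in range(p)) / p for j in range(q)]  # X̄·j·

ss_ab = n * sum((cell[i][j] - row_m[i] - col_m[j] + grand) ** 2
                for i in range(p) for j in range(q))
df_ab = (p - 1) * (q - 1)
print(ss_ab, df_ab, ss_ab / df_ab)  # SS, df, and MS for the interaction
```

The within-cell ('error') SS would analogously sum (Xijk − X̄ij·)² over all measurements, with pq(n − 1) degrees of freedom.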
The number of degrees of freedom associated with the interaction sum of squares is (p − 1)(q − 1). The MS is the SS divided by the associated number of degrees of freedom. The remaining SS is called the 'error' or 'within-cell variation'; the associated number of degrees of freedom is pq(n − 1).

2.6 Random and fixed factors

How the effects in a two-factor design should be tested depends on the status of the factors involved. Factors can be random or fixed, and this distinction plays a crucial role in defining the appropriate mean square against which the mean square of an effect has to be tested. The distinction between random and fixed factors is based upon the status of the categories or treatments in the sample in relation to the categories or treatments in the population. The categories of the factor may directly represent the relevant categories in the population (the factor is fixed), or they may constitute a random sample of all categories in the population (the factor is random). The question whether all relevant levels of a factor are included in the experiment is not academic. The answer has serious consequences for the calculation of the test statistic, and even for the possibility of testing particular null hypotheses in more complex designs. When we test the efficiency of three methods of pronunciation training, we are primarily interested in the specific methods investigated, for instance A1, A2, and A3. We are not interested in all possible methods, or in the question whether in general different methods result in different pronunciation achievements. The methods A1, A2, and A3 are chosen deliberately, and in a second investigation the same methods would be chosen again. These methods represent our population of 'levels' completely. In other cases, however, not all possible levels of a factor may be included in the research design. An obvious example is the choice of 'words' which represent a word class category. When we test the reaction times of subjects to two specific types of words, we choose words, perhaps 'at random', from a dictionary or lexical database. The selected words do not constitute the whole population of possible words, but only a subset. We would like to generalize our findings to other words of the word class category in question.
These two examples illustrate two types of factors, fixed and random: • a factor is called fixed when all possible levels of that factor are included in the experiment ('methods' in our example); • a factor is called random when the number of possible levels of that factor greatly exceeds the number of levels included in the experiment; furthermore, the levels included should have been selected at random ('words' in the example given above). The distinction between fixed and random factors might suggest that only the latter enable the researcher to generalize his findings to levels not included in the experiment. For example, when we 'use' students of a linguistics class as subjects in an experiment, our sample is not based on a random selection procedure. Consequently, the subjects do not constitute a random selection of all possible levels of the population. However, the question whether the findings may or may not be generalized to other subjects
is not directly and uniquely related to the dichotomy of random and fixed factors. The question that has to be answered is whether the researcher has reason to assume that his subjects constitute a sample which represents the population he is interested in. When the task in the experiment is not of a linguistic nature, he may argue that his subjects are representative of the population of young people with a university background; generalization to that population may then be sufficiently warranted. The presence or absence of fixed and random factors in a design leads to three types of models: a fixed model (containing only fixed factors), a random model (with random factors only), and a mixed model (with both random and fixed factors). The distinction between random and fixed factors has serious consequences for the expected values of the mean squares, and consequently for the pairs of mean squares which have to be used for calculating F ratios. Apart from the conceptual differences between a fixed and a random factor, there are also statistical differences involved. In most cases these differences only show up in the expected values of mean squares, but in some instances, when variance components are at issue (cf. Chapter 5: reliability coefficients), more detailed differences come to the surface. The main difference between a fixed factor A and a random factor A is that the effect αj of the latter constitutes a random variable, with a variance σα² and an expected value E(αj) = 0, whereas the effect αj of a fixed factor is a constant quantity, subject to the constraint that Σ αj = 0. These differences are the consequence of the assumption that the levels of a random factor are just a sample of a large population of levels. The sum of the corresponding effects in the experiment is not necessarily zero; only the expected value of an effect αj is zero.
As all possible levels of a fixed factor occur in the sample, the sum of the effects αj cannot be anything other than zero, as an effect is defined as a deviation score: αj = μj − μ. These differences become manifest when we write out the expected value of the mean square associated with the treatment effect A of a simple one-way design, with the following score model:

Xij = μ + αj + εij    (23)

The expected values of the fixed and random models respectively are:

FIXED      E(MSA) = σε² + n Σ αj² / (p − 1)    (24)

RANDOM     E(MSA) = σε² + n σα²    (25)
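Equation (25) can be made tangible with a hypothetical simulation, not taken from the text: averaging MSA over many replications of a one-way design with a random factor approaches σε² + n·σα². All numerical values below are assumptions chosen for illustration.

```python
# Hypothetical sketch of (25): for a random factor, the long-run average of
# MS_A approaches σε² + n·σα². Parameter values are illustrative assumptions.
import random
random.seed(1)

p, n = 5, 10                 # sampled levels, measurements per level
sigma_a, sigma_e = 2.0, 1.0  # sd of level effects and of error

def ms_a():
    # draw p level effects at random, then n scores per level
    effects = [random.gauss(0, sigma_a) for _ in range(p)]
    scores = [[a + random.gauss(0, sigma_e) for _ in range(n)] for a in effects]
    means = [sum(s) / n for s in scores]
    grand = sum(means) / p
    return n * sum((m - grand) ** 2 for m in means) / (p - 1)

avg = sum(ms_a() for _ in range(2000)) / 2000
print(avg, sigma_e**2 + n * sigma_a**2)  # the two values should be close
```

Setting sigma_a to 0 (no level-to-level variation) makes the average MSA drift toward σε² alone, which is the null-hypothesis situation.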
The difference lies in the second term: for the random factor it is a genuine variance, whereas for the fixed factor that term contains a summation of squared effects. In many textbooks, and in statistical software, these terms are symbolized by σ², both for the random and the fixed factors. We adhere to this practice, but we shall return to this issue in Chapter 5, where the distinction between the two expressions becomes relevant.

2.7 Testing effects in a two-factor design

We have shown how two factors and their interaction may affect the dependent variable, but we have not discussed how the presence or absence of factor effects can be tested statistically. To that end the expected values of the mean squares involved have to be derived. The expected value of a mean square, E(MS), is the mean of the sampling distribution of mean squares obtained when repeated random samples are taken from the given (sub)population. In other words, it is the 'average' value of an MS. The derivation is quite a complicated matter and it will not be discussed here. An important element in the derivation is the question whether the factors have to be considered 'random' or 'fixed'. The expected values in a two-factor design are given in Table 2.10.

Table 2.10 Expected values of mean squares in a two-factor design

Mean Square    df                Expected Value
MSA            p − 1             σε² + n((Q − q)/Q)σαβ² + nq σα²
MSB            q − 1             σε² + n((P − p)/P)σαβ² + np σβ²
MSAB           (p − 1)(q − 1)    σε² + n σαβ²
MSerror        pq(n − 1)         σε²
In Table 2.10, n is the number of observations per cell, P and Q are the numbers of levels of factors A and B in the population, and p and q the respective numbers of levels actually included in the research design. The fractions (P − p)/P and (Q − q)/Q are 0 or (nearly) 1, depending on the status of the levels of the factors. If p equals P, which means that factor A is fixed, the fraction is 0. If q is a random sample out of the population Q, the fraction is (approximately) 1. This implies that the choice of the error term in the F ratio depends on the status of the levels of the factors. If factor A is a random factor, the appropriate error term to test the significance of factor B is the mean square of the interaction term. Since A is a random factor, it follows that p