SEVENTH EDITION
Using Multivariate Statistics

Barbara G. Tabachnick, California State University, Northridge
Linda S. Fidell, California State University, Northridge
Pearson, 330 Hudson Street, New York, NY 10013
Portfolio Manager: Tanimaa Mehra
Content Producer: Kani Kapoor
Portfolio Manager Assistant: Anna Austin
Product Marketer: Jessica Quazza
Art/Designer: Integra Software Services Pvt. Ltd.
Full-Service Project Manager: Integra Software Services Pvt. Ltd.
Compositor: Integra Software Services Pvt. Ltd.
Printer/Binder: LSC Communications, Inc.
Cover Printer: Phoenix Color/Hagerstown
Cover Design: Lumina Datamatics, Inc.
Cover Art: Shutterstock
Acknowledgments of third-party content appear on pages within the text, which constitutes an extension of this copyright page. Copyright © 2019, 2013, 2007 by Pearson Education, Inc. or its affiliates. All Rights Reserved. Printed in the United States of America. This publication is protected by copyright, and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise. For information regarding permissions, request forms and the appropriate contacts within the Pearson Education Global Rights & Permissions department, please visit www.pearsoned.com/permissions/.
PEARSON and ALWAYS LEARNING are exclusive trademarks owned by Pearson Education, Inc. or its affiliates, in the U.S. and/or other countries.
Unless otherwise indicated herein, any third-party trademarks that may appear in this work are the property of their respective owners, and any references to third-party trademarks, logos, or other trade dress are for demonstrative or descriptive purposes only. Such references are not intended to imply any sponsorship, endorsement, authorization, or promotion of Pearson's products by the owners of such marks, or any relationship between the owner and Pearson Education, Inc. or its affiliates, authors, licensees, or distributors.
Many of the designations by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed in initial caps or all caps.
Library of Congress Cataloging-in-Publication Data

Names: Tabachnick, Barbara G., author. | Fidell, Linda S., author.
Title: Using multivariate statistics / Barbara G. Tabachnick, California State University, Northridge, Linda S. Fidell, California State University, Northridge.
Description: Seventh edition. | Boston: Pearson, [2019] | Chapter 14, by Jodie B. Ullman.
Identifiers: LCCN 2017040173 | ISBN 9780134790541 | ISBN 0134790545
Subjects: LCSH: Multivariate analysis. | Statistics.
Classification: LCC QA278 .T3 2019 | DDC 519.5/35--dc23
LC record available at https://lccn.loc.gov/2017040173
Books a la Carte
ISBN-10: 0-13-479054-5
ISBN-13: 978-0-13-479054-1
Contents

Preface

1 Introduction
  1.1 Multivariate Statistics: Why?
    1.1.1 The Domain of Multivariate Statistics: Numbers of IVs and DVs
    1.1.2 Experimental and Nonexperimental Research
    1.1.3 Computers and Multivariate Statistics
    1.1.4 Garbage In, Roses Out?
  1.2 Some Useful Definitions
    1.2.1 Continuous, Discrete, and Dichotomous Data
    1.2.2 Samples and Populations
    1.2.3 Descriptive and Inferential Statistics
    1.2.4 Orthogonality: Standard and Sequential Analyses
  1.3 Linear Combinations of Variables
  1.4 Number and Nature of Variables to Include
  1.5 Statistical Power
  1.6 Data Appropriate for Multivariate Statistics
    1.6.1 The Data Matrix
    1.6.2 The Correlation Matrix
    1.6.3 The Variance-Covariance Matrix
    1.6.4 The Sum-of-Squares and Cross-Products Matrix
    1.6.5 Residuals
  1.7 Organization of the Book

2 A Guide to Statistical Techniques: Using the Book
  2.1 Research Questions and Associated Techniques
    2.1.1 Degree of Relationship Among Variables
      2.1.1.1 Bivariate r
      2.1.1.2 Multiple R
      2.1.1.3 Sequential R
      2.1.1.4 Canonical R
      2.1.1.5 Multiway Frequency Analysis
      2.1.1.6 Multilevel Modeling
    2.1.2 Significance of Group Differences
      2.1.2.1 One-Way ANOVA and t Test
      2.1.2.2 One-Way ANCOVA
      2.1.2.3 Factorial ANOVA
      2.1.2.4 Factorial ANCOVA
      2.1.2.5 Hotelling's T²
      2.1.2.6 One-Way MANOVA
      2.1.2.7 One-Way MANCOVA
      2.1.2.8 Factorial MANOVA
      2.1.2.9 Factorial MANCOVA
      2.1.2.10 Profile Analysis of Repeated Measures
    2.1.3 Prediction of Group Membership
      2.1.3.1 One-Way Discriminant Analysis
      2.1.3.2 Sequential One-Way Discriminant Analysis
      2.1.3.3 Multiway Frequency Analysis (Logit)
      2.1.3.4 Logistic Regression
      2.1.3.5 Sequential Logistic Regression
      2.1.3.6 Factorial Discriminant Analysis
      2.1.3.7 Sequential Factorial Discriminant Analysis
    2.1.4 Structure
      2.1.4.1 Principal Components
      2.1.4.2 Factor Analysis
      2.1.4.3 Structural Equation Modeling
    2.1.5 Time Course of Events
      2.1.5.1 Survival/Failure Analysis
      2.1.5.2 Time-Series Analysis
  2.2 Some Further Comparisons
  2.3 A Decision Tree
  2.4 Technique Chapters
  2.5 Preliminary Check of the Data

3 Review of Univariate and Bivariate Statistics
  3.1 Hypothesis Testing
    3.1.1 One-Sample z Test as Prototype
    3.1.2 Power
    3.1.3 Extensions of the Model
    3.1.4 Controversy Surrounding Significance Testing
  3.2 Analysis of Variance
    3.2.1 One-Way Between-Subjects ANOVA
    3.2.2 Factorial Between-Subjects ANOVA
    3.2.3 Within-Subjects ANOVA
    3.2.4 Mixed Between-Within-Subjects ANOVA
    3.2.5 Design Complexity
      3.2.5.1 Nesting
      3.2.5.2 Latin-Square Designs
      3.2.5.3 Unequal n and Nonorthogonality
      3.2.5.4 Fixed and Random Effects
    3.2.6 Specific Comparisons
      3.2.6.1 Weighting Coefficients for Comparisons
      3.2.6.2 Orthogonality of Weighting Coefficients
      3.2.6.3 Obtained F for Comparisons
      3.2.6.4 Critical F for Planned Comparisons
      3.2.6.5 Critical F for Post Hoc Comparisons
  3.3 Parameter Estimation
  3.4 Effect Size
  3.5 Bivariate Statistics: Correlation and Regression
    3.5.1 Correlation
    3.5.2 Regression
  3.6 Chi-Square Analysis

4 Cleaning Up Your Act: Screening Data Prior to Analysis
  4.1 Important Issues in Data Screening
    4.1.1 Accuracy of Data File
    4.1.2 Honest Correlations
      4.1.2.1 Inflated Correlation
      4.1.2.2 Deflated Correlation
    4.1.3 Missing Data
      4.1.3.1 Deleting Cases or Variables
      4.1.3.2 Estimating Missing Data
      4.1.3.3 Using a Missing Data Correlation Matrix
      4.1.3.4 Treating Missing Data as Data
      4.1.3.5 Repeating Analyses with and without Missing Data
      4.1.3.6 Choosing Among Methods for Dealing with Missing Data
    4.1.4 Outliers
      4.1.4.1 Detecting Univariate and Multivariate Outliers
      4.1.4.2 Describing Outliers
      4.1.4.3 Reducing the Influence of Outliers
      4.1.4.4 Outliers in a Solution
    4.1.5 Normality, Linearity, and Homoscedasticity
      4.1.5.1 Normality
      4.1.5.2 Linearity
      4.1.5.3 Homoscedasticity, Homogeneity of Variance, and Homogeneity of Variance-Covariance Matrices
    4.1.6 Common Data Transformations
    4.1.7 Multicollinearity and Singularity
    4.1.8 A Checklist and Some Practical Recommendations
  4.2 Complete Examples of Data Screening
    4.2.1 Screening Ungrouped Data
      4.2.1.1 Accuracy of Input, Missing Data, Distributions, and Univariate Outliers
      4.2.1.2 Linearity and Homoscedasticity
      4.2.1.3 Transformation
      4.2.1.4 Detecting Multivariate Outliers
      4.2.1.5 Variables Causing Cases to Be Outliers
      4.2.1.6 Multicollinearity
    4.2.2 Screening Grouped Data
      4.2.2.1 Accuracy of Input, Missing Data, Distributions, Homogeneity of Variance, and Univariate Outliers
      4.2.2.2 Linearity
      4.2.2.3 Multivariate Outliers
      4.2.2.4 Variables Causing Cases to Be Outliers
      4.2.2.5 Multicollinearity

5 Multiple Regression
  5.1 General Purpose and Description
  5.2 Kinds of Research Questions
    5.2.1 Degree of Relationship
    5.2.2 Importance of IVs
    5.2.3 Adding IVs
    5.2.4 Changing IVs
    5.2.5 Contingencies Among IVs
    5.2.6 Comparing Sets of IVs
    5.2.7 Predicting DV Scores for Members of a New Sample
    5.2.8 Parameter Estimates
  5.3 Limitations to Regression Analyses
    5.3.1 Theoretical Issues
    5.3.2 Practical Issues
      5.3.2.1 Ratio of Cases to IVs
      5.3.2.2 Absence of Outliers Among the IVs and on the DV
      5.3.2.3 Absence of Multicollinearity and Singularity
      5.3.2.4 Normality, Linearity, and Homoscedasticity of Residuals
      5.3.2.5 Independence of Errors
      5.3.2.6 Absence of Outliers in the Solution
  5.4 Fundamental Equations for Multiple Regression
    5.4.1 General Linear Equations
    5.4.2 Matrix Equations
    5.4.3 Computer Analyses of Small-Sample Example
  5.5 Major Types of Multiple Regression
    5.5.1 Standard Multiple Regression
    5.5.2 Sequential Multiple Regression
    5.5.3 Statistical (Stepwise) Regression
    5.5.4 Choosing Among Regression Strategies
  5.6 Some Important Issues
    5.6.1 Importance of IVs
      5.6.1.1 Standard Multiple Regression
      5.6.1.2 Sequential or Statistical Regression
      5.6.1.3 Commonality Analysis
      5.6.1.4 Relative Importance Analysis
    5.6.2 Statistical Inference
      5.6.2.1 Test for Multiple R
      5.6.2.2 Test of Regression Components
      5.6.2.3 Test of Added Subset of IVs
      5.6.2.4 Confidence Limits
      5.6.2.5 Comparing Two Sets of Predictors
    5.6.3 Adjustment of R²
    5.6.4 Suppressor Variables
    5.6.5 Regression Approach to ANOVA
    5.6.6 Centering When Interactions and Powers of IVs Are Included
    5.6.7 Mediation in Causal Sequence
  5.7 Complete Examples of Regression Analysis
    5.7.1 Evaluation of Assumptions
      5.7.1.1 Ratio of Cases to IVs
      5.7.1.2 Normality, Linearity, Homoscedasticity, and Independence of Residuals
      5.7.1.3 Outliers
      5.7.1.4 Multicollinearity and Singularity
    5.7.2 Standard Multiple Regression
    5.7.3 Sequential Regression
    5.7.4 Example of Standard Multiple Regression with Missing Values Multiply Imputed
  5.8 Comparison of Programs
    5.8.1 IBM SPSS Package
    5.8.2 SAS System
    5.8.3 SYSTAT System

6 Analysis of Covariance
  6.1 General Purpose and Description
  6.2 Kinds of Research Questions
    6.2.1 Main Effects of IVs
    6.2.2 Interactions Among IVs
    6.2.3 Specific Comparisons and Trend Analysis
    6.2.4 Effects of Covariates
    6.2.5 Effect Size
    6.2.6 Parameter Estimates
  6.3 Limitations to Analysis of Covariance
    6.3.1 Theoretical Issues
    6.3.2 Practical Issues
      6.3.2.1 Unequal Sample Sizes, Missing Data, and Ratio of Cases to IVs
      6.3.2.2 Absence of Outliers
      6.3.2.3 Absence of Multicollinearity and Singularity
      6.3.2.4 Normality of Sampling Distributions
      6.3.2.5 Homogeneity of Variance
      6.3.2.6 Linearity
      6.3.2.7 Homogeneity of Regression
      6.3.2.8 Reliability of Covariates
  6.4 Fundamental Equations for Analysis of Covariance
    6.4.1 Sums of Squares and Cross-Products
    6.4.2 Significance Test and Effect Size
    6.4.3 Computer Analyses of Small-Sample Example
  6.5 Some Important Issues
    6.5.1 Choosing Covariates
    6.5.2 Evaluation of Covariates
    6.5.3 Test for Homogeneity of Regression
    6.5.4 Design Complexity
      6.5.4.1 Within-Subjects and Mixed Within-Between Designs
      6.5.4.2 Unequal Sample Sizes
      6.5.4.3 Specific Comparisons and Trend Analysis
      6.5.4.4 Effect Size
    6.5.5 Alternatives to ANCOVA
  6.6 Complete Example of Analysis of Covariance
    6.6.1 Evaluation of Assumptions
      6.6.1.1 Unequal n and Missing Data
      6.6.1.2 Normality
      6.6.1.3 Linearity
      6.6.1.4 Outliers
      6.6.1.5 Multicollinearity and Singularity
      6.6.1.6 Homogeneity of Variance
      6.6.1.7 Homogeneity of Regression
      6.6.1.8 Reliability of Covariates
    6.6.2 Analysis of Covariance
      6.6.2.1 Main Analysis
      6.6.2.2 Evaluation of Covariates
      6.6.2.3 Homogeneity of Regression Run
  6.7 Comparison of Programs
    6.7.1 IBM SPSS Package
    6.7.2 SAS System
    6.7.3 SYSTAT System

7 Multivariate Analysis of Variance and Covariance
  7.1 General Purpose and Description
  7.2 Kinds of Research Questions
    7.2.1 Main Effects of IVs
    7.2.2 Interactions Among IVs
    7.2.3 Importance of DVs
    7.2.4 Parameter Estimates
    7.2.5 Specific Comparisons and Trend Analysis
    7.2.6 Effect Size
    7.2.7 Effects of Covariates
    7.2.8 Repeated-Measures Analysis of Variance
  7.3 Limitations to Multivariate Analysis of Variance and Covariance
    7.3.1 Theoretical Issues
    7.3.2 Practical Issues
      7.3.2.1 Unequal Sample Sizes, Missing Data, and Power
      7.3.2.2 Multivariate Normality
      7.3.2.3 Absence of Outliers
      7.3.2.4 Homogeneity of Variance-Covariance Matrices
      7.3.2.5 Linearity
      7.3.2.6 Homogeneity of Regression
      7.3.2.7 Reliability of Covariates
      7.3.2.8 Absence of Multicollinearity and Singularity
  7.4 Fundamental Equations for Multivariate Analysis of Variance and Covariance
    7.4.1 Multivariate Analysis of Variance
    7.4.2 Computer Analyses of Small-Sample Example
    7.4.3 Multivariate Analysis of Covariance
  7.5 Some Important Issues
    7.5.1 MANOVA Versus ANOVAs
    7.5.2 Criteria for Statistical Inference
    7.5.3 Assessing DVs
      7.5.3.1 Univariate F
      7.5.3.2 Roy-Bargmann Stepdown Analysis
      7.5.3.3 Using Discriminant Analysis
      7.5.3.4 Choosing Among Strategies for Assessing DVs
    7.5.4 Specific Comparisons and Trend Analysis
    7.5.5 Design Complexity
      7.5.5.1 Within-Subjects and Between-Within Designs
      7.5.5.2 Unequal Sample Sizes
  7.6 Complete Examples of Multivariate Analysis of Variance and Covariance
    7.6.1 Evaluation of Assumptions
      7.6.1.1 Unequal Sample Sizes and Missing Data
      7.6.1.2 Multivariate Normality
      7.6.1.3 Linearity
      7.6.1.4 Outliers
      7.6.1.5 Homogeneity of Variance-Covariance Matrices
      7.6.1.6 Homogeneity of Regression
      7.6.1.7 Reliability of Covariates
      7.6.1.8 Multicollinearity and Singularity
    7.6.2 Multivariate Analysis of Variance
    7.6.3 Multivariate Analysis of Covariance
      7.6.3.1 Assessing Covariates
      7.6.3.2 Assessing DVs
  7.7 Comparison of Programs
    7.7.1 IBM SPSS Package
    7.7.2 SAS System
    7.7.3 SYSTAT System

8 Profile Analysis: The Multivariate Approach to Repeated Measures
  8.1 General Purpose and Description
  8.2 Kinds of Research Questions
    8.2.1 Parallelism of Profiles
    8.2.2 Overall Difference Among Groups
    8.2.3 Flatness of Profiles
    8.2.4 Contrasts Following Profile Analysis
    8.2.5 Parameter Estimates
    8.2.6 Effect Size
  8.3 Limitations to Profile Analysis
    8.3.1 Theoretical Issues
    8.3.2 Practical Issues
      8.3.2.1 Sample Size, Missing Data, and Power
      8.3.2.2 Multivariate Normality
      8.3.2.3 Absence of Outliers
      8.3.2.4 Homogeneity of Variance-Covariance Matrices
      8.3.2.5 Linearity
      8.3.2.6 Absence of Multicollinearity and Singularity
  8.4 Fundamental Equations for Profile Analysis
    8.4.1 Differences in Levels
    8.4.2 Parallelism
    8.4.3 Flatness
    8.4.4 Computer Analyses of Small-Sample Example
  8.5 Some Important Issues
    8.5.1 Univariate Versus Multivariate Approach to Repeated Measures
    8.5.2 Contrasts in Profile Analysis
      8.5.2.1 Parallelism and Flatness Significant, Levels Not Significant (Simple-Effects Analysis)
      8.5.2.2 Parallelism and Levels Significant, Flatness Not Significant (Simple-Effects Analysis)
      8.5.2.3 Parallelism, Levels, and Flatness Significant (Interaction Contrasts)
      8.5.2.4 Only Parallelism Significant
    8.5.3 Doubly Multivariate Designs
    8.5.4 Classifying Profiles
    8.5.5 Imputation of Missing Values
  8.6 Complete Examples of Profile Analysis
    8.6.1 Profile Analysis of Subscales of the WISC
      8.6.1.1 Evaluation of Assumptions
      8.6.1.2 Profile Analysis
    8.6.2 Doubly Multivariate Analysis of Reaction Time
      8.6.2.1 Evaluation of Assumptions
      8.6.2.2 Doubly Multivariate Analysis of Slope and Intercept
  8.7 Comparison of Programs
    8.7.1 IBM SPSS Package
    8.7.2 SAS System
    8.7.3 SYSTAT System

9 Discriminant Analysis
  9.1 General Purpose and Description
  9.2 Kinds of Research Questions
    9.2.1 Significance of Prediction
    9.2.2 Number of Significant Discriminant Functions
    9.2.3 Dimensions of Discrimination
    9.2.4 Classification Functions
    9.2.5 Adequacy of Classification
    9.2.6 Effect Size
    9.2.7 Importance of Predictor Variables
    9.2.8 Significance of Prediction with Covariates
    9.2.9 Estimation of Group Means
  9.3 Limitations to Discriminant Analysis
    9.3.1 Theoretical Issues
    9.3.2 Practical Issues
      9.3.2.1 Unequal Sample Sizes, Missing Data, and Power
      9.3.2.2 Multivariate Normality
      9.3.2.3 Absence of Outliers
      9.3.2.4 Homogeneity of Variance-Covariance Matrices
      9.3.2.5 Linearity
      9.3.2.6 Absence of Multicollinearity and Singularity
  9.4 Fundamental Equations for Discriminant Analysis
    9.4.1 Derivation and Test of Discriminant Functions
    9.4.2 Classification
    9.4.3 Computer Analyses of Small-Sample Example
  9.5 Types of Discriminant Analyses
    9.5.1 Direct Discriminant Analysis
    9.5.2 Sequential Discriminant Analysis
    9.5.3 Stepwise (Statistical) Discriminant Analysis
  9.6 Some Important Issues
    9.6.1 Statistical Inference
      9.6.1.1 Criteria for Overall Statistical Significance
      9.6.1.2 Stepping Methods
    9.6.2 Number of Discriminant Functions
    9.6.3 Interpreting Discriminant Functions
      9.6.3.1 Discriminant Function Plots
      9.6.3.2 Structure Matrix of Loadings
    9.6.4 Evaluating Predictor Variables
    9.6.5 Effect Size
    9.6.6 Design Complexity: Factorial Designs
    9.6.7 Use of Classification Procedures
      9.6.7.1 Cross-Validation and New Cases
      9.6.7.2 Jackknifed Classification
      9.6.7.3 Evaluating Improvement in Classification
  9.7 Complete Example of Discriminant Analysis
    9.7.1 Evaluation of Assumptions
      9.7.1.1 Unequal Sample Sizes and Missing Data
      9.7.1.2 Multivariate Normality
      9.7.1.3 Linearity
      9.7.1.4 Outliers
      9.7.1.5 Homogeneity of Variance-Covariance Matrices
      9.7.1.6 Multicollinearity and Singularity
    9.7.2 Direct Discriminant Analysis
  9.8 Comparison of Programs
    9.8.1 IBM SPSS Package
    9.8.2 SAS System
    9.8.3 SYSTAT System

10 Logistic Regression
  10.1 General Purpose and Description
  10.2 Kinds of Research Questions
    10.2.1 Prediction of Group Membership or Outcome
    10.2.2 Importance of Predictors
    10.2.3 Interactions Among Predictors
    10.2.4 Parameter Estimates
    10.2.5 Classification of Cases
    10.2.6 Significance of Prediction with Covariates
    10.2.7 Effect Size
  10.3 Limitations to Logistic Regression Analysis
    10.3.1 Theoretical Issues
    10.3.2 Practical Issues
      10.3.2.1 Ratio of Cases to Variables
      10.3.2.2 Adequacy of Expected Frequencies and Power
      10.3.2.3 Linearity in the Logit
      10.3.2.4 Absence of Multicollinearity
      10.3.2.5 Absence of Outliers in the Solution
      10.3.2.6 Independence of Errors
  10.4 Fundamental Equations for Logistic Regression
    10.4.1 Testing and Interpreting Coefficients
    10.4.2 Goodness of Fit
    10.4.3 Comparing Models
    10.4.4 Interpretation and Analysis of Residuals
    10.4.5 Computer Analyses of Small-Sample Example
  10.5 Types of Logistic Regression
    10.5.1 Direct Logistic Regression
    10.5.2 Sequential Logistic Regression
    10.5.3 Statistical (Stepwise) Logistic Regression
    10.5.4 Probit and Other Analyses
  10.6 Some Important Issues
    10.6.1 Statistical Inference
      10.6.1.1 Assessing Goodness of Fit of Models
      10.6.1.2 Tests of Individual Predictors
    10.6.2 Effect Sizes
      10.6.2.1 Effect Size for a Model
      10.6.2.2 Effect Sizes for Predictors
    10.6.3 Interpretation of Coefficients Using Odds
    10.6.4 Coding Outcome and Predictor Categories
    10.6.5 Number and Type of Outcome Categories
    10.6.6 Classification of Cases
    10.6.7 Hierarchical and Nonhierarchical Analysis
    10.6.8 Importance of Predictors
    10.6.9 Logistic Regression for Matched Groups
  10.7 Complete Examples of Logistic Regression
    10.7.1 Evaluation of Limitations
      10.7.1.1 Ratio of Cases to Variables and Missing Data
      10.7.1.2 Multicollinearity
      10.7.1.3 Outliers in the Solution
    10.7.2 Direct Logistic Regression with Two-Category Outcome and Continuous Predictors
      10.7.2.1 Limitation: Linearity in the Logit
      10.7.2.2 Direct Logistic Regression with Two-Category Outcome
    10.7.3 Sequential Logistic Regression with Three Categories of Outcome
      10.7.3.1 Limitations of Multinomial Logistic Regression
      10.7.3.2 Sequential Multinomial Logistic Regression
  10.8 Comparison of Programs
    10.8.1 IBM SPSS Package
    10.8.2 SAS System
    10.8.3 SYSTAT System

11 Survival/Failure Analysis
  11.1 General Purpose and Description
  11.2 Kinds of Research Questions
    11.2.1 Proportions Surviving at Various Times
    11.2.2 Group Differences in Survival
    11.2.3 Survival Time with Covariates
      11.2.3.1 Treatment Effects
      11.2.3.2 Importance of Covariates
      11.2.3.3 Parameter Estimates
      11.2.3.4 Contingencies Among Covariates
      11.2.3.5 Effect Size and Power
  11.3 Limitations to Survival Analysis
    11.3.1 Theoretical Issues
    11.3.2 Practical Issues
      11.3.2.1 Sample Size and Missing Data
      11.3.2.2 Normality of Sampling Distributions, Linearity, and Homoscedasticity
      11.3.2.3 Absence of Outliers
      11.3.2.4 Differences Between Withdrawn and Remaining Cases
      11.3.2.5 Change in Survival Conditions over Time
      11.3.2.6 Proportionality of Hazards
      11.3.2.7 Absence of Multicollinearity
  11.4 Fundamental Equations for Survival Analysis
    11.4.1 Life Tables
    11.4.2 Standard Error of Cumulative Proportion Surviving
    11.4.3 Hazard and Density Functions
    11.4.4 Plot of Life Tables
    11.4.5 Test for Group Differences
    11.4.6 Computer Analyses of Small-Sample Example
  11.5 Types of Survival Analyses
    11.5.1 Actuarial and Product-Limit Life Tables and Survivor Functions
    11.5.2 Prediction of Group Survival Times from Covariates
      11.5.2.1 Direct, Sequential, and Statistical Analysis
      11.5.2.2 Cox Proportional-Hazards Model
      11.5.2.3 Accelerated Failure-Time Models
      11.5.2.4 Choosing a Method
  11.6 Some Important Issues
    11.6.1 Proportionality of Hazards
    11.6.2 Censored Data
      11.6.2.1 Right-Censored Data
      11.6.2.2 Other Forms of Censoring
    11.6.3 Effect Size and Power
    11.6.4 Statistical Criteria
      11.6.4.1 Test Statistics for Group Differences in Survival Functions
      11.6.4.2 Test Statistics for Prediction from Covariates
    11.6.5 Predicting Survival Rate
      11.6.5.1 Regression Coefficients (Parameter Estimates)
      11.6.5.2 Hazard Ratios
      11.6.5.3 Expected Survival Rates
  11.7 Complete Example of Survival Analysis
    11.7.1 Evaluation of Assumptions
      11.7.1.1 Accuracy of Input, Adequacy of Sample Size, Missing Data, and Distributions
      11.7.1.2 Outliers
      11.7.1.3 Differences Between Withdrawn and Remaining Cases
      11.7.1.4 Change in Survival Experience over Time
      11.7.1.5 Proportionality of Hazards
      11.7.1.6 Multicollinearity
    11.7.2 Cox Regression Survival Analysis
      11.7.2.1 Effect of Drug Treatment
      11.7.2.2 Evaluation of Other Covariates
  11.8 Comparison of Programs
    11.8.1 SAS System
    11.8.2 IBM SPSS Package
    11.8.3 SYSTAT System

12 Canonical Correlation
  12.1 General Purpose and Description
  12.2 Kinds of Research Questions
    12.2.1 Number of Canonical Variate Pairs
    12.2.2 Interpretation of Canonical Variates
    12.2.3 Importance of Canonical Variates and Predictors
    12.2.4 Canonical Variate Scores
  12.3 Limitations
    12.3.1 Theoretical Limitations
    12.3.2 Practical Issues
      12.3.2.1 Ratio of Cases to IVs
      12.3.2.2 Normality, Linearity, and Homoscedasticity
      12.3.2.3 Missing Data
      12.3.2.4 Absence of Outliers
      12.3.2.5 Absence of Multicollinearity and Singularity
  12.4 Fundamental Equations for Canonical Correlation
    12.4.1 Eigenvalues and Eigenvectors
    12.4.2 Matrix Equations
    12.4.3 Proportions of Variance Extracted
    12.4.4 Computer Analyses of Small-Sample Example
  12.5 Some Important Issues
    12.5.1 Importance of Canonical Variates
    12.5.2 Interpretation of Canonical Variates
  12.6 Complete Example of Canonical Correlation
    12.6.1 Evaluation of Assumptions
      12.6.1.1 Missing Data
      12.6.1.2 Normality, Linearity, and Homoscedasticity
      12.6.1.3 Outliers
      12.6.1.4 Multicollinearity and Singularity
    12.6.2 Canonical Correlation
  12.7 Comparison of Programs
    12.7.1 SAS System
    12.7.2 IBM SPSS Package
    12.7.3 SYSTAT System

13 Principal Components and Factor Analysis
  13.1 General Purpose and Description
  13.2 Kinds of Research Questions
    13.2.1 Number of Factors
    13.2.2 Nature of Factors
    13.2.3 Importance of Solutions and Factors
    13.2.4 Testing Theory in FA
    13.2.5 Estimating Scores on Factors
  13.3 Limitations
    13.3.1 Theoretical Issues
    13.3.2 Practical Issues
      13.3.2.1 Sample Size and Missing Data
      13.3.2.2 Normality
      13.3.2.3 Linearity
      13.3.2.4 Absence of Outliers Among Cases
      13.3.2.5 Absence of Multicollinearity and Singularity
      13.3.2.6 Factorability of R
      13.3.2.7 Absence of Outliers Among Variables
  13.4 Fundamental Equations for Factor Analysis
    13.4.1 Extraction
    13.4.2 Orthogonal Rotation
    13.4.3 Communalities, Variance, and Covariance
    13.4.4 Factor Scores
    13.4.5 Oblique Rotation
    13.4.6 Computer Analyses of Small-Sample Example
  13.5 Major Types of Factor Analyses
    13.5.1 Factor Extraction Techniques
      13.5.1.1 PCA Versus FA
      13.5.1.2 Principal Components
      13.5.1.3 Principal Factors
      13.5.1.4 Image Factor Extraction
      13.5.1.5 Maximum Likelihood Factor Extraction
      13.5.1.6 Unweighted Least Squares Factoring
      13.5.1.7 Generalized (Weighted) Least Squares Factoring
      13.5.1.8 Alpha Factoring
    13.5.2 Rotation
      13.5.2.1 Orthogonal Rotation
      13.5.2.2 Oblique Rotation
      13.5.2.3 Geometric Interpretation
    13.5.3 Some Practical Recommendations
  13.6 Some Important Issues
    13.6.1 Estimates of Communalities
    13.6.2 Adequacy of Extraction and Number of Factors
    13.6.3 Adequacy of Rotation and Simple Structure
    13.6.4 Importance and Internal Consistency of Factors
    13.6.5 Interpretation of Factors
    13.6.6 Factor Scores
    13.6.7 Comparisons Among Solutions and Groups
  13.7 Complete Example of FA
    13.7.1 Evaluation of Limitations
      13.7.1.1 Sample Size and Missing Data
      13.7.1.2 Normality
      13.7.1.3 Linearity
      13.7.1.4 Outliers
      13.7.1.5 Multicollinearity and Singularity
      13.7.1.6 Factorability of R
      13.7.1.7 Outliers Among Variables
    13.7.2 Principal Factors Extraction with Varimax Rotation
  13.8 Comparison of Programs
    13.8.1 IBM SPSS Package
    13.8.2 SAS System
    13.8.3 SYSTAT System

14 Structural Equation Modeling, by Jodie B. Ullman
  14.1 General Purpose and Description
  14.2 Kinds of Research Questions
    14.2.1 Adequacy of the Model
    14.2.2 Testing Theory
    14.2.3 Amount of Variance in the Variables Accounted for by the Factors
    14.2.4 Reliability of the Indicators
    14.2.5 Parameter Estimates
    14.2.6 Intervening Variables
    14.2.7 Group Differences
    14.2.8 Longitudinal Differences
    14.2.9 Multilevel Modeling
    14.2.10 Latent Class Analysis
  14.3 Limitations to Structural Equation Modeling
    14.3.1 Theoretical Issues
    14.3.2 Practical Issues
      14.3.2.1 Sample Size and Missing Data
      14.3.2.2 Multivariate Normality and Outliers
      14.3.2.3 Linearity
      14.3.2.4 Absence of Multicollinearity and Singularity
      14.3.2.5 Residuals
  14.4 Fundamental Equations for Structural Equations Modeling
    14.4.1 Covariance Algebra
    14.4.2 Model Hypotheses
    14.4.3 Model Specification
    14.4.4 Model Estimation
    14.4.5 Model Evaluation
    14.4.6 Computer Analysis of Small-Sample Example
  14.5 Some Important Issues
    14.5.1 Model Identification
    14.5.2 Estimation Techniques
    14.5.3 Assessing the Fit of the Model
      14.5.3.6 Choosing Among Fit Indices
    14.5.4 Model Modification
      14.5.4.1 Chi-Square Difference Test
      14.5.4.2 Lagrange Multiplier (LM) Test
      14.5.4.3 Wald Test
      14.5.4.4 Some Caveats and Hints on Model Modification
    14.5.5 Reliability and Proportion of Variance
    14.5.6 Discrete and Ordinal Data
    14.5.7 Multiple Group Models
    14.5.8 Mean and Covariance Structure Models
  14.6 Complete Examples of Structural Equation Modeling Analysis
    14.6.1 Confirmatory Factor Analysis of the WISC
      14.6.1.1 Model Specification for CFA
      14.6.1.2 Evaluation of Assumptions for CFA
      14.6.1.3 CFA Model Estimation and Preliminary Evaluation
      14.6.1.4 Model Modification
    14.6.2 SEM of Health Data
      14.6.2.1 SEM Model Specification
  14.7 Comparison of Programs

15 Multilevel Linear Modeling
  15.1 General Purpose and Description
  15.2 Kinds of Research Questions
  15.3 Limitations to Multilevel Linear Modeling
    15.3.1 Theoretical Issues
    15.3.2 Practical Issues
  15.4 Fundamental Equations
    15.4.1 Intercepts-Only Model
    15.4.2 Model with a First-Level Predictor
      15.4.2.1 Level-1 Equation for a Model with a Level-1 Predictor
      15.4.2.2 Level-2 Equations for a Model with a Level-1 Predictor
      15.4.2.3 Computer Analysis of a Model with a Level-1 Predictor
    15.4.3 Model with Predictors at First and Second Levels
      15.4.3.1 Level-1 Equation for Model with Predictors at Both Levels
      15.4.3.2 Level-2 Equations for Model with Predictors at Both Levels
      15.4.3.3 Computer Analyses of Model with Predictors at First and Second Levels
  15.5 Types of MLM
    15.5.1 Repeated Measures
    15.5.2 Higher-Order MLM
    15.5.3 Latent Variables
    15.5.4 Nonnormal Outcome Variables
    15.5.5 Multiple Response Models
  15.6 Some Important Issues
    15.6.1 Intraclass Correlation
    15.6.2 Centering Predictors and Changes in Their Interpretations
    15.6.3 Interactions
    15.6.4 Random and Fixed Intercepts and Slopes
    15.6.5 Statistical Inference
      15.6.5.1 Assessing Models
      15.6.5.2 Tests of Individual Effects
    15.6.6 Effect Size
    15.6.7 Estimation Techniques and Convergence Problems
    15.6.8 Exploratory Model Building
  15.7 Complete Example of MLM
    15.7.1 Evaluation of Assumptions
      15.7.1.1 Sample Sizes, Missing Data, and Distributions
      15.7.1.2 Outliers
      15.7.1.3 Multicollinearity and Singularity
      15.7.1.4 Independence of Errors: Intraclass Correlations
    15.7.2 Multilevel Modeling
  15.8 Comparison of Programs
    15.8.1 SAS System
    15.8.2 IBM SPSS Package
    15.8.3 HLM Program
    15.8.4 MLwiN Program
    15.8.5 SYSTAT System

16 Multiway Frequency Analysis
  16.1 General Purpose and Description
  16.2 Kinds of Research Questions
    16.2.1 Associations Among Variables
    16.2.2 Effect on a Dependent Variable
    16.2.3 Parameter Estimates
    16.2.4 Importance of Effects
    16.2.5 Effect Size
    16.2.6 Specific Comparisons and Trend Analysis
  16.3 Limitations to Multiway Frequency Analysis
    16.3.1 Theoretical Issues
    16.3.2 Practical Issues
      16.3.2.1 Independence
      16.3.2.2 Ratio of Cases to Variables
      16.3.2.3 Adequacy of Expected Frequencies
      16.3.2.4 Absence of Outliers in the Solution
  16.4 Fundamental Equations for Multiway Frequency Analysis
    16.4.1 Screening for Effects
      16.4.1.1 Total Effect
      16.4.1.2 First-Order Effects
      16.4.1.3 Second-Order Effects
      16.4.1.4 Third-Order Effect
    16.4.2 Modeling
    16.4.3 Evaluation and Interpretation
      16.4.3.1 Residuals
      16.4.3.2 Parameter Estimates
    16.4.4 Computer Analyses of Small-Sample Example
  16.5 Some Important Issues
    16.5.1 Hierarchical and Nonhierarchical Models
    16.5.2 Statistical Criteria
      16.5.2.1 Tests of Models
      16.5.2.2 Tests of Individual Effects
    16.5.3 Strategies for Choosing a Model
      16.5.3.1 IBM SPSS HILOGLINEAR (Hierarchical)
      16.5.3.2 IBM SPSS GENLOG (General Log-Linear)
      16.5.3.3 SAS CATMOD and IBM SPSS LOGLINEAR (General Log-Linear)
  16.6 Complete Example of Multiway Frequency Analysis
    16.6.1 Evaluation of Assumptions: Adequacy of Expected Frequencies
    16.6.2 Hierarchical Log-Linear Analysis
      16.6.2.1 Preliminary Model Screening
      16.6.2.2 Stepwise Model Selection
      16.6.2.3 Adequacy of Fit
      16.6.2.4 Interpretation of the Selected Model
  16.7 Comparison of Programs
    16.7.1 IBM SPSS Package
    16.7.2 SAS System
    16.7.3 SYSTAT System

17 Time-Series Analysis
  17.1 General Purpose and Description
  17.2 Kinds of Research Questions
    17.2.1 Pattern of Autocorrelation
    17.2.2 Seasonal Cycles and Trends
    17.2.3 Forecasting
    17.2.4 Effect of an Intervention
    17.2.5 Comparing Time Series
    17.2.6 Time Series with Covariates
    17.2.7 Effect Size and Power
  17.3 Assumptions of Time-Series Analysis
    17.3.1 Theoretical Issues
    17.3.2 Practical Issues
      17.3.2.1 Normality of Distributions of Residuals
      17.3.2.2 Homogeneity of Variance and Zero Mean of Residuals
      17.3.2.3 Independence of Residuals
      17.3.2.4 Absence of Outliers
      17.3.2.5 Sample Size and Missing Data
  17.4 Fundamental Equations for Time-Series ARIMA Models
    17.4.1 Identification of ARIMA (p, d, q) Models
      17.4.1.1 Trend Components, d: Making the Process Stationary
      17.4.1.2 Auto-Regressive Components
      17.4.1.3 Moving Average Components
      17.4.1.4 Mixed Models
      17.4.1.5 ACFs and PACFs
    17.4.2 Estimating Model Parameters
    17.4.3 Diagnosing a Model
    17.4.4 Computer Analysis of Small-Sample Time-Series Example
  17.5 Types of Time-Series Analyses
    17.5.1 Models with Seasonal Components
    17.5.2 Models with Interventions
      17.5.2.1 Abrupt, Permanent Effects
      17.5.2.2 Abrupt, Temporary Effects
      17.5.2.3 Gradual, Permanent Effects
      17.5.2.4 Models with Multiple Interventions
    17.5.3 Adding Continuous Variables
  17.6 Some Important Issues
    17.6.1 Patterns of ACFs and PACFs
    17.6.2 Effect Size
    17.6.3 Forecasting
    17.6.4 Statistical Methods for Comparing Two Models
  17.7 Complete Examples of Time-Series Analysis
    17.7.1 Time-Series Analysis of Introduction of Seat Belt Law
      17.7.1.1 Evaluation of Assumptions
      17.7.1.2 Baseline Model Identification and Estimation
      17.7.1.3 Baseline Model Diagnosis
      17.7.1.4 Intervention Analysis
    17.7.2 Time-Series Analysis of Introduction of a Dashboard to an Educational Computer Game
      17.7.2.1 Evaluation of Assumptions
      17.7.2.2 Baseline Model Identification and Diagnosis
      17.7.2.3 Intervention Analysis
  17.8 Comparison of Programs
    17.8.1 IBM SPSS Package
    17.8.2 SAS System
    17.8.3 SYSTAT System

18 An Overview of the General Linear Model
  18.1 Linearity and the General Linear Model
  18.2 Bivariate to Multivariate Statistics and Overview of Techniques
    18.2.1 Bivariate Form
    18.2.2 Simple Multivariate Form
    18.2.3 Full Multivariate Form
  18.3 Alternative Research Strategies

Appendix A  A Skimpy Introduction to Matrix Algebra
  A.1 The Trace of a Matrix
  A.2 Addition or Subtraction of a Constant to a Matrix
  A.3 Multiplication or Division of a Matrix by a Constant
  A.4 Addition and Subtraction of Two Matrices
  A.5 Multiplication, Transposes, and Square Roots of Matrices
  A.6 Matrix "Division" (Inverses and Determinants)
  A.7 Eigenvalues and Eigenvectors: Procedures for Consolidating Variance from a Matrix

Appendix B  Research Designs for Complete Examples
  B.1 Women's Health and Drug Study
  B.2 Sexual Attraction Study
  B.3 Learning Disabilities Data Bank
  B.4 Reaction Time to Identify Figures
  B.5 Field Studies of Noise-Induced Sleep Disturbance
  B.6 Clinical Trial for Primary Biliary Cirrhosis
  B.7 Impact of Seat Belt Law
  B.8 The Selene Online Educational Game

Appendix C  Statistical Tables
  C.1 Normal Curve Areas
  C.2 Critical Values of the t Distribution for α = .05 and .01, Two-Tailed Test
  C.3 Critical Values of the F Distribution
  C.4 Critical Values of Chi Square (χ²)
  C.5 Critical Values for Squared Multiple Correlation (R²) in Forward Stepwise Selection: α = .05
  C.6 Critical Values for FMAX (S²max/S²min) Distribution for α = .05 and .01

References

Index
Preface

Some good things seem to go on forever: friendship and updating this book. It is difficult to believe that the first edition manuscript was typewritten, with real cutting and pasting. The publisher required a paper manuscript with numbered pages; that was almost our downfall. We could write a book on multivariate statistics, but we couldn't get the same number of pages (about 1200, double-spaced) twice in a row. SPSS was in release 9.0, and the other program we demonstrated was BMDP. There were a mere 11 chapters, of which 6 described techniques. Multilevel and structural equation modeling were not yet ready for prime time. Logistic regression and survival analysis were not yet popular.

Material new to this edition includes a redo of all SAS examples, with a pretty new output format and replacement of interactive analyses that are no longer available. We've also re-run the IBM SPSS examples to show the new output format. We've tried to update the references in all chapters, including only classic citations if they date prior to 2000. New work on relative importance has been incorporated in multiple regression, canonical correlation, and logistic regression analysis, complete with demonstrations. Multiple imputation procedures for dealing with missing data have been updated, and we've added a new time-series example, taking advantage of an IBM SPSS expert modeler that replaces previous tea-leaf reading aspects of the analysis.

Our goals in writing the book remain the same as in all previous editions: to present complex statistical procedures in a way that is maximally useful and accessible to researchers who are not necessarily statisticians. We strive to be short on theory but long on conceptual understanding. The statistical packages have become increasingly easy to use, making it all the more critical to make sure that they are applied with a good understanding of what they can and cannot do. But above all else: what does it all mean?

We have not changed the basic format underlying all of the technique chapters, now 14 of them. We start with an overview of the technique, followed by the types of research questions the techniques are designed to answer. We then provide the cautionary tale: what you need to worry about and how to deal with those worries. Then come the fundamental equations underlying the technique, which some readers truly enjoy working through (we know because they helpfully point out any errors and/or inconsistencies they find); but other readers discover they can skim (or skip) the section without any loss to their ability to conduct meaningful analysis of their research. The fundamental equations are in the context of a small, made-up, usually silly data set for which computer analyses are provided, usually IBM SPSS and SAS. Next, we delve into issues surrounding the technique (such as different types of the analysis, follow-up procedures to the main analysis, and effect size, if it is not amply covered elsewhere). Finally, we provide one or two full-bore analyses of an actual real-life data set together with a Results section appropriate for a journal. Data sets for these examples are available at www.pearsonhighered.com in IBM SPSS, SAS, and ASCII formats. We end each technique chapter with a comparison of features available in IBM SPSS, SAS, SYSTAT, and sometimes other specialized programs. SYSTAT is a statistical package that we reluctantly had to drop a few editions ago for lack of space.
We apologize in advance for the heft of the book; it is not our intention to line the coffers of chiropractors, physical therapists, acupuncturists, and the like, but there's really just so much to say. As to our friendship, it's still going strong despite living in different cities. Art has taken the place of creating belly dance costumes for both of us, but we remain silly in outlook, although serious in our analysis of research.

The lineup of people to thank grows with each edition, far too extensive to list: students, reviewers, editors, and readers who send us corrections and point out areas of confusion. As always, we take full responsibility for remaining errors and lack of clarity.
Barbara G. Tabachnick
Linda S. Fidell
Chapter 1
Introduction

Learning Objectives
1.1 Explain the importance of multivariate techniques in analyzing research data
1.2 Describe the basic statistical concepts used in multivariate analysis
1.3 Explain how multivariate analysis is used to determine relationships between variables
1.4 Summarize the factors to be considered for the selection of variables in multivariate analysis
1.5 Summarize the importance of statistical power in research study design
1.6 Describe the types of data sets used in multivariate statistics
1.7 Outline the organization of the text
1.1 Multivariate Statistics: Why?

Multivariate statistics are increasingly popular techniques used for analyzing complicated data sets. They provide analysis when there are many independent variables (IVs) and/or many dependent variables (DVs), all correlated with one another to varying degrees. Because of the difficulty in addressing complicated research questions with univariate analyses and because of the availability of highly developed software for performing multivariate analyses, multivariate statistics have become widely used. Indeed, a standard univariate statistics course only begins to prepare a student to read research literature or a researcher to produce it.

But how much harder are the multivariate techniques? Compared with the multivariate methods, univariate statistical methods are so straightforward and neatly structured that it is hard to believe they once took so much effort to master. Yet many researchers apply and correctly interpret results of intricate analysis of variance before the grand structure is apparent to them. The same can be true of multivariate statistical methods. Although we are delighted if you gain insights into the full multivariate general linear model,¹ we have accomplished our goal if you feel comfortable selecting and setting up multivariate analyses and interpreting the computer output.

Multivariate methods are more complex than univariate by at least an order of magnitude. However, for the most part, the greater complexity requires few conceptual leaps. Familiar concepts such as sampling distributions and homogeneity of variance simply become more elaborate.

Multivariate models have not gained popularity by accident, or even by sinister design. Their growing popularity parallels the greater complexity of contemporary research. In psychology, for example, we are less and less enamored of the simple, clean, laboratory study, in which pliant, first-year college students each provide us with a single behavioral measure on cue.

¹ Chapter 17 attempts to foster such insights.
1.1.1 The Domain of Multivariate Statistics: Numbers of IVs and DVs

Multivariate statistical methods are an extension of univariate and bivariate statistics. Multivariate statistics are the complete or general case, whereas univariate and bivariate statistics are special cases of the multivariate model. If your design has many variables, multivariate techniques often let you perform a single analysis instead of a series of univariate or bivariate analyses.

Variables are roughly dichotomized into two major types: independent and dependent. Independent variables (IVs) are the differing conditions (treatment vs. placebo) to which you expose your research participants or the characteristics (tall or short) that the participants themselves bring into the research situation. IVs are usually considered predictor variables because they predict the DVs, the response or outcome variables. Note that IV and DV are defined within a research context; a DV in one research setting may be an IV in another. Additional terms for IVs and DVs are predictor-criterion, stimulus-response, task-performance, or simply input-output. We use IV and DV throughout this book to identify variables that belong on one side of an equation or the other, without causal implication. That is, the terms are used for convenience rather than to indicate that one of the variables caused or determined the size of the other.

The term univariate statistics refers to analyses in which there is a single DV. There may be, however, more than one IV. For example, the amount of social behavior of graduate students (the DV) is studied as a function of course load (one IV) and type of training in social skills to which students are exposed (another IV). Analysis of variance is a commonly used univariate statistic.

Bivariate statistics frequently refers to analysis of two variables, where neither is an experimental IV and the desire is simply to study the relationship between the variables (e.g., the relationship between income and amount of education). Bivariate statistics, of course, can be applied in an experimental setting, but usually they are not. Prototypical examples of bivariate statistics are the Pearson product-moment correlation coefficient and chi-square analysis. (Chapter 3 reviews univariate and bivariate statistics.)

With multivariate statistics, you simultaneously analyze multiple dependent and multiple independent variables. This capability is important in both nonexperimental (correlational or survey) and experimental research.
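As a concrete instance of the bivariate case, the following minimal Python sketch (ours, not the book's, which demonstrates IBM SPSS and SAS; the data are hypothetical) computes the Pearson product-moment correlation coefficient just mentioned:

```python
# A minimal sketch (not from the book) of the prototypical bivariate
# statistic named above: the Pearson product-moment correlation between
# two variables, e.g., amount of education and income.

from math import sqrt

def pearson_r(x, y):
    """Correlation of two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dev_x = [xi - mean_x for xi in x]
    dev_y = [yi - mean_y for yi in y]
    covariance = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    return covariance / sqrt(sum(dx ** 2 for dx in dev_x) *
                             sum(dy ** 2 for dy in dev_y))

education = [12, 14, 16, 16, 18, 20]   # hypothetical years of education
income = [30, 42, 50, 47, 65, 74]      # hypothetical income in $1000s

print(f"r = {pearson_r(education, income):.2f}")  # strong positive relationship
```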
1.1.2 Experimental and Nonexperimental Research

A critical distinction between experimental and nonexperimental research is whether the researcher manipulates the levels of the IVs. In an experiment, the researcher has control over the levels (or conditions) of at least one IV to which a participant is exposed by determining what the levels are, how they are implemented, and how and when cases are assigned and exposed to them. Further, the experimenter randomly assigns cases to levels of the IV and controls all other influential factors by holding them constant, counterbalancing, or randomizing their influence. Scores on the DV are expected to be the same, within random variation, except for the influence of the IV (Shadish, Cook, and Campbell, 2002). If there are systematic differences in the DV associated with levels of the IV, these differences are attributed to the IV.

For example, if groups of undergraduates are randomly assigned to the same material but different types of teaching techniques, and afterward some groups of undergraduates perform better than others, the difference in performance is said, with some degree of confidence, to be caused by the difference in teaching technique. In this type of research, the terms independent and dependent have obvious meaning: the value of the DV depends on the manipulated level of the IV. The IV is manipulated by the experimenter and the score on the DV depends on the level of the IV.
In nonexperimental (correlational or survey) research, the levels of the IV(s) are not manipulated by the researcher. The researcher can define the IV, but has no control over the assignment of cases to levels of it. For example, groups of people may be categorized into geographic area of residence (Northeast, Midwest, etc.), but only the definition of the variable is under researcher control. Except for the military or prison, place of residence is rarely subject to manipulation by a researcher. Nevertheless, a naturally occurring difference like this is often considered an IV and is used to predict some other nonexperimental (dependent) variable such as income. In this type of research, the distinction between IVs and DVs is usually arbitrary and many researchers prefer to call IVs predictors and DVs criterion variables.

In nonexperimental research, it is very difficult to attribute causality to an IV. If there is a systematic difference in a DV associated with levels of an IV, the two variables are said (with some degree of confidence) to be related, but the cause of the relationship is unclear. For example, income as a DV might be related to geographic area, but no causal association is implied.

Nonexperimental research takes many forms, but a common example is the survey. Typically, many people are surveyed, and each respondent provides answers to many questions, producing a large number of variables. These variables are usually interrelated in highly complex ways, but univariate and bivariate statistics are not sensitive to this complexity. Bivariate correlations between all pairs of variables, for example, could not reveal that the 20 to 25 variables measured really represent only two or three "supervariables." If a research goal is to distinguish among subgroups in a sample (e.g., between Catholics and Protestants) on the basis of a variety of attitudinal variables, we could use several univariate t tests (or analyses of variance) to examine group differences on each variable separately. But if the variables are related, which is highly likely, the results of many t tests are misleading and statistically suspect.

With the use of multivariate statistical techniques, complex interrelationships among variables are revealed and assessed in statistical inference. Further, it is possible to keep the overall Type I error rate at, say, 5%, no matter how many variables are tested.

Although most multivariate techniques were developed for use in nonexperimental research, they are also useful in experimental research, in which there may be multiple IVs and multiple DVs. With multiple IVs, the research is usually designed so that the IVs are independent of each other and a straightforward correction for numerous statistical tests is available (see Chapter 3). With multiple DVs, a problem of inflated error rate arises if each DV is tested separately. Further, at least some of the DVs are likely to be correlated with each other, so separate tests of each DV reanalyze some of the same variance. Therefore, multivariate tests are used.

Experimental research designs with multiple DVs were unusual at one time. Now, however, with attempts to make experimental designs more realistic, and with the availability of computer programs, experiments often have several DVs. It is dangerous to run an experiment with only one DV and risk missing the impact of the IV because the most sensitive DV is not measured.
Multivariate statistics help the experimenter design more efficient and more realistic experiments by allowing measurement of multiple DVs without violation of acceptable levels of Type I error. One of the few considerations not relevant to choice of statistical technique is whether the data are experimental or correlational. The statistical methods "work" whether the researcher manipulated the levels of the IV or not. But attribution of causality to results is crucially affected by the experimental-nonexperimental distinction.
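The inflation of Type I error across many separate tests, and the kind of straightforward correction referred to above (see Chapter 3), can be shown with a little arithmetic. This is a minimal Python sketch of ours, not the book's, under the simplifying assumption that the tests are independent:

```python
# A minimal sketch (not from the book) of why many separate univariate
# tests inflate the overall Type I error rate, and how a Bonferroni-style
# adjustment restores the familywise rate to about .05.

alpha = 0.05          # per-test Type I error rate
n_tests = 10          # e.g., separate t tests on 10 DVs

# If the tests were independent, the chance of at least one false
# rejection somewhere among them is 1 - (1 - alpha)^n_tests.
familywise = 1 - (1 - alpha) ** n_tests
print(f"Familywise error rate, {n_tests} tests at alpha = {alpha}: "
      f"{familywise:.3f}")                      # about .401

# Bonferroni adjustment: test each DV at alpha / n_tests so the
# familywise rate stays near the nominal .05.
adjusted_alpha = alpha / n_tests
familywise_adjusted = 1 - (1 - adjusted_alpha) ** n_tests
print(f"Familywise rate at per-test alpha = {adjusted_alpha}: "
      f"{familywise_adjusted:.3f}")             # about .049
```

Correlated DVs complicate this picture, which is one reason multivariate tests are preferred over a pile of corrected univariate ones.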
1.1.3 Computers and Multivariate Statistics

One answer to the question "Why multivariate statistics?" is that the techniques are now accessible by computer. Only the most dedicated number cruncher would consider doing real-life-sized problems in multivariate statistics without a computer. Fortunately, excellent multivariate programs are available in a number of computer packages. Two packages are demonstrated in this book. Examples are based on programs in IBM SPSS and SAS.
If you have access to both packages, you are indeed fortunate. Programs within the packages do not completely overlap, and some problems are better handled through one package than the other. For example, doing several versions of the same basic analysis on the same set of data is particularly easy with IBM SPSS, whereas SAS has the most extensive capabilities for saving derived scores from data screening or from intermediate analyses. Chapters 5 through 17 (the chapters that cover the specialized multivariate techniques) offer explanations and illustrations of a variety of programs² within each package and a comparison of the features of the programs. We hope that once you understand the techniques, you will be able to generalize to virtually any multivariate program.

Recent versions of the programs are available in Windows, with menus that implement most of the techniques illustrated in this book. All of the techniques may be implemented through syntax, and syntax itself is generated through menus. Then you may add or change syntax as desired for your analysis. For example, you may "paste" menu choices into a syntax window in IBM SPSS, edit the resulting text, and then run the program. Also, syntax generated by IBM SPSS menus is saved in the "journal" file (statistics.jnl), which may also be accessed and copied into a syntax window. Syntax generated by SAS menus is recorded in a "log" file. The contents may then be copied to an interactive window, edited, and run. Do not overlook the help files in these programs. Indeed, SAS and IBM SPSS now provide the entire set of user manuals online, often with more current information than is available in printed manuals. Our IBM SPSS demonstrations in this book are based on syntax generated through menus whenever feasible. We would love to show you the sequence of menu choices, but space does not permit. And, for the sake of parsimony, we have edited program output to illustrate the material that we feel is the most important for interpretation.

With commercial computer packages, you need to know which version of the package you are using. Programs are continually being changed, and not all changes are immediately implemented at each facility. Therefore, many versions of the various programs are simultaneously in use at different institutions; even at one institution, more than one version of a package is sometimes available. Program updates are often corrections of errors discovered in earlier versions. Sometimes, a new version will change the output format but not its information. Occasionally, though, there are major revisions in one or more programs or a new program is added to the package. Sometimes defaults change with updates, so that the output looks different although syntax is the same. Check to find out which version of each package you are using. Then, if you are using a printed manual, be sure that the manual you are using is consistent with the version in use at your facility. Also check updates for error correction in previous releases that may be relevant to some of your previous runs. Except where noted, this book reviews Windows versions of IBM SPSS Version 24 and SAS Version 9.4. Information on availability and versions of software, macros, books, and the like changes almost daily. We recommend the Internet as a source of "keeping up."
² We have retained descriptions of features of SYSTAT (Version 13) in these sections, despite the removal of detailed demonstrations of that program in this edition.

1.1.4 Garbage In, Roses Out?

The trick in multivariate statistics is not in computation. This is easily done as discussed above. The trick is to select reliable and valid measurements, choose the appropriate program, use it correctly, and know how to interpret the output. Output from commercial computer programs, with their beautifully formatted tables, graphs, and matrices, can make garbage look like roses. Throughout this book, we try to suggest clues that reveal when the true message in the output more closely resembles the fertilizer than the flowers.

Second, when you use multivariate statistics, you rarely get as close to the raw data as you do when you apply univariate statistics to a relatively few cases. Errors and anomalies in the data that would be obvious if the data were processed by hand are less easy to spot when processing is entirely by computer. But the computer packages have programs to graph and describe your data in the simplest univariate terms and to display bivariate relationships among your variables. As discussed in Chapter 4, these programs provide preliminary analyses that are absolutely necessary if the results of multivariate programs are to be believed.

There are also certain costs associated with the benefits of using multivariate procedures. Benefits of increased flexibility in research design, for instance, are sometimes paralleled by increased ambiguity in interpretation of results. In addition, multivariate results can be quite sensitive to which analytic strategy is chosen (cf. Section 1.2.4) and do not always provide better protection against statistical errors than their univariate counterparts. Add to this the fact that occasionally you still cannot get a firm statistical answer to your research questions, and you may wonder if the increase in complexity and difficulty is warranted. Frankly, we think it is. Slippery as some of the concepts and procedures are, these statistics provide insights into relationships among variables that may more closely resemble the complexity of the "real" world. And sometimes you get at least partial answers to questions that could not be asked at all in the univariate framework. For a complete analysis, making sense of your data usually requires a judicious mix of multivariate and univariate statistics. The addition of multivariate statistical methods to your repertoire makes data analysis a lot more fun. If you liked univariate statistics, you will love multivariate statistics!³

³ Don't even think about it.
1.2 Some Useful Definitions

In order to describe multivariate statistics easily, it is useful to review some common terms in research design and basic statistics. Distinctions were made between IVs and DVs and between experimental and nonexperimental research in preceding sections. Additional terms that are encountered repeatedly in the book but not necessarily related to each other are described in this section.
1.2.1 Continuous, Discrete, and Dichotomous Data

In applying statistical techniques of any sort, it is important to consider the type of measurement and the nature of the correspondence between the numbers and the events that they represent. The distinction made here is among continuous, discrete, and dichotomous variables; you may prefer to substitute the terms interval or quantitative for continuous and nominal, categorical or qualitative for dichotomous and discrete.

Continuous variables are measured on a scale that changes values smoothly rather than in steps. Continuous variables take on any values within the range of the scale, and the size of the number reflects the amount of the variable. Precision is limited by the measuring instrument, not by the nature of the scale itself. Some examples of continuous variables are time as measured on an old-fashioned analog clock face, annual income, age, temperature, distance, and grade point average (GPA).

Discrete variables take on a finite and usually small number of values, and there is no smooth transition from one value or category to the next. Examples include time as displayed by a digital clock, continents, categories of religious affiliation, and type of community (rural or urban).

Sometimes discrete variables are used in multivariate analyses as if continuous if there are numerous categories and the categories represent a quantitative attribute. For instance, a variable that represents age categories (where, say, 1 stands for 0 to 4 years, 2 stands for 5 to 9 years, 3 stands for 10 to 14 years, and so on up through the normal age span) can be used because there are a lot of categories and the numbers designate a quantitative attribute (increasing age).
But the same numbers used to designate categories of religious affiliation are not in appropriate form for analysis with many of the techniques4 because religions do not fall along a quantitative continuum.

Discrete variables composed of qualitatively different categories are sometimes analyzed after being changed into a number of dichotomous or two-level variables (e.g., Catholic vs. non-Catholic, Protestant vs. non-Protestant, Jewish vs. non-Jewish, and so on until the degrees of freedom are used). Recategorization of a discrete variable into a series of dichotomous ones is called dummy variable coding; a brief sketch of the idea in code appears at the end of this section. The conversion of a discrete variable into a series of dichotomous ones is done to limit the relationship between the dichotomous variables and others to linear relationships. A discrete variable with more than two categories can have a relationship of any shape with another variable, and the relationship is changed arbitrarily if the assignment of numbers to categories is changed. Dichotomous variables, however, with only two points, can have only linear relationships with other variables; they are, therefore, appropriately analyzed by methods using correlation in which only linear relationships are analyzed.

The distinction between continuous and discrete variables is not always clear. If you add enough digits to the digital clock, for instance, it becomes for all practical purposes a continuous measuring device, whereas time as measured by the analog device can also be read in discrete categories such as hours or half hours. In fact, any continuous measurement may be rendered discrete (or dichotomous) with some loss of information, by specifying cutoffs on the continuous scale.

The property of variables that is crucial to the application of multivariate procedures is not the type of measurement so much as the shape of distribution, as discussed in Chapter 4 and in discussions of tests of assumptions in Chapters 5 through 17. Non-normally distributed continuous variables and dichotomous variables with very uneven splits between the categories present problems to several of the multivariate analyses. This issue and its resolution are discussed at some length in Chapter 4.

Another type of measurement that is used sometimes produces a rank order scale. This scale assigns a number to each case to indicate the case's position vis-a-vis other cases along some dimension. For instance, ranks are assigned to contestants (first place, second place, third place, etc.) to provide an indication of who is the best, but not by how much. A problem with rank order measures is that their distributions are rectangular (one frequency per number) instead of normal, unless tied ranks are permitted and they pile up in the middle of the distribution.

In practice, we often treat variables as if they are continuous when the underlying scale is thought to be continuous, but the measured scale actually is rank order, the number of categories is large (say, seven or more), and the data meet other assumptions of the analysis. For instance, the number of correct items on an objective test is technically not continuous because fractional values are not possible, but it is thought to measure some underlying continuous variable such as course mastery.

4 Some multivariate techniques (e.g., logistic regression, SEM) are appropriate for all types of variables.
Another example of a variable with ambiguous measurement is one measured on a Likert-type scale, in which consumers rate their attitudes toward a product as "strongly like," "moderately like," "mildly like," "neither like nor dislike," "mildly dislike," "moderately dislike," or "strongly dislike." As mentioned previously, even dichotomous variables may be treated as if continuous under some conditions. Thus, we often use the term continuous throughout the remainder of this book, whether the measured scale itself is continuous or the variable is to be treated as if continuous. We use the term discrete for variables with a few categories, whether the categories differ in type or quantity.
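To make dummy variable coding concrete, here is a minimal sketch in Python with the pandas library (our illustration; this book's demonstrations use IBM SPSS and SAS, and the variable and category names here are hypothetical):

import pandas as pd

# A hypothetical discrete variable with qualitatively different categories
religion = pd.Series(["Catholic", "Protestant", "Jewish", "Protestant",
                      "Catholic", "Jewish"])

# Recategorize into a series of dichotomous (0/1) variables; drop_first=True
# stops at (number of categories - 1) dummies, once the degrees of freedom
# are used up
dummies = pd.get_dummies(religion, prefix="rel", drop_first=True)
print(dummies)

Each resulting dichotomous column can have only a linear relationship with other variables, which is the point of the recoding.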
1.2.2 Samples and Populations

Samples are measured to make generalizations about populations. Ideally, samples are selected, usually by some random process, so that they represent the population of interest. In real life, however, populations are frequently best defined in terms of samples, rather than vice versa; the population is the group from which you were able to randomly sample.
Sampling has somewhat different connotations in nonexperimental and experimental research. In nonexperimental research, you investigate relationships among variables in some predefined population. Typically, you take elaborate precautions to ensure that you have achieved a representative sample of that population; you define your population, and then do your best to randomly sample from it.5

In experimental research, you attempt to create different populations by treating subgroups from an originally homogeneous group differently. The sampling objective here is to ensure that all cases come from the same population before you treat them differently. Random sampling consists of randomly assigning cases to treatment groups (levels of the IV) to ensure that, before differential treatment, all subsamples come from the same population. Statistical tests provide evidence as to whether, after treatment, all samples still come from the same population. Generalizations about treatment effectiveness are made to the type of individuals who participated in the experiment.

5 Strategies for random sampling are discussed in many sources, including Levy and Lemeshow (2009), Rea and Parker (1997), and de Vaus (2002).
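As a small sketch of random assignment (ours, in Python; not a procedure from the packages reviewed here), the following fragment assigns 30 hypothetical cases to three treatment groups of equal size:

import numpy as np

rng = np.random.default_rng(seed=42)
case_ids = np.arange(1, 31)            # identifiers for 30 hypothetical cases
shuffled = rng.permutation(case_ids)   # put the cases in random order
groups = np.split(shuffled, 3)         # three equal groups = levels of the IV
for level, members in enumerate(groups, start=1):
    print("Treatment", level, ":", np.sort(members))

Before treatment, all three subsamples come from the same population; post-treatment differences can then be attributed to the IV.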
1.2.3 Descriptive and Inferential Statistics

Descriptive statistics describe samples of cases in terms of variables or combinations of variables. Inferential statistical techniques test hypotheses about differences in populations on the basis of measurements made on samples of cases. If statistically significant differences are found, descriptive statistics are then used to provide estimations of central tendency, and the like, in the population. Descriptive statistics used in this way are called parameter estimates.

Use of inferential and descriptive statistics is rarely an either-or proposition. We are usually interested in both describing and making inferences about a data set. We describe the data, find statistically significant differences or relationships, and estimate population values for those findings. However, there are more restrictions on inference than there are on description. Many assumptions of multivariate statistical methods are necessary only for inference. If simple description of the sample is the major goal, many assumptions are relaxed, as discussed in Chapters 5 through 17.
1.2.4 Orthogonality: Standard and Sequential Analyses

Orthogonality is a perfect nonassociation between variables. If two variables are orthogonal, knowing the value of one variable gives no clue as to the value of the other; the correlation between them is zero.

Orthogonality is often desirable in statistical applications. For instance, factorial designs for experiments are orthogonal when two or more IVs are completely crossed with equal sample sizes in each combination of levels. Except for use of a common error term, tests of hypotheses about main effects and interactions are independent of each other; the outcome of each test gives no hint as to the outcome of the others. In orthogonal experimental designs with random assignment of cases, manipulation of the levels of the IV, and good controls, changes in value of the DV can be unambiguously attributed to various main effects and interactions.

Similarly, in multivariate analyses, there are advantages if sets of IVs or DVs are orthogonal. If all pairs of IVs in a set are orthogonal, each IV adds, in a simple fashion, to prediction of the DV. Consider income as a DV with education and occupational prestige as IVs. If education and occupational prestige are orthogonal, and if 35% of the variability in income may be predicted from education and a different 45% is predicted from occupational prestige, then 80% of the variance in income (the DV, Y) is predicted from education and occupational prestige together.

Orthogonality can easily be illustrated in Venn diagrams, as shown in Figure 1.1. Venn diagrams represent shared variance (or correlation) as overlapping areas between two (or more) circles. The total variance for income is one circle.
Figure 1.1 Venn diagram for Y (income), X1 (education), and X2 (occupational prestige).

The section with horizontal stripes represents the part of income predictable from education (the first IV, X1), and the section with vertical stripes represents the part predictable from occupational prestige (the second IV, X2); the circle for education overlaps the circle for income 35% and the circle for occupational prestige overlaps 45%. Together, they account for 80% of the variability in income because education and occupational prestige are orthogonal and do not themselves overlap. There are similar advantages if a set of DVs is orthogonal. The overall effect of an IV can be partitioned into effects on each DV in an additive fashion.

Usually, however, the variables are correlated with each other (nonorthogonal). IVs in nonexperimental designs are often correlated naturally; in experimental designs, IVs become correlated when unequal numbers of cases are measured in different cells of the design. DVs are usually correlated because individual differences among participants tend to be consistent over many attributes.

When variables are correlated, they have shared or overlapping variance. In the example of Figure 1.2, education and occupational prestige correlate with each other. Although the independent contribution made by education is still 35% and that by occupational prestige is 45%, their joint contribution to prediction of income is not 80%, but rather something smaller due to the overlapping area shown by the arrow in Figure 1.2(a).

A major decision for the multivariate analyst is how to handle the variance that is predictable from more than one variable. Many multivariate techniques have at least two strategies for handling it, but some have more. In standard analysis, the overlapping variance contributes to the size of summary statistics of the overall relationship but is not assigned to either variable. Overlapping variance is disregarded in assessing the contribution of each variable to the solution. Figure 1.2(a) is a Venn diagram of a standard analysis in which overlapping variance is shown as overlapping areas in circles; the unique contributions of X1 and X2 to prediction of Y are shown as horizontal and vertical areas, respectively, and the total relationship between Y and the combination of X1 and X2 is those two areas plus the area with the arrow. If X1 is education and X2 is occupational prestige, then in standard analysis, X1 is "credited with" the area marked by the horizontal lines and X2 by the area marked by vertical lines. Neither of the IVs is assigned the area designated with the arrow. When X1 and X2 substantially overlap each other, very little horizontal or vertical area may be left for either of them, despite the fact that they are both related to Y. They have essentially knocked each other out of the solution.
Figure 1.2 Standard (a) and sequential (b) analyses of the relationship between Y, X1, and X2. Horizontal shading depicts variance assigned to X1. Vertical shading depicts variance assigned to X2. In panel (a), an arrow marks the area representing variance that contributes to the solution but is assigned to neither variable.
Sequential analyses differ, in that the researcher assigns priority for entry of variables into equations, and the first one to enter is assigned both unique variance and any overlapping variance it has with other variables. Lower-priority variables are then assigned, on entry, their unique and any remaining overlapping variance. Figure 1.2(b) shows a sequential analysis for the same case as Figure 1.2(a), where X1 (education) is given priority over X2 (occupational prestige). The total variance explained is the same as in Figure 1.2(a), but the relative contributions of X1 and X2 have changed; education now shows a stronger relationship with income than in the standard analysis, whereas the relation between occupational prestige and income remains the same.

The choice of strategy for dealing with overlapping variance is not trivial. If variables are correlated, the overall relationship remains the same, but the apparent importance of variables to the solution changes depending on whether a standard or a sequential strategy is used. If the multivariate procedures have a reputation for unreliability, it is because solutions change, sometimes dramatically, when different strategies for entry of variables are chosen. However, the strategies also ask different questions of the data, and it is incumbent on the researcher to determine exactly which question to ask. We try to make the choices clear in the chapters that follow.
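The consequences of the two strategies can be sketched numerically. The following Python fragment (our illustration with simulated data, not output from any package discussed in this book) computes R-squared for X1 entered first, then the increment for X2, and then repeats with the order reversed; with correlated predictors, the total R-squared is identical but the credit each IV receives depends on entry order:

import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)                       # say, education
x2 = 0.6 * x1 + rng.normal(size=n)            # say, occupational prestige, correlated with x1
y = 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)  # say, income

def r_squared(y, *predictors):
    # R-squared from ordinary least squares with an intercept
    X = np.column_stack([np.ones(len(y)), *predictors])
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

r2_total = r_squared(y, x1, x2)    # the same regardless of entry order
r2_x1_first = r_squared(y, x1)     # X1 is credited with its unique AND overlapping variance
r2_x2_first = r_squared(y, x2)     # X2 gets the overlap instead
print("total R2:", round(r2_total, 3))
print("X1 first:", round(r2_x1_first, 3), "| X2 then adds:", round(r2_total - r2_x1_first, 3))
print("X2 first:", round(r2_x2_first, 3), "| X1 then adds:", round(r2_total - r2_x2_first, 3))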
1.3 Linear Combinations of Variables

Multivariate analyses combine variables to do useful work, such as predict scores or predict group membership. The combination that is formed depends on the relationships among the variables and the goals of analysis, but in most cases, the combination is linear. A linear combination is one in which each variable is assigned a weight (e.g., W1), and then the products of weights and the variable scores are summed to predict a score on a combined variable. In Equation 1.1, Y' (the predicted DV) is predicted by a linear combination of X1 and X2 (the IVs):

$$Y' = W_1 X_1 + W_2 X_2 \qquad (1.1)$$
If, for example, Y' is predicted income, X1 is education, and X2 is occupational prestige, the best prediction of income is obtained by weighting education (X1) by W1 and occupational prestige (X2) by W2 before summing. No other values of W1 and W2 produce as good a prediction of income.

Notice that Equation 1.1 includes neither X1 nor X2 raised to powers (exponents) nor a product of X1 and X2. This seems to severely restrict multivariate solutions until one realizes that X1 could itself be a product of two different variables or a single variable raised to a power. For example, X1 might be education squared. A multivariate solution does not produce exponents or cross-products of IVs to improve a solution, but the researcher can include Xs that are cross-products of IVs or are IVs raised to powers. Inclusion of variables raised to powers or cross-products of variables has both theoretical and practical implications for the solution. Berry (1993) provides a useful discussion of many of the issues.

The size of the W values (or some function of them) often reveals a great deal about the relationship between DVs and IVs. If, for instance, the W value for some IV is zero, the IV is not needed in the best DV-IV relationship. Or if some IV has a large W value, then the IV tends to be important to the relationship. Although complications (to be explained later) prevent interpretation of the multivariate solution from the sizes of the W values alone, they are nonetheless important in most multivariate procedures.

The combination of variables can be considered a supervariable, not directly measured but worthy of interpretation. The supervariable may represent an underlying dimension that predicts something or optimizes some relationship. Therefore, the attempt to understand the meaning of the combination of IVs is worthwhile in many multivariate analyses.

In the search for the best weights to apply in combining variables, computers do not try out all possible sets of weights. Various algorithms have been developed to compute the weights.
Most algorithms involve manipulation of a correlation matrix, a variance-covariance matrix, or a sum-of-squares and cross-products matrix. Section 1.6 describes these matrices in very simple terms and shows their development from a very small data set. Appendix A describes some terms and manipulations appropriate to matrices. In the fourth sections of Chapters 5 through 17, a small hypothetical sample of data is analyzed by hand to show how the weights are derived for each analysis. Though this information is useful for a basic understanding of multivariate statistics, it is not necessary for applying multivariate techniques fruitfully to your research questions and may, sadly, be skipped by those who are math averse.
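In code, Equation 1.1 is a single line. This Python sketch uses invented weights and scores; a real analysis would estimate the Ws from the data:

import numpy as np

education = np.array([12, 16, 14, 20])   # hypothetical scores on X1
prestige = np.array([40, 65, 50, 80])    # hypothetical scores on X2
W1, W2 = 1.2, 0.8                        # hypothetical weights

# Each variable is weighted, then the products are summed (Equation 1.1)
y_predicted = W1 * education + W2 * prestige
print(y_predicted)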
1.4 Number and Nature of Variables to Include

Attention to the number of variables included in an analysis is important. A general rule is to get the best solution with the fewest variables. As more and more variables are included, the solution usually improves, but only slightly. Sometimes the improvement does not compensate for the cost in degrees of freedom of including more variables, so the power of the analyses diminishes. A second problem is overfitting. With overfitting, the solution is very good; so good, in fact, that it is unlikely to generalize to a population. Overfitting occurs when too many variables are included in an analysis relative to the sample size; a small simulation of the problem appears at the end of this section. With smaller samples, very few variables can be analyzed. Generally, a researcher should include only a limited number of uncorrelated variables in each analysis,6 fewer with smaller samples. We give guidelines for the number of variables that can be included relative to sample size in the third sections of Chapters 5 through 17.

Additional considerations for inclusion of variables in a multivariate analysis include cost, availability, meaning, and theoretical relationships among the variables. Except in analysis of structure, one usually wants a small number of valid, cheaply obtained, easily available, uncorrelated variables that assess all the theoretically important dimensions of a research area. Another important consideration is reliability. How stable is the position of a given score in a distribution of scores when measured at different times or in different ways? Unreliable variables degrade an analysis, whereas reliable ones enhance it. A few reliable variables give a more meaningful solution than a large number of less reliable variables. Indeed, if variables are sufficiently unreliable, the entire solution may reflect only measurement error. Further considerations for variable selection are mentioned as they apply to each analysis.

6 The exceptions are analyses of structure, such as factor analysis, in which numerous correlated variables are measured.
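Overfitting is easy to demonstrate by simulation. In this Python sketch (ours, not from the packages discussed in this book), predictors of pure noise are added one at a time to a small sample; the sample R-squared climbs steadily even though none of the predictors could generalize to the population:

import numpy as np

rng = np.random.default_rng(7)
n = 20                                    # a deliberately small sample
y = rng.normal(size=n)                    # a DV unrelated to every predictor
X = np.ones((n, 1))                       # design matrix: intercept only, to start
for k in range(1, 16):
    X = np.column_stack([X, rng.normal(size=n)])   # add one pure-noise predictor
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
    print(f"{k:2d} noise predictors: sample R-squared = {r2:.2f}")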
1.5 Statistical Power

A critical issue in designing any study is whether there is adequate power. Power, as you may recall, represents the probability that effects that actually exist have a chance of producing statistical significance in your eventual data analysis. For example, do you have a large enough sample size to show a significant relationship between GRE and GPA if the actual relationship is fairly large? What if the relationship is fairly small? Is your sample large enough to reveal significant effects of treatment on your DV(s)? Relationships among power and errors of inference are discussed in Chapter 3.

Issues of power are best considered in the planning stage of a study when the researcher determines the required sample size. The researcher estimates the size of the anticipated effect (e.g., an expected mean difference), the variability expected in assessment of the effect, the desired alpha level (ordinarily .05), and the desired power (often .80). These four estimates are required to determine the necessary sample size. Failure to consider power in the planning stage often results in failure to find a significant effect (and an unpublishable study). The interested reader may wish to consult Cohen (1988), Rossi (1990), Sedlmeier and Gigerenzer (1989), or Murphy, Myors, and Wolach (2014) for more detail.
There is a great deal of software available to help you estimate the power available with various sample sizes for various statistical techniques, and to help you determine necessary sample size given a desired level of power (e.g., an 80% probability of achieving a significant result if an effect exists) and expected sizes of relationships. One of these programs that estimates power for several techniques is NCSS PASS (Hintze, 2017). Many other programs are reviewed (and sometimes available as shareware) on the Internet. Issues of power relevant to each of the statistical techniques are discussed in Chapters 5 through 17.
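As one illustration of such software (a sketch only, not an endorsement over the programs named above), the statsmodels library in Python will solve for the sample size needed in a two-group comparison from the same four estimates, with variability folded into a standardized effect size:

from statsmodels.stats.power import TTestIndPower

# Solve for n per group: medium standardized effect (d = .5), alpha = .05, power = .80
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(round(n_per_group))   # roughly 64 cases per group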
1.6 Data Appropriate for Multivariate Statistics

An appropriate data set for multivariate statistical methods consists of values on a number of variables for each of several participants or cases. For continuous variables, the values are scores on variables. For example, if the continuous variable is the GRE, the values for the various participants are scores such as 500, 420, and 650. For discrete variables, values are number codes for group membership or treatment. For example, if there are three teaching techniques, students who receive one technique are arbitrarily assigned a "1," those receiving another technique are assigned a "2," and those receiving the third technique are assigned a "3."
1.6.1 The Data Matrix

The data matrix is an organization of scores in which rows (lines) represent participants and columns represent variables. An example of a data matrix with six participants7 and four variables is given in Table 1.1. For example, X1 might be type of teaching technique, X2 score on the GRE, X3 GPA, and X4 gender, with women coded 1 and men coded 2.

Data are entered into a data file with long-term storage accessible by computer in order to apply computer techniques to them. Each participant starts with a new row (line). Information identifying the participant is typically entered first, followed by the value of each variable for that participant. Scores for each variable are entered in the same order for each student. If there are more data for each case than can be accommodated on a single line, the data are continued on additional lines, but all of the data for each case are kept together. All of the computer package manuals provide information on setting up a data matrix.

In this example, there are values for every variable for each student. This is not always the case with research in the real world. With large numbers of cases and variables, scores are frequently missing on some variables for some cases. For instance, respondents may refuse to answer some kinds of questions, or some students may be absent the day when a particular test is given, and so forth. This creates missing values in the data matrix. To deal with missing values, first build a data file in which some symbol is used to indicate that a value on a variable is missing in data for a case. The various programs have standard symbols, such as a dot (.), for this purpose. You can also use other symbols, but it is often just as convenient to use one of the default symbols. Once the data set is available, consult Chapter 4 for various options to deal with this messy (but often unavoidable) problem.

7 Normally, of course, there are many more than six cases.
Table 1.1 A Data Matrix of Hypothetical Scores

Student    X1     X2      X3     X4
   1        1     500    3.20     1
   2        1     420    2.50     2
   3        2     650    3.90     1
   4        2     550    3.50     2
   5        3     480    3.30     1
   6        3     600    3.25     2
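As a sketch of how such a matrix might be built outside the packages featured in this book, the following Python/pandas fragment enters the Table 1.1 scores and marks one hypothetical missing GRE value with NaN, pandas' default missing-value symbol (the missing score is our invention for illustration):

import numpy as np
import pandas as pd

data = pd.DataFrame(
    {"X1": [1, 1, 2, 2, 3, 3],                    # teaching technique
     "X2": [500, 420, np.nan, 550, 480, 600],     # GRE; student 3 hypothetically missing
     "X3": [3.20, 2.50, 3.90, 3.50, 3.30, 3.25],  # GPA
     "X4": [1, 2, 1, 2, 1, 2]},                   # gender: 1 = women, 2 = men
    index=pd.RangeIndex(1, 7, name="Student"))

print(data)
print(data.isna().sum())   # count of missing values on each variable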
Table 1.2 Correlation Matrix for Part of Hypothetical Data for Table 1.1

             X2      X3      X4
      X2    1.00     .85    -.13
R =   X3     .85    1.00    -.46
      X4    -.13    -.46    1.00
1.6.2 The Correlation Matrix

Most readers are familiar with R, a correlation matrix. R is a square, symmetrical matrix. Each row (and each column) represents a different variable, and the value at the intersection of each row and column is the correlation between the two variables. For instance, the value at the intersection of the second row, third column, is the correlation between the second and the third variables. The same correlation also appears at the intersection of the third row, second column. Thus, correlation matrices are said to be symmetrical about the main diagonal, which means they are mirror images of themselves above and below the diagonal from top left to bottom right. Hence, it is common practice to show only the bottom half or the top half of an R matrix. The entries in the main diagonal are often omitted as well, since they are all ones, that is, correlations of variables with themselves.8

Table 1.2 shows the correlation matrix for X2, X3, and X4 of Table 1.1. The value .85 is the correlation between X2 and X3, and it appears twice in the matrix (as do other values). Other correlations are as indicated in the table.

Many programs allow the researcher a choice between analysis of a correlation matrix and analysis of a variance-covariance matrix. If the correlation matrix is analyzed, a unit-free result is produced. That is, the solution reflects the relationships among the variables but not in the metric in which they are measured. If the metric of the scores is somewhat arbitrary, analysis of R is appropriate.
1.6.3 The Variance-Covariance Matrix

If scores are measured along a meaningful scale, it is sometimes appropriate to analyze a variance-covariance matrix. A variance-covariance matrix, Σ, is also square and symmetrical, but the elements in the main diagonal are the variances of each variable, and the off-diagonal elements are covariances between pairs of different variables.

Variances, as you recall, are averaged squared deviations of each score from the mean of the scores. Since the deviations are averaged, the number of scores included in computation of a variance is not relevant, but the metric in which the scores are measured is relevant. Scores measured in large numbers tend to have large numbers as variances, and scores measured in small numbers tend to have small variances.

Covariances are averaged cross-products (the product of the deviation between one variable and its mean and the deviation between a second variable and its mean). Covariances are similar to correlations except that they, like variances, retain information concerning the scales in which the variables are measured. The variance-covariance matrix for the continuous data in Table 1.1 appears in Table 1.3.
Table 1.3 Variance-Covariance Matrix for Part of Hypothetical Data of Table 1.1

               X2        X3      X4
      X2    7026.66    32.80   -6.00
Σ =   X3      32.80      .21    -.12
      X4      -6.00     -.12     .30
8 Alternatively, other information such as standard deviations is inserted.
1.6.4 The Sum-of-Squares and Cross-Products Matrix

The matrix S is a precursor to the variance-covariance matrix in which deviations are not yet averaged. Thus, the size of the entries depends on the number of cases as well as on the metric in which the elements were measured. The sum-of-squares and cross-products matrix for X2, X3, and X4 in Table 1.1 appears in Table 1.4.

The entry in the major diagonal of the matrix S is the sum of squared deviations of scores from the mean for that variable, hence, "sum of squares," or SS. That is, for each variable, the value in the major diagonal is
$$SS(X_j) = \sum_{i=1}^{N} (X_{ij} - \bar{X}_j)^2 \qquad (1.2)$$

where

i = 1, 2, . . . , N
N = the number of cases
j = the variable identifier
X_{ij} = the score on variable j by case i
\bar{X}_j = the mean of all scores on the jth variable

For example, for X4, the mean is 1.5. The sum of squared deviations around the mean, which is also the diagonal value for the variable, is

$$\sum_{i=1}^{6} (X_{i4} - \bar{X}_4)^2 = (1 - 1.5)^2 + (2 - 1.5)^2 + (1 - 1.5)^2 + (2 - 1.5)^2 + (1 - 1.5)^2 + (2 - 1.5)^2 = 1.50$$
The off-diagonal elements of the sum-of-squares and cross-products matrix are the cross-products, the sums of products (SP) of the variables. For each pair of variables, represented by row and column labels in Table 1.4, the entry is the sum of the product of the deviation of one variable around its mean times the deviation of the other variable around its mean:

$$SP(X_j X_k) = \sum_{i=1}^{N} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k) \qquad (1.3)$$

where j identifies the first variable, k identifies the second variable, and all other terms are as defined in Equation 1.2. (Note that if j = k, Equation 1.3 becomes identical to Equation 1.2.)

For example, the cross-product term for variables X2 and X3 is

$$\sum_{i=1}^{N} (X_{i2} - \bar{X}_2)(X_{i3} - \bar{X}_3) = (500 - 533.33)(3.20 - 3.275) + (420 - 533.33)(2.50 - 3.275) + \cdots + (600 - 533.33)(3.25 - 3.275) = 164.00$$

Most computations start with S and proceed to Σ or R. The progression from a sum-of-squares and cross-products matrix to a variance-covariance matrix is simple:

$$\Sigma = \frac{1}{N - 1}\,S \qquad (1.4)$$
Table 1.4 Sum-of-Squares and Cross-Products Matrix for Part of Hypothetical Data of Table 1.1

                X2         X3       X4
      X2    35133.33     164.00   -30.00
S =   X3      164.00       1.05    -0.58
      X4      -30.00      -0.58     1.50
The variance-covariance matrix is produced by dividing every element in the sum-of-squares and cross-products matrix by N - 1, where N is the number of cases.

The correlation matrix is derived from an S matrix by dividing each sum-of-squares by itself (to produce the 1s in the main diagonal of R) and each cross-product of the S matrix by the square root of the product of the sum-of-squared deviations around the mean for each of the variables in the pair. That is, each cross-product is divided by

$$\text{Denominator}(X_j X_k) = \sqrt{\sum_i (X_{ij} - \bar{X}_j)^2 \sum_i (X_{ik} - \bar{X}_k)^2} \qquad (1.5)$$

where terms are defined as in Equation 1.3.
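The whole progression of Equations 1.2 through 1.5 can be verified in a few lines. This Python sketch (our own check, not output from the packages discussed in this book) reproduces Tables 1.2, 1.3, and 1.4 from the data matrix of Table 1.1:

import numpy as np

# Columns are X2, X3, and X4 of Table 1.1
X = np.array([[500, 3.20, 1],
              [420, 2.50, 2],
              [650, 3.90, 1],
              [550, 3.50, 2],
              [480, 3.30, 1],
              [600, 3.25, 2]])
N = X.shape[0]

dev = X - X.mean(axis=0)            # deviations from each variable's mean
S = dev.T @ dev                     # sums of squares and cross-products (Table 1.4)
Sigma = S / (N - 1)                 # Equation 1.4: variance-covariance matrix (Table 1.3)
ss = np.diag(S)                     # the sums of squares
R = S / np.sqrt(np.outer(ss, ss))   # each entry divided per Equation 1.5 (Table 1.2)

print(np.round(S, 2))
print(np.round(Sigma, 2))
print(np.round(R, 2))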
For some multivariate operations, it is not necessary to feed the data matrix to a computer program. Instead, an S or an R matrix is entered, with each row (representing a variable) starting a new line. Often, considerable computing time and expense are saved by entering one or the other of these matrices rather than raw data.
1.6.5 Residuals

Often a goal of analysis, or a test of its efficiency, is its ability to reproduce the values of a DV or the correlation matrix of a set of variables. For example, we might want to predict scores on the GRE (X2) of Table 1.1 from knowledge of GPA (X3) and gender (X4). After applying the proper statistical operations, a multiple regression in this case, a predicted GRE score for each student is computed by applying the proper weights for GPA and gender to the GPA and gender scores for each student. But because we already obtained GRE scores for the sample of students, we are able to compare the predicted score with the obtained GRE score. The difference between the predicted and obtained values is known as the residual and is a measure of error of prediction.

In most analyses, the residuals for the entire sample sum to zero. That is, sometimes the prediction is too large and sometimes it is too small, but the average of all the errors is zero. The squared value of the residuals, however, provides a measure of how good the prediction is. When the predictions are close to the obtained values, the squared errors are small. The way that the residuals are distributed is of further interest in evaluating the degree to which the data meet the assumptions of multivariate analyses, as discussed in Chapter 4 and elsewhere.
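A minimal sketch of the residual idea, again using the Table 1.1 scores in Python (a toy regression for illustration, not the multiple regression analyses of later chapters):

import numpy as np

gre = np.array([500, 420, 650, 550, 480, 600])        # X2, the DV
gpa = np.array([3.20, 2.50, 3.90, 3.50, 3.30, 3.25])  # X3
gender = np.array([1, 2, 1, 2, 1, 2])                 # X4

# Ordinary least squares: predict GRE from GPA and gender (with an intercept)
X = np.column_stack([np.ones(6), gpa, gender])
b = np.linalg.lstsq(X, gre, rcond=None)[0]
residuals = gre - X @ b                               # obtained minus predicted

print(np.round(residuals, 2))
print("sum of residuals:", round(residuals.sum(), 8))           # essentially zero
print("sum of squared residuals:", round(np.sum(residuals ** 2), 2))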
1.7 Organization of the Book

Chapter 2 gives a guide to the multivariate techniques that are covered in this book and places them in context with the more familiar univariate and bivariate statistics where possible. Chapter 2 includes a flow chart that organizes statistical techniques on the basis of the major research questions asked. Chapter 3 provides a brief review of univariate and bivariate statistical techniques for those who are interested. Chapter 4 deals with the assumptions and limitations of multivariate statistical methods. Assessment and violation of assumptions are discussed, along with alternatives for dealing with violations when they occur. The reader is guided back to Chapter 4 frequently in Chapters 5 through 17.

Chapters 5 through 17 cover specific multivariate techniques. They include descriptive, conceptual sections as well as a guided tour through a real-world data set for which the analysis is appropriate. The tour includes an example of a Results section describing the outcome of the statistical analysis, appropriate for submission to a professional journal. Each technique chapter includes a comparison of computer programs. You may want to vary the order in which you cover these chapters.

Chapter 18 is an attempt to integrate univariate, bivariate, and multivariate statistics through the multivariate general linear model. The common elements underlying all the techniques are emphasized, rather than the differences among them. Chapter 18 is meant to pull together the material in the remainder of the book with a conceptual rather than pragmatic emphasis. Some may wish to consider this material earlier, for instance, immediately after Chapter 2.
Chapter 2
A Guide to Statistical Techniques: Using the Book

Learning Objectives
2.1 Determine statistical techniques based on the type of research questions
2.2 Determine when to use specific analytical strategies
2.3 Use a decision tree to determine selection of statistical techniques
2.4 Outline the organization of the technique chapters
2.5 Summarize the primary requirements before selecting a statistical technique
2.1 Research Questions and Associated Techniques

This chapter organizes the statistical techniques in this book by major research questions. A decision tree at the end of this chapter leads you to an appropriate analysis for your data. On the basis of your major research question and a few characteristics of your data set, you determine which statistical technique(s) is appropriate.

The first and the most important criterion for choosing a technique is the major research question to be answered by the statistical analysis. Here, the research questions are categorized into degree of relationship among variables, significance of group differences, prediction of group membership, structure, and questions that focus on the time course of events. This chapter emphasizes differences in research questions answered by the different techniques, described in nontechnical terms, whereas Chapter 18 provides an integrated overview of the techniques with some basic equations used in the multivariate general linear model.1
2.1.1 Degree of Relationship Among Variables

If the major purpose of analysis is to assess the associations among two or more variables, some form of correlation/regression or chi-square is appropriate. The choice among five different statistical techniques is made by determining the number of independent and dependent variables, the nature of the variables (continuous or discrete), and whether any of the independent variables (IVs) are best conceptualized as covariates.2
1 You may find it helpful to read Chapter 18 now instead of waiting for the end.
2 If the effects of some IVs are assessed after the effects of other IVs are statistically removed, the latter are called covariates.
2.1.1.1 BIVARIATE R
Bivariate correlation and regression, as reviewed in Chapter 3, assess the degree of relationship between two continuous variables, such as belly dancing skill and years of musical training. Bivariate correlation measures the association between two variables with no distinction necessary between IV and DV (dependent variable). Bivariate regression, on the other hand, predicts a score on one variable from knowledge of the score on another variable (e.g., predicts skill in belly dancing as measured by a single index, such as knowledge of steps, from a single predictor, such as years of musical training). The predicted variable is considered the DV, whereas the predictor is considered the IV. Bivariate correlation and regression are not multivariate techniques, but they are integrated into the general linear model in Chapter 18.

2.1.1.2 MULTIPLE R
Multiple correlation assesses the degree to which one continuous variable (the DV) is related to a set of other (usually) continuous variables (the IVs) that have been combined to create a new, composite variable. Multiple correlation is a bivariate correlation between the original DV and the composite variable created from the IVs. For example, how large is the association between belly dancing skill and the set of IVs, such as years of musical training, body flexibility, and age?

Multiple regression is used to predict the score on the DV from scores on several IVs. In the preceding example, belly dancing skill measured by knowledge of steps is the DV (as it is for bivariate regression), and we have added body flexibility and age to years of musical training as IVs. Other examples are prediction of success in an educational program from scores on a number of aptitude tests, prediction of the sizes of earthquakes from a variety of geological and electromagnetic variables, or stock market behavior from a variety of political and economic variables.

As for bivariate correlation and regression, multiple correlation emphasizes the degree of relationship between the DV and the IVs, whereas multiple regression emphasizes the prediction of the DV from the IVs. In multiple correlation and regression, the IVs may or may not be correlated with each other. With some ambiguity, the techniques also allow assessment of the relative contribution of each of the IVs toward predicting the DV, as discussed in Chapter 5.

2.1.1.3 SEQUENTIAL R
In sequential (sometimes called hierarchical) multiple regression, IVs are given priorities by the researcher before their contributions to the prediction of the DV are assessed. For example, the researcher might first assess the effects of age and flexibility on belly dancing skill before looking at the contribution that years of musical training makes to that skill. Differences among dancers in age and flexibility are statistically "removed" before assessment of the effects of years of musical training.

In the example of an educational program, success of outcome (e.g., grade on a final exam) might first be predicted from variables such as age and IQ. Then scores on various aptitude tests are added to see if prediction of final exam grade is enhanced after adjustment for age and IQ.

In general, then, the effects of IVs that enter first are assessed and removed before the effects of IVs that enter later are assessed.
For each IV in a sequential multiple regression, higher-priority IVs act as covariates for lower-priority IVs. The degree of relationship between the DV and the IVs is reassessed at each step of the sequence. That is, multiple correlation is recomputed as each new IV (or set of IVs) is added. Sequential multiple regression, then, is also useful for developing a reduced set of IVs (if that is desired) by determining when IVs no longer add to predictability. Sequential multiple regression is discussed in Chapter 5.

2.1.1.4 CANONICAL R
In canonical correlation, there are several continuous DVs as well as several continuous IVs, and the goal is to assess the relationship between the two sets of variables. For example, we might study the relationship between a number of indices of belly dancing skill (the DVs, such as knowledge of steps, ability to play finger cymbals, and responsiveness to the music) and the IVs (such as flexibility, musical training, and age). Thus, canonical correlation adds DVs (e.g., further indices of belly dancing skill) to the single index of skill used in bivariate and multiple correlations, so that there are multiple DVs as well as multiple IVs in canonical correlation.
Or we might ask whether there is a relationship among achievements in arithmetic, reading, and spelling as measured in elementary school and a set of variables reflecting early childhood development (e.g., ages at first speech, walking, and toilet training). Such research questions are answered by canonical correlation, the subject of Chapter 12.

2.1.1.5 MULTIWAY FREQUENCY ANALYSIS
A goal of multiway frequency analysis is to assess relationships among discrete variables where none is considered a DV. For example, you might be interested in the relationships among gender, occupational category, and preferred type of reading material. Or the research question might involve relationships among gender, categories of religious affiliation, and attitude toward abortion. Chapter 16 deals with multiway frequency analysis.

When one of the variables is considered a DV with the rest serving as IVs, multiway frequency analysis is called logit analysis, as described in Section 2.1.3.3.

2.1.1.6 MULTILEVEL MODELING
In many research applications, cases are nested in (normally occurring) groups, which may, in turn, be nested in other groups. The quintessential example is students nested in classrooms, which are, in turn, nested in schools. (Another common example involves repeated measures where, e.g., scores are nested within students who are, in turn, nested in classrooms, which, in turn, are nested in schools.) However, students in the same classroom are likely to have scores that correlate more highly than those of students in general. This creates problems with an analysis that pools all students into one very large group, ignoring classroom and school designations. Multilevel modeling (Chapter 15) is a somewhat complicated but increasingly popular strategy for analyzing data in these situations.
2.1.2 Significance of Group Differences

When participants are randomly assigned to groups (treatments), the major research question usually is the extent to which statistically significant mean differences on DVs are associated with group membership. Once significant differences are found, the researcher often assesses the degree of relationship (effect size or strength of association) between IVs and DVs. The research question also is applicable to naturally formed groups. The choice among techniques hinges on the number of IVs and DVs and whether some variables are conceptualized as covariates. Further distinctions are made as to whether all DVs are measured on the same scale and how within-subjects IVs are to be treated.

2.1.2.1 ONE-WAY ANOVA AND t TEST
The two statistics, reviewed in Chapter 3, one-way analysis of variance (ANOVA) and t test, are strictly univariate in nature and are adequately covered in most standard statistical texts.

2.1.2.2 ONE-WAY ANCOVA
One-way analysis of covariance (ANCOVA) is designed to assess group differences on a single DV after the effects of one or more covariates are statistically removed. Covariates are chosen because of their known association with the DV; otherwise, there is no point in using them. For example, age and degree of reading disability are usually related to the outcome of a program of educational therapy (the DV). If groups are formed by randomly assigning children to different types of educational therapies (the IV), it is useful to remove differences in age and degree of reading disability before examining the relationship between outcome and type of therapy. Prior differences among children in age and reading disability are used as covariates. The ANCOVA question is: Are there mean differences in outcome associated with type of educational therapy after adjusting for differences in age and degree of reading disability?

ANCOVA gives a more powerful look at the IV-DV relationship by minimizing error variance (cf. Chapter 3). The stronger the relationship between the DV and the covariate(s), the greater the power of ANCOVA over ANOVA. ANCOVA is discussed in Chapter 6.

ANCOVA is also used to adjust for differences among groups when groups are naturally occurring and random assignment to them is not possible. For example, one might ask if attitude toward abortion (the DV) varies as a function of religious affiliation. However, it is not
possible to randomly assign people to religious affiliation. In this situation, there could easily be other systematic differences among groups, such as level of education, that are also related to attitude toward abortion. Apparent differences among religious groups might well be due to differences in education rather than differences in religious affiliation. To get a "purer" measure of the relationship between attitude and religious affiliation, attitude scores are first adjusted for educational differences, that is, education is used as a covariate. Chapter 6 also discusses this somewhat problematical use of ANCOVA.

When there are more than two groups, planned or post hoc comparisons are available in ANCOVA just as in ANOVA. With ANCOVA, selected and/or pooled group means are adjusted for differences on covariates before differences in means on the DV are assessed.

2.1.2.3 FACTORIAL ANOVA
Factorial ANOVA, reviewed in Chapter 3, is the subject of numerous statistics texts (e.g., Brown, Michels, and Winer, 1991; Keppel and Wickens, 2004; Myers and Well, 2002; Tabachnick and Fidell, 2007) and is introduced in most elementary texts. Although there is only one DV in factorial ANOVA, its place within the general linear model is discussed in Chapter 18.

2.1.2.4 FACTORIAL ANCOVA
Factorial ANCOVA differs from one-way ANCOVA only in that there is more than one IV. The desirability and the use of covariates are the same. For instance, in the educational therapy example of Section 2.1.2.2, another interesting IV might be gender of the child. The effects of gender, the type of educational therapy, and their interaction on the outcome are assessed after adjusting for age and prior degree of reading disability. The interaction of gender with type of therapy asks if boys and girls differ as to which type of educational therapy is more effective after adjustment for covariates.

2.1.2.5 HOTELLING'S T²
Hotelling's T² is used when the IV has only two groups and there are several DVs. For example, there might be two DVs, such as score on an academic achievement
test and attention span in the classroom, and two levels of type of educational therapy, emphasis on perceptual training versus emphasis on academic training. It is not legitimate to use separate t tests for each DV to look for differences between groups because that inflates Type I error due to unnecessary multiple significance tests with (likely) correlated DVs. Instead, Hotelling's T² is used to see if groups differ on the two DVs combined. The researcher asks if there are nonchance differences in the centroids (averages on the combined DVs) for the two groups.

Hotelling's T² is a special case of multivariate analysis of variance, just as the t test is a special case of univariate analysis of variance, when the IV has only two groups. Multivariate analysis of variance is discussed in Chapter 7.

2.1.2.6 ONE-WAY MANOVA
Multivariate analysis of variance (MANOVA) evaluates differences among centroids (composite means) for a set of DVs when there are two or more levels of an IV (groups). MANOVA is useful for the educational therapy example in the preceding section with two groups and also when there are more than two groups (e.g., if a nontreatment control group is added).

With more than two groups, planned and post hoc comparisons are available. For example, if a main effect of treatment is found in MANOVA, it might be interesting to ask post hoc if there are differences in the centroids of the two groups given different types of educational therapies, ignoring the control group, and, possibly, if the centroid of the control group differs from the centroid of the two educational therapy groups combined.

Any number of DVs may be used; the procedure deals with correlations among them, and the entire analysis is accomplished within the preset level for Type I error. Once statistically significant differences are found, techniques are available to assess which DVs are influenced by which IV. For example, assignment to treatment group might affect the academic DV but not attention span.

MANOVA is also available when there are within-subjects IVs. For example, children might be measured on both DVs three times: 3, 6, and 9 months after therapy begins. MANOVA is discussed in Chapter 7 and a special case of it (profile analysis, in which the within-subjects IV is treated multivariately) in Chapter 8. Profile analysis is an alternative to one-way
between-subjects MANOVA when the DVs are all measured on the same scale. Discriminant analysis is an alternative to one-way between-subjects designs, as described in Section 2.1.3.1 and Chapter 9.

2.1.2.7 ONE-WAY MANCOVA
In addition to dealing with multiple DVs, multivariate analysis of variance can be applied to problems when there are one or more covariates. In this case, MANOVA becomes multivariate analysis of covariance (MANCOVA). In the educational therapy example of Section 2.1.2.6, it might be worthwhile to adjust the DV scores for pretreatment differences in academic achievement and attention span. Here the covariates are pretests of the DVs, a classic use of covariance analysis. After adjustment for pretreatment scores, differences in posttest scores (DVs) can be more clearly attributed to treatment (the two types of educational therapies plus control group that make up the IV).

In the one-way ANCOVA example of religious groups in Section 2.1.2.2, it might be interesting to test political liberalism versus conservatism and attitude toward ecology, as well as attitude toward abortion, to create three DVs. Here again, differences in attitudes might be associated with both differences in religion and differences in education (which, in turn, varies with religious affiliation). In the context of MANCOVA, education is the covariate, religious affiliation the IV, and attitudes the DVs. Differences in attitudes among groups with different religious affiliations are assessed after adjustment for differences in education.

If the IV has more than two levels, planned and post hoc comparisons are useful, with adjustment for covariates. MANCOVA (Chapter 7) is available for both the main analysis and the comparisons.

2.1.2.8 FACTORIAL MANOVA
Factorial MANOVA is the extension of MANOVA to designs with more than one IV and multiple DVs. For example, gender (a between-subjects IV) might be added to the type of educational therapy (another between-subjects IV) with both academic achievement and attention span used as DVs. In this case, the analysis is a two-way between-subjects factorial MANOVA that provides tests of the main effects of gender and type of educational therapy and their interaction on the centroids of the DVs.

Duration of therapy (3, 6, and 9 months) might be added to the design as a within-subjects IV with type of educational therapy a between-subjects IV to examine the effects of duration, the type of educational therapy, and their interaction on the DVs. In this case, the analysis is a factorial MANOVA with one between- and one within-subjects IV.

Comparisons can be made among margins or cells in the design, and the influence of various effects on combined or individual DVs can be assessed. For instance, the researcher might plan (or decide post hoc) to look for linear trends in scores associated with duration of therapy for each type of therapy separately (the cells) or across all types of therapies (the margins). The search for linear trend could be conducted among the combined DVs or separately for each DV with appropriate adjustments for Type I error rate.

Virtually any complex ANOVA design (cf. Chapter 3) with multiple DVs can be analyzed through MANOVA, given access to appropriate computer programs. Factorial MANOVA is covered in Chapter 7.

2.1.2.9 FACTORIAL MANCOVA
It is sometimes desirable to incorporate one or more covariates into a factorial MANOVA design to produce factorial MANCOVA.
For example, pretest scores on academic achievement and attention span could serve as covariates for the two-way between-subjects design with gender and type of educational therapy serving as IVs and posttest scores on academic achievement and attention span serving as DVs. The two-way between-subjects MANCOVA provides tests of gender, type of educational therapy, and their interaction on adjusted, combined centroids for the DVs.

Here again, procedures are available for comparisons among groups or cells and for evaluating the influences of IVs and their interactions on the various DVs. Factorial MANCOVA is discussed in Chapter 7.

2.1.2.10 PROFILE ANALYSIS OF REPEATED MEASURES
A special form of MANOVA is available when all of the DVs are measured on the same scale (or on scales with the same
psychometric properties) and you want to know if groups differ on the scales. For example, you might use the subscales of the Profile of Mood States as DVs to assess whether mood profiles differ between a group of belly dancers and a group of ballet dancers.

There are two ways to conceptualize this design. The first is as a one-way between-subjects design in which the IV is the type of dancer and the DVs are the Mood States subscales; one-way MANOVA provides a test of the main effect of type of dancer on the combined DVs. The second way is as a profile study with one grouping variable (type of dancer) and the several subscales; profile analysis provides tests of the main effects of type of dancer and of subscales as well as their interaction (frequently the effect of greatest interest to the researcher).

If there is a grouping variable and a repeated measure such as trials in which the same DV is measured several times, there are three ways to conceptualize the design. The first is as a one-way between-subjects design with several DVs (the score on each trial); MANOVA provides a test of the main effect of the grouping variable. The second is as a two-way between- and within-subjects design; ANOVA provides tests of groups, trials, and their interaction, but with some very restrictive assumptions that are likely to be violated. Third is as a profile study, in which profile analysis provides tests of the main effects of groups and trials and their interaction, but without the restrictive assumptions. This is sometimes called the multivariate
approach to repeated-measures ANOVA.

Finally, you might have a between- and within-subjects design (groups and trials), in which several DVs are measured on each trial. For example, you might assess groups of belly and ballet dancers on the Mood States subscales at various points in their training. This application of profile analysis is frequently referred to as doubly multivariate. Chapter 8 deals with all these forms of profile analysis.
2.1.3 Prediction of Group Membership

In research where groups are identified, the emphasis is frequently on predicting group membership from a set of variables. Discriminant analysis, logit analysis, and logistic regression are designed to accomplish this prediction. Discriminant analysis tends to be used when all IVs are continuous and nicely distributed, logit analysis when IVs are all discrete, and logistic regression when IVs are a mix of continuous and discrete and/or poorly distributed.
2.1.3.1 ONE-WAY DISCRIMINANT ANALYSIS In one-way discriminant analysis, the goal is to predict membership in groups (the DV) from a set of IVs. For example, the researcher might want to predict category of religious affiliation from attitude toward abortion, liberalism versus conservatism, and attitude toward ecological issues. The analysis tells us if group membership is predicted at a rate that is significantly better than chance. Or the researcher might try to discriminate belly dancers from ballet dancers from scores on Mood States subscales.

These are the same questions as those addressed by MANOVA, but turned around. Group membership serves as the IV in MANOVA and the DV in discriminant analysis. If groups differ significantly on a set of variables in MANOVA, the set of variables significantly predicts group membership in discriminant analysis. One-way between-subjects designs can be fruitfully analyzed through either procedure and are often best analyzed with a combination of both procedures.

As in MANOVA, there are techniques for assessing the contribution of various IVs to the prediction of group membership. For example, the major source of discrimination among religious groups might be abortion attitude, with little predictability contributed by political and ecological attitudes. In addition, discriminant analysis offers classification procedures to evaluate how well individual cases are classified into their appropriate groups on the basis of their scores on the IVs. One-way discriminant analysis is covered in Chapter 9.

2.1.3.2 SEQUENTIAL ONE-WAY DISCRIMINANT ANALYSIS Sometimes IVs are assigned priorities by the researcher, so their effectiveness as predictors of group membership is evaluated in the established order in sequential discriminant analysis. For example, when attitudinal
variables are predictors of religious affiliation, variables might be prioritized according to their expected contribution to prediction, with abortion attitude given the highest priority, political liberalism versus conservatism second priority, and ecological attitude the lowest priority. Sequential discriminant analysis first assesses the degree to which religious affiliation is predicted from abortion attitude at a better-than-chance rate. Gain in prediction is then assessed with the addition of political attitude, and then with the addition of ecological attitude.

Sequential analysis provides two types of useful information. First, it is helpful in eliminating predictors that do not contribute more than predictors already in the analysis. For example, if political and ecological attitudes do not add appreciably to abortion attitude in predicting religious affiliation, they can be dropped from further analysis. Second, sequential discriminant analysis is a covariance analysis. At each step of the hierarchy, higher-priority predictors are covariates for lower-priority predictors. Thus, the analysis permits you to assess the contribution of a predictor with the influence of other predictors removed.

Sequential discriminant analysis is also useful for evaluating sets of predictors. For example, if a set of continuous demographic variables is given higher priority than an attitudinal set in prediction of group membership, one can see if attitudes significantly add to prediction after adjustment for demographic differences. Sequential discriminant analysis is discussed in Chapter 9. However, it is usually more efficient to answer such questions through sequential logistic regression, particularly when some of the predictor variables are continuous and others discrete (see Section 2.1.3.5).

2.1.3.3 MULTIWAY FREQUENCY ANALYSIS (LOGIT) The logit form of multiway frequency analysis may be used to predict group membership when all of the predictors are discrete. For example, you might want to predict whether someone is a belly dancer or not (the DV) from knowledge of gender, occupational category, and preferred type of reading material (science fiction, romance, history, or statistics). This technique allows evaluation of the odds that a case is in one group (e.g., belly dancer) based on membership in various categories of predictors (e.g., female professors who read science fiction). This form of multiway frequency analysis is discussed in Chapter 16.

2.1.3.4 LOGISTIC REGRESSION Logistic regression allows prediction of group membership when predictors are continuous, discrete, or a combination of the two. Thus, it is an alternative to both discriminant analysis and logit analysis. For example, prediction of whether someone is a belly dancer may be based on gender, occupational category, preferred type of reading material, and age. Logistic regression allows one to evaluate the odds (or probability) of membership in one of the groups (e.g., belly dancer) based on the combination of values of the predictor variables (e.g., 35-year-old female professors who read science fiction). Chapter 10 covers logistic regression analysis.

2.1.3.5 SEQUENTIAL LOGISTIC REGRESSION As in sequential discriminant analysis, sometimes predictors are assigned priorities and then assessed in terms of their contribution to prediction of group membership given their priority.
For example, one can assess how well the preferred type of reading material predicts whether someone is a belly dancer after adjusting for differences associated with age, gender, and occupational category. Sequential logistic regression is also covered in Chapter 10.

2.1.3.6 FACTORIAL DISCRIMINANT ANALYSIS If groups are formed on the basis of more than one attribute, prediction of group membership from a set of IVs can be performed through factorial discriminant analysis. For example, respondents might be classified on the basis of both gender and religious affiliation. One could use attitudes toward abortion, politics, and ecology to predict gender (ignoring religion) or religion (ignoring gender), or both gender and religion. But this is the same problem as addressed by factorial MANOVA. For a number of reasons, programs designed for discriminant analysis do not readily extend to factorial arrangements of groups. Unless some special conditions are met (cf. Chapter 9), it is usually better to rephrase the research question so that factorial MANOVA can be used.
2.1.3.7 SEQUENTIAL FACTORIAL DISCRIMINANT ANALYSIS Difficulties inherent in factorial discriminant analysis extend to sequential arrangements of predictors. Usually, however, questions of interest can readily be rephrased in terms of factorial MANCOVA.
2.1.4 Structure

Another set of questions is concerned with the latent structure underlying a set of variables. Depending on whether the search for structure is empirical or theoretical, the choice is principal components, factor analysis, or structural equation modeling. Principal components is an empirical approach, whereas factor analysis and structural equation modeling tend to be theoretical approaches.

2.1.4.1 PRINCIPAL COMPONENTS If scores on numerous variables are available from a group of participants, the researcher might ask if and how the variables group together. Can the variables be combined into a smaller number of supervariables on which the participants differ? For example, suppose people are asked to rate the effectiveness of numerous behaviors for coping with stress (e.g., talking to a friend, going to a movie, jogging, and making lists of ways to solve the problem). The numerous behaviors may be empirically related to just a few basic coping mechanisms, such as increasing or decreasing social contact, engaging in physical activity, and instrumental manipulation of stress producers.

Principal components analysis uses the correlations among the variables to develop a small set of components that empirically summarizes the correlations among the variables. It provides a description of the relationship rather than a theoretical analysis. This analysis is discussed in Chapter 13.

2.1.4.2 FACTOR ANALYSIS When there is a theory about underlying structure or when the researcher wants to understand underlying structure, factor analysis is often used. In this case, the researcher believes that responses to many different questions are driven by just a few underlying structures called factors. In the example of mechanisms for coping with stress, one might hypothesize ahead of time that there are two major factors: general approach to problems (escape vs. direct confrontation) and the use of social supports (withdrawing from people vs. seeking them out).

Factor analysis is useful in developing and assessing theories. What is the structure of personality? Are there some basic dimensions of personality on which people differ? By collecting scores from many people on numerous variables that may reflect different aspects of personality, researchers address questions about underlying structure through factor analysis, as discussed in Chapter 13.

2.1.4.3 STRUCTURAL EQUATION MODELING Structural equation modeling combines factor analysis, canonical correlation, and multiple regression. Like factor analysis, some of the variables can be latent, whereas others are directly observed. Like canonical correlation, there can be many IVs and many DVs. And similar to multiple regression, the goal may be to study the relationships among many variables. For example, one may want to predict birth outcome (the DVs) from several demographic, personality, and attitudinal measures (the IVs). The DVs are a mix of several observed variables such as birth weight, a latent assessment of mother's acceptance of the child based on several measured attitudes, and a latent assessment of infant responsiveness; the IVs are several demographic variables such as socioeconomic status, race, and income, several latent IVs based on personality measures, and prebirth attitudes toward parenting. The technique evaluates whether the model provides a reasonable fit to the data and the contribution of each of the IVs to the DVs.
Comparisons among alternative models, as well as evaluation of differences between groups, are also possible. Chapter 14 covers structural equation modeling.
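To make the principal components idea of Section 2.1.4.1 concrete, the following sketch extracts components from a correlation matrix by eigendecomposition. The matrix R holds made-up correlations among four hypothetical coping items; it is illustrative only, not data from the text.

import numpy as np

# Hypothetical correlation matrix among four coping items (invented values)
R = np.array([[1.0, 0.6, 0.2, 0.1],
              [0.6, 1.0, 0.1, 0.2],
              [0.2, 0.1, 1.0, 0.5],
              [0.1, 0.2, 0.5, 1.0]])

eigenvalues, eigenvectors = np.linalg.eigh(R)        # eigendecomposition of R
order = np.argsort(eigenvalues)[::-1]                # largest component first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

loadings = eigenvectors * np.sqrt(eigenvalues)       # correlations of items with components
print(eigenvalues / eigenvalues.sum())               # proportion of variance per component

The first components account for most of the variance here, which is the empirical summary principal components analysis is after.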
2.1.5 Time Course of Events

Two techniques focus on the time course of events. Survival/failure analysis asks how long it takes for something to happen. Time-series analysis looks at the change in a DV over the course of time.
2.1.5.1 SURVIVAL/FAILURE ANALYSIS Survival/failure analysis is a family of techniques dealing with the time it takes for something to happen: a cure, a failure, an employee leaving, a relapse, a death, and so on. For example, what is the life expectancy of someone diagnosed with breast cancer? Is the life expectancy longer with chemotherapy? Or, in the context of failure analysis, what is the expected time before a hard disk fails? Do DVDs last longer than CDs? Two major varieties of survival/failure analysis are life tables, which describe the course of survival of one or more groups of cases (e.g., DVDs and CDs), and determination of whether survival time is influenced by some variables in a set. The latter technique encompasses a set of regression techniques in which the DV is the survival time. Chapter 11 covers this analysis.

2.1.5.2 TIME-SERIES ANALYSIS Time-series analysis is used when the DV is measured over a very large number of time periods (at least 50); time is the major IV. Time-series analysis is used to forecast future events (stock market indices, crime statistics, etc.) based on a long series of past events. Time-series analysis is also used to evaluate the effect of an intervention, such as implementation of a water-conservation program, by observing water usage for many periods before and after the intervention. Chapter 17 covers this analysis.
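As a minimal sketch of the life-table idea in Section 2.1.5.1, the following function computes a product-limit (Kaplan-Meier-style) survival curve from event and censoring times. The function name and the tiny data set are invented for illustration; the text's own survival examples are analyzed in Chapter 11.

import numpy as np

def product_limit(times, observed):
    # times: time to event or to censoring for each case
    # observed: 1 if the event occurred, 0 if the case was censored
    times = np.asarray(times, dtype=float)
    observed = np.asarray(observed)
    surviving = 1.0
    curve = []
    for t in np.unique(times[observed == 1]):          # distinct event times
        at_risk = np.sum(times >= t)                   # cases still being followed at t
        events = np.sum((times == t) & (observed == 1))
        surviving *= (at_risk - events) / at_risk      # product-limit update
        curve.append((t, surviving))
    return curve

# Five cases: events at times 2, 3, and 5; censoring at 3 and 8
print(product_limit([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]))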
2.2 Some Further Comparisons

When assessing the degree of relationship among variables, bivariate r is appropriate when only two variables (one DV and one IV) are involved, while multiple R is appropriate when there are several variables on the IV side (one DV and several IVs). The multivariate analysis adjusts for correlations that are likely present among the IVs. Canonical correlation is available to study the relationship between several DVs and several IVs, adjusting for correlations among all of them. These techniques are usually applied to continuous (and dichotomous) variables. When all variables are discrete, multiway frequency analysis (vastly expanded chi square) is the choice.

Numerous analytic strategies are available to study mean differences among groups, depending on whether there is a single DV or multiple DVs, and whether there are covariates. The familiar ANOVA (and ANCOVA) is used with a single DV, while MANOVA (and MANCOVA) is used when there are multiple DVs. Essentially, MANOVA uses weights to combine multiple DVs into a new DV and then performs ANOVA.

A third important issue when studying mean differences among groups is whether there are repeated measures (the familiar within-subjects ANOVA). You may recall the restrictive and often-violated assumption of sphericity with this type of ANOVA. The two multivariate extensions of repeated-measures ANOVA (profile analysis of repeated measures and doubly multivariate profile analysis) circumvent this assumption by combining the DVs; MANOVA combines different DVs, while profile analysis combines the same DV measured repeatedly. Another variation of profile analysis (here called profile analysis of repeated measures) is a multivariate extension of the familiar "mixed" (between-within-subjects) ANOVA. None of the multivariate extensions is usually as powerful as its univariate "parent" if the assumptions of the parent are met.

The DV in both discriminant analysis and logistic regression is a discrete variable. In discriminant analysis, the IVs are usually continuous variables. A complication arises with discriminant analysis when the DV has more than two groups because there can be as many ways to distinguish the groups from each other as there are degrees of freedom for the DV. For example, if there are three levels of the DV, there are two degrees of freedom and therefore two potential ways to combine the IVs to separate the levels of the DV. The first combination might, for instance, separate members of the first group from the second and third groups (but not those two groups from each other); the second combination might, then, separate members of group two from group three. Those of you familiar with comparisons in ANOVA probably recognize this as a familiar process for working with more than two groups; the difference is that in ANOVA you create the comparison coefficients used in the analysis, while in discriminant analysis, the analysis tells you how the groups are best discriminated from each other (if they are).
Logistic regression analyzes a discrete DV too, but the IVs are often a mix of continuous and discrete variables. For that reason, the goal is to predict the probability that a case will fall into various levels of the DV rather than group membership per se. In this way, the analysis closely resembles the familiar chi-square analysis. In logistic regression, as in all multivariate techniques, the IVs are combined, but in an exponent rather than directly (a small sketch of this appears at the end of this section). That makes the analyses conceptually more difficult, but well worth the effort, especially in the medical/biological sciences where the risk ratios, a product of logistic regression, are routinely discussed.

There are several procedures for examining structure (that become increasingly "speculative"). Two very closely aligned techniques are principal components and factor analyses. These techniques are interesting because there is no DV (or, for that matter, IVs). Instead, there is just a bunch of variables, with the goal of analysis to discover which of them "go" together. The idea is that some latent, underlying structure (e.g., several different factors representing components of personality) is driving similar responses to correlated sets of questions. The trick for the researcher is to divine the "meaning" of the factors that are developed during analysis. The technique of principal components provides an empirical solution while factor analysis provides a more theoretical solution.

Structural equation modeling combines multiple regression with factor analysis. There are one or more DVs in this technique, and the DVs and IVs can be both discrete and continuous, both latent and observed. That is, the researcher tries to predict the values on the DVs (continuous or discrete) using both observed IVs (continuous and discrete) and the latent ones (factors derived from many observed variables during the analysis). Structural equation modeling continues its rapid development, with expansion to MANOVA-like analyses, longitudinal analysis, sophisticated procedures for handling missing data, poorly distributed variables, and the like.

Multilevel modeling assesses the significance of variables where the cases are nested into different levels (e.g., students nested in classes nested in schools; patients nested in wards nested in hospitals). There is a DV at the lowest (student) level, but some IVs pertain to students, some to classes, and some to schools. The analysis takes into account the (likely) higher correlations among scores of students nested in the same class and of classes nested in the same school. Relationships (regressions) developed at one level (e.g., predicting student scores on the SAT from parental educational level) become the DVs for the next level, and so on.

Finally, we present two techniques for analyzing the time course of events, survival analysis and time-series analysis. One underlying IV for both of these is time; there may be other IVs as well. In survival analysis, the goal is often to determine whether a treated group survives longer than an untreated group, given the current standard of care. (In manufacturing, it is called failure analysis, and the goal, for instance, is to see if a part manufactured from a new alloy fails later than the part manufactured from the current alloy.)
One advantage of this technique, at least in medicine, is its ability to analyze data for cases that have disappeared for one reason or another (moved away, gone to another clinic for treatment, or died of another cause) before the end of the study; these are called censored cases.

Time-series analysis tracks the pattern of the DV over multiple measurements (at least 50) and may or may not have an IV. If there is an IV, the goal is to determine if the pattern seen in the DV over time is the same for the group in one level of the IV as for the group in the other level. The IV can be naturally occurring or manipulated.

Generally, statistics are like tools: you pick the wrench you need to do the job.
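As promised above, here is a minimal sketch of how logistic regression combines the IVs in an exponent. The intercept and coefficients are invented for illustration; in practice they are estimated from data (Chapter 10).

import math

def group_probability(intercept, coefficients, values):
    # The IVs enter a linear combination that sits in an exponent,
    # producing odds, which are then converted to a probability.
    linear = intercept + sum(b * x for b, x in zip(coefficients, values))
    odds = math.exp(linear)
    return odds / (1 + odds)

# Hypothetical equation: age and a female dummy code predict belly dancing
p = group_probability(-2.0, [0.03, 1.2], [35, 1])
print(round(p, 3))    # probability that this case is a belly dancer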
2.3 A Decision Tree

A decision tree starting with major research questions appears in Table 2.1. For each question, the choice among techniques depends on the number of IVs and DVs (sometimes an arbitrary distinction) and whether some variables are usefully viewed as covariates. The table also briefly describes analytic goals associated with some techniques.

The paths in Table 2.1 are only recommendations concerning an analytic strategy. Researchers frequently discover that they need two or more of these procedures or, even more
frequently, a judicious mix of univariate and multivariate procedures to fully answer their research questions. We recommend a flexible approach to data analysis in which both univariate and multivariate procedures are used to clarify the results.
Table 2.1
Choosing Among Statistical Techniques
MaJOr Research Question
Number (Kind) of Dependent Varoables
One (continuous)
Number (Kmd) of Independent Variables
m > 1 is better than just one imputed data set. SAS MI and MIANALYZE are demonstrated in Section 5.7.4, where guidelines are given for choice of m.

IBM SPSS MULTIPLE IMPUTATION, available in newer versions of the program, has two routines. The first analyzes patterns of missing values graphically (detailed information about patterns is available in IBM SPSS MVA, as discussed previously). The second routine creates the m multiply imputed data sets. Several IBM SPSS procedures (e.g., multiple and logistic regression) recognize these as data sets in which results are presented separately for each imputation and then for the combined data. Note that if you save the combined data set with the m multiply imputed data sets, the first imputation will be numbered 0 and contains the original data set with missing data. Delete this data set before doing any analyses.
3 Output is especially sparse for procedures other than SAS REG.
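The analysis-and-pooling step that MIANALYZE (and the IBM SPSS combined output) performs can be sketched with Rubin's rules: a parameter is estimated in each of the m imputed data sets, and the estimates are then combined. The function below is a minimal illustration with invented inputs, not the packages' own code.

import numpy as np

def pool_estimates(estimates, squared_ses):
    # estimates: the parameter estimate from each of the m imputed data sets
    # squared_ses: the squared standard error from each imputed data set
    m = len(estimates)
    pooled = np.mean(estimates)                 # pooled point estimate
    within = np.mean(squared_ses)               # within-imputation variance
    between = np.var(estimates, ddof=1)         # between-imputation variance
    total = within + (1 + 1/m) * between        # total variance (Rubin's rules)
    return pooled, np.sqrt(total)               # estimate and its pooled SE

# Five imputations of a regression coefficient (made-up numbers)
print(pool_estimates([0.42, 0.39, 0.45, 0.40, 0.44],
                     [0.010, 0.011, 0.009, 0.010, 0.012]))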
Other methods, such as hot decking, are available, but they require specialized software and have few advantages in most situations over other imputation methods offered by IBM SPSS, SAS, SOLAS, and NORM.

4.1.3.3 USING A MISSING DATA CORRELATION MATRIX Another option with randomly missing data involves analysis of a missing data correlation matrix. In this option, all available pairs of values are used to calculate each of the correlations in R. A variable with 10 missing values has all its correlations with other variables based on 10 fewer pairs of numbers. If some of the other variables also have missing values, but in different cases, the number of complete pairs of variables is further reduced. Thus, each correlation in R can be based on a different number and a different subset of cases, depending on the pattern of missing values. Because the standard error of the sampling distribution for r is based on N, some correlations are less stable than others in the same correlation matrix.

But that is not the only problem. In a correlation matrix based on complete data, the sizes of some correlations place constraints on the sizes of others. In particular,

$$r_{13}r_{23} - \sqrt{(1 - r_{13}^2)(1 - r_{23}^2)} \leq r_{12} \leq r_{13}r_{23} + \sqrt{(1 - r_{13}^2)(1 - r_{23}^2)} \tag{4.2}$$
The correlation between variables 1 and 2, r12, cannot be smaller than the value on the left or larger than the value on the right in a three-variable correlation matrix. If r13 = .60 and r23 = .40, then r12 cannot be smaller than -.49 or larger than .97. If, however, r12, r23, and r13 are all based on different subsets of cases due to missing data, the value for r12 can go out of range.

Most multivariate statistics involve calculation of eigenvalues and eigenvectors from a correlation matrix (see Appendix A). With loosened constraints on the size of correlations in a missing data correlation matrix, eigenvalues sometimes become negative. Because eigenvalues represent variance, negative eigenvalues represent something akin to negative variance. Moreover, because the total variance that is partitioned in the analysis is a constant (usually equal to the number of variables), positive eigenvalues are inflated by the size of negative eigenvalues, resulting in inflation of variance. The statistics derived under these conditions can be quite distorted. However, with a large sample and only a few missing values, eigenvalues are often all positive even if some correlations are based on slightly different pairs of cases.

Under these conditions, a missing data correlation matrix provides a reasonable multivariate solution and has the advantage of using all available data. Use of this option for the missing data problem should not be rejected out of hand but should be used cautiously with a wary eye to negative eigenvalues. A missing value correlation matrix is prepared through the PAIRWISE deletion option in some of the IBM SPSS programs. It is the default option for SAS CORR. If this is not an option of the program you want to run, then generate a missing data correlation matrix through another program for input to the one you are using.

4.1.3.4 TREATING MISSING DATA AS DATA It is possible that the fact that a value is missing is itself a very good predictor of the variable of interest in your research. If a dummy variable is created when cases with complete data are assigned 0 and cases with missing data 1, the liability of missing data could become an asset. The mean is inserted for missing values so that all cases are analyzed, and the dummy variable is used as simply another variable in analysis, as discussed by Cohen, Cohen, West, and Aiken (2003, pp. 431-451).

4.1.3.5 REPEATING ANALYSES WITH AND WITHOUT MISSING DATA If you use some method of estimating missing values or a missing data correlation matrix, consider repeating your analyses using only complete cases. This is particularly important if the data set is small, the proportion of missing values high, or data are missing in a nonrandom pattern. If the results are similar, you can have confidence in them. If they are different, however, you need to investigate the reasons for the difference, and either evaluate which result more nearly approximates "reality" or report both sets of results.
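Both halves of this argument, the bounds of Equation 4.2 and the negative eigenvalues an out-of-range correlation can produce, are easy to verify numerically. The sketch below uses the text's values r13 = .60 and r23 = .40 and then plants an impossible r12 in a three-variable "correlation" matrix.

import numpy as np

r13, r23 = 0.60, 0.40
slack = np.sqrt((1 - r13**2) * (1 - r23**2))
print(r13 * r23 - slack, r13 * r23 + slack)   # bounds from Equation 4.2: about -.49 and .97

# An out-of-range r12, possible under pairwise deletion, breaks the matrix
R = np.array([[1.00, -0.80, r13],
              [-0.80, 1.00, r23],
              [r13,   r23,  1.00]])
print(np.linalg.eigvalsh(R))   # the smallest eigenvalue is negative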
4.1.3.6 CHOOSING AMONG METHODS FOR DEALING WITH MISSING DATA The first step in dealing with missing data is to observe their pattern to try to determine whether data are randomly missing. Deletion of cases is a reasonable choice if the pattern appears random and if only a very few cases have missing data and those cases are missing data on different variables. However, if there is evidence of nonrandomness in the pattern of missing data, methods that preserve all cases for further analysis are preferred. Deletion of a variable with a lot of missing data is also acceptable as long as that variable is not critical to the analysis. Or, if the variable is important, use a dummy variable that codes the fact that the scores are missing coupled with mean substitution to preserve the variable and make it possible to analyze all cases and variables.

It is best to avoid mean substitution unless the proportion of missing values is very small and there are no other options available to you. Using prior knowledge requires a great deal of confidence on the part of the researcher about the research area and expected results. Regression methods may be implemented (with some difficulty) without specialized software but are less desirable than EM methods.

EM methods sometimes offer the simplest and most reasonable approach to imputation of missing data, as long as your preliminary analysis provides evidence that scores are missing randomly (MCAR or MAR). Use of an EM covariance matrix, if the technique permits it as input, provides a less-biased analysis than a data set with imputed values. However, unless the EM program provides appropriate standard errors (as per the SEM or MLM programs of Chapters 14 and 15 or NORM), the strategy should be limited to data sets in which there is not a great deal of missing data, and inferential results (e.g., p values) are interpreted with caution. EM is especially appropriate for techniques that do not rely on inferential statistics, such as exploratory factor analysis (Chapter 13). Better yet is to incorporate EM methods into multiple imputation.

Multiple imputation is currently considered the most respectable method of dealing with missing data. It has the advantage of not requiring MCAR (and perhaps not even MAR) and can be used for any form of GLM analysis, such as regression, analysis of variance (ANOVA), and logistic regression. The problem is that it is more difficult to implement and may not provide the full richness of output that is typical with other methods.

Using a missing data correlation matrix is tempting if your software offers it as an option for your analysis because it requires no extra steps. It makes most sense to use when missing data are scattered over variables, and there are no variables with a lot of missing values. The vagaries of missing data correlation matrices should be minimized as long as the data set is large and missing values are few.

Repeating analyses with and without missing data is highly recommended whenever any imputation method or a missing data correlation matrix is used and the proportion of missing values is high, especially if the data set is small.
4.1.4 Outliers

An outlier is a case with such an extreme value on one variable (a univariate outlier) or such a strange combination of scores on two or more variables (a multivariate outlier) that it distorts statistics. Consider, for example, the bivariate scatterplot of Figure 4.1, in which several regression lines, all with slightly different slopes, provide a good fit to the data points inside the swarm. But when the data point labeled A in the upper right-hand portion of the scatterplot is also considered, the regression coefficient that is computed is the one from among the several good alternatives that provides the best fit to the extreme case. The case is an outlier because it has much more impact on the value of the regression coefficient than any of those inside the swarm.

Figure 4.1 Bivariate scatterplot showing the impact of an outlier (labeled A).

Outliers are found in both univariate and multivariate situations, among both dichotomous and continuous variables, among both IVs and DVs, and in both data and results of analyses. They lead to both Type I and Type II errors, frequently with no clue as to which effect they have in a particular analysis. And they lead to results that do not generalize except to another sample with the same kind of outlier.

There are four reasons for the presence of an outlier. First is incorrect data entry. Cases that are extreme should be checked carefully to see that data are correctly entered. Second is failure to specify missing-value codes in computer syntax so that missing-value indicators are read as real data. Third is that the outlier is not a member of the population from which you intended to sample. If the case should not have been sampled, it is deleted once it is detected. Fourth is that the case is from the intended population but the distribution for the variable in the population has more extreme values than a normal distribution. In this event, the researcher retains the case but considers changing the value on the variable(s) so that the case no longer has as much impact. Although errors in data entry and missing-values specification are easily found and remedied, deciding between alternatives three and four, between deletion and retention with alteration, is difficult.

4.1.4.1 DETECTING UNIVARIATE AND MULTIVARIATE OUTLIERS Univariate outliers are cases with an outlandish value on one variable; multivariate outliers are cases with an unusual combination of scores on two or more variables. For example, a 15-year-old is perfectly within bounds regarding age, and someone who earns $45,000 a year is in bounds regarding income, but a 15-year-old who earns $45,000 a year is very unusual and is likely to be a multivariate outlier. Multivariate outliers can occur when several different populations are mixed in the same sample or when some important variables are omitted that, if included, would attach the outlier to the rest of the cases.

Univariate outliers are easier to spot. Among dichotomous variables, the cases on the "wrong" side of a very uneven split are likely univariate outliers. Rummel (1970) suggests deleting dichotomous variables with 90-10 splits between categories, or more, both because the correlation coefficients between these variables and others are truncated and because the scores for the cases in the small category are more influential than those in the category with numerous cases. Dichotomous variables with extreme splits are easily found in the programs for frequency distributions (IBM SPSS FREQUENCIES or SAS UNIVARIATE) used during routine preliminary data screening.

Among continuous variables, the procedure for searching for outliers depends on whether data are grouped. If you are going to perform one of the analyses with ungrouped data (regression, canonical correlation, factor analysis, structural equation modeling, or some forms of time-series analysis), univariate and multivariate outliers are sought among all cases at once, as illustrated in Sections 4.2.1.1 (univariate) and 4.2.1.4 (multivariate). If you are going to perform one of the analyses with grouped data (ANCOVA, MANOVA or MANCOVA, profile analysis, discriminant analysis, logistic regression, survival analysis, or multilevel modeling), outliers are sought separately within each group, as illustrated in Sections 4.2.2.1 and 4.2.2.3.
Among continuous variables, univariate outliers are cases with very large standardized scores, z scores, on one or more variables, that are disconnected from the other z scores. Cases with standardized scores in excess of 3.29 (p < .001, two-tailed test) are potential outliers. However, the extremeness of a standardized score depends on the size of the sample; with a very large N, a few standardized scores in excess of 3.29 are expected. z scores are available through IBM SPSS EXPLORE or DESCRIPTIVES (where z scores are saved in the data file), and SAS STANDARD (with MEAN = 0 and STD = 1). Or you can hand-calculate z scores from any output that provides means, standard deviations, and maximum and minimum scores.

As an alternative or in addition to inspection of z scores, there are graphical methods for finding univariate outliers. Helpful plots are histograms, box plots, normal probability plots, or detrended normal probability plots. Histograms of variables are readily understood and available and may reveal one or more univariate outliers. There is usually a pileup of cases near the mean with cases trailing away in either direction. An outlier is a case (or a very few cases) that seems to be unattached to the rest of the distribution. Histograms for continuous variables are produced by IBM SPSS FREQUENCIES (preceded by SORT and SPLIT for grouped data), and SAS UNIVARIATE or CHART (with BY for grouped data).

Box plots are simpler and literally box in observations that are around the median; cases that fall far away from the box are extreme. Normal probability plots and detrended normal probability plots are very useful for assessing normality of distributions of variables and are discussed in that context in Section 4.1.5.1. However, univariate outliers are visible in these plots as points that lie at a considerable distance from others and the trend line.

Once potential univariate outliers are located, the researcher decides whether transformations are acceptable. Transformations (Section 4.1.6) are undertaken both to improve the normality of distributions (Section 4.1.5.1) and to pull univariate outliers closer to the center of a distribution, thereby reducing their impact. Transformations, if acceptable, are undertaken prior to the search for multivariate outliers because the statistics used to reveal them (Mahalanobis distance and its variants) are also sensitive to failures of normality.

Mahalanobis distance is the distance of a case from the centroid of the remaining cases, where the centroid is the point created at the intersection of the means of all the variables. In most data sets, the cases form a swarm around the centroid in multivariate space. Each case is represented in the swarm by a single point at its own peculiar combination of scores on all of the variables, just as each case is represented by a point at its own X, Y combination in a bivariate scatterplot. A case that is a multivariate outlier, however, lies outside the swarm, some distance from the other cases. Mahalanobis distance is one measure of that multivariate distance and it can be evaluated for each case using the χ² distribution. Mahalanobis distance is tempered by the patterns of variances and covariances among the variables. It gives lower weight to variables with large variances and to groups of highly correlated variables.
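A minimal sketch of both screening rules, |z| > 3.29 for univariate outliers and a conservative χ² cutoff on Mahalanobis distance for multivariate outliers, might look as follows; the function name and its arguments are illustrative, not any package's routine.

import numpy as np
from scipy.stats import chi2

def flag_outliers(X, alpha=0.001, z_cut=3.29):
    # X: cases-by-variables array of continuous scores
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    univariate = np.any(np.abs(z) > z_cut, axis=1)

    centered = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    md2 = np.einsum('ij,jk,ik->i', centered, inv_cov, centered)  # squared Mahalanobis distance
    multivariate = md2 > chi2.ppf(1 - alpha, df=k)               # conservative p < .001 cutoff
    return univariate, multivariate

Each returned array holds True for flagged cases. Note that this sketch measures each case's distance from the centroid of all cases; the text's definition uses the centroid of the remaining cases, so values reported by statistical packages may differ slightly.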
Under some conditions, Mahalanobis distance can either "mask" a real outlier (produce a false negative) or "swamp" a normal case (produce a false positive). Thus, it is not a perfectly reliable indicator of multivariate outliers and should be used with caution. Mahalanobis distances are requested and interpreted in Sections 4.2.1.4 and 4.2.2.3 and in numerous other places throughout the book. A very conservative probability estimate for a case being an outlier, say, p < .001 for the χ² value, is appropriate with Mahalanobis distance.

Other statistical measures used to identify multivariate outliers are leverage, discrepancy, and influence. Although developed in the context of multiple regression (Chapter 5), the three measures are now available for some of the other analyses. Leverage is related to Mahalanobis distance (or variations of it in the "hat" matrix) and is variously called HATDIAG, RHAT, or $h_{ii}$. Although leverage is related to Mahalanobis distance, it is measured on a different scale so that significance tests based on a χ² distribution do not apply.4 Equation 4.3 shows the relationship between leverage, $h_{ii}$, and Mahalanobis distance:

$$\text{Mahalanobis distance} = (N - 1)\left(h_{ii} - \frac{1}{N}\right) \tag{4.3}$$
4 Lunneborg (1994) suggests that outliers be defined as cases with $h_{ii} \geq 2(k/N)$, where k is the number of variables.
Or, as is sometimes more useful,

$$h_{ii} = \frac{\text{Mahalanobis distance}}{N - 1} + \frac{1}{N}$$
The latter form is handy if you want to find a critical value for leverage at α = .001 by translating the critical χ² value for Mahalanobis distance.

Cases with high leverage are far from the others, but they can be far out on basically the same line as the other cases, or far away and off the line. Discrepancy measures the extent to which a case is in line with the others. Figure 4.2(a) shows a case with high leverage and low discrepancy; Figure 4.2(b) shows a case with high leverage and high discrepancy. In Figure 4.2(c) is a case with low leverage and high discrepancy. In all of these figures, the outlier appears disconnected from the remaining scores.

Influence is a product of leverage and discrepancy (Fox, 1991). It assesses change in regression coefficients when a case is deleted; cases with influence scores larger than 1.00 are suspected of being outliers. Measures of influence are variations of Cook's distance and are identified in output as Cook's distance, modified Cook's distance, DFFITS, and DFBETAS. For the interested reader, Fox (1991, pp. 29-30) describes these terms in more detail.

Leverage and/or Mahalanobis distance values are available as statistical methods of outlier detection in both SAS and IBM SPSS. However, research (e.g., Egan and Morgan, 1998; Hadi and Simonoff, 1993; Rousseeuw and van Zomeren, 1990) indicates that these methods are not perfectly reliable. Unfortunately, alternative methods are computationally challenging and not readily available in statistical packages. Therefore, multivariate outliers are currently most easily detected through Mahalanobis distance, or one of its cousins, but cautiously.

Statistics assessing the distance for each case, in turn, from all other cases, are available through IBM SPSS REGRESSION by invoking Mahalanobis, Cook's, or Leverage values through the Save command in the Regression menu; these values are saved as separate columns in the data file and examined using standard descriptive procedures. To use the regression program just to find outliers, however, you must specify some variable (such as the case number) as DV, to find outliers among the set of variables of interest, considered IVs. Alternatively, the 10 cases with the largest Mahalanobis distances are printed out by IBM SPSS REGRESSION using the RESIDUALS subcommand, as demonstrated in Section 4.2.1.4. SAS regression programs provide a leverage, $h_{ii}$, value for each case that converts easily to Mahalanobis distance (Equation 4.3); see the sketch after Figure 4.2. These values are also saved to the data file and examined using standard statistical and graphical techniques.

When multivariate outliers are sought in grouped data, they are sought within each group separately. IBM SPSS and SAS REGRESSION require separate runs for each group, each with its own error term. Programs in other packages, such as SYSTAT DISCRIM, provide Mahalanobis distance for each case using a within-groups error term, so that outliers identified through those programs may be different from those identified by IBM SPSS and SAS REGRESSION.
Figure 4.2 The relationships among leverage, discrepancy, and influence: (a) high leverage, low discrepancy, moderate influence; (b) high leverage, high discrepancy, high influence; (c) low leverage, high discrepancy, moderate influence.
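As a minimal sketch of the leverage conversion mentioned above (the sample size, number of variables, and leverage value are all invented for illustration):

import numpy as np
from scipy.stats import chi2

N, k = 465, 4                        # hypothetical sample size and number of variables
h_ii = 0.05                          # a leverage value as saved by, e.g., a SAS regression run

md = (N - 1) * (h_ii - 1 / N)        # Equation 4.3: leverage to Mahalanobis distance
h_back = md / (N - 1) + 1 / N        # the alternate form: distance back to leverage

# Critical leverage at alpha = .001, translating the critical chi-square value
h_crit = chi2.ppf(0.999, df=k) / (N - 1) + 1 / N
print(md, h_back, h_crit)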
IBM SPSS DISCRIMINANT provides outliers in the solution. These are not particularly helpful for screening (you would not want to delete cases just because the solution doesn't fit them very well), but are useful to evaluate generalizability of the results.

Frequently, some multivariate outliers hide behind other multivariate outliers; outliers are known to mask other outliers (Rousseeuw and van Zomeren, 1990). When the first few cases identified as outliers are deleted, the data set becomes more consistent and then other cases become extreme. Robust approaches to this problem have been proposed (e.g., Egan and Morgan, 1998; Hadi and Simonoff, 1993; Rousseeuw and van Zomeren, 1990), but these are not implemented in popular software packages. These methods can be approximated by screening for multivariate outliers several times, each time dealing with cases identified as outliers on the last run, until finally no new outliers are identified. But if the process of identifying ever more outliers seems to stretch into infinity, do a trial run with and without outliers to see if ones identified later are truly influencing results. If not, do not delete or modify the later-identified outliers.

4.1.4.2 DESCRIBING OUTLIERS Once multivariate outliers are identified, you need to discover why the cases are extreme. (You already know why univariate outliers are extreme.) It is important to identify the variables on which the cases are deviant for three reasons. First, this procedure helps you decide whether the case is properly part of your sample. Second, if you are going to modify scores instead of delete cases, you have to know which scores to modify. Third, it provides an indication of the kinds of cases to which your results do not generalize.

If there are only a few multivariate outliers, it is reasonable to examine them individually. If there are several, you can examine them as a group to see if there are any variables that separate the group of outliers from the rest of the cases. Whether you are trying to describe one or a group of outliers, the trick is to create a dummy grouping variable where the outlier(s) has one value and the rest of the cases another value. The dummy variable is then used as the grouping DV in discriminant analysis (Chapter 9) or logistic regression (Chapter 10), or as the DV in regression (Chapter 5). The goal is to identify the variables that distinguish outliers from the other cases. Variables on which the outlier(s) differs from the rest of the cases enter the equation; the remaining variables do not. Once those variables are identified, means on those variables for outlying and nonoutlying cases are found through any of the routine descriptive programs. Description of outliers is illustrated in Sections 4.2.1.4 and 4.2.2.3.

4.1.4.3 REDUCING THE INFLUENCE OF OUTLIERS Once univariate outliers have been identified, there are several strategies for reducing their impact. But before you use one of them, check the data for the case to make sure that they are accurately entered into the data file. If the data are accurate, consider the possibility that one variable is responsible for most of the outliers. If so, elimination of the variable would reduce the number of outliers. If the variable is highly correlated with others or is not critical to the analysis, deletion of it is a good alternative.
If neither of these simple alternatives is reasonable, you must decide whether the cases that are outliers are properly part of the population from which you intended to sample. Cases with extreme scores, which are, nonetheless, apparently not discrepant from the rest of the cases, are more likely to be a legitimate part of the sample. If the cases are not part of the population, they are deleted with no loss of generalizability of results to your intended population. If you decide that the outliers are sampled from your target population, they remain in the analysis, but steps are taken to reduce their impact: variables are transformed or scores changed.

A first option for reducing the impact of univariate outliers is variable transformation, undertaken to change the shape of the distribution to more nearly normal, as sketched below. In this case, outliers are considered part of a nonnormal distribution with tails that are too heavy, so that too many cases fall at extreme values of the distribution. Cases that were outliers in the untransformed distribution are still on the tails of the transformed distribution, but their impact is reduced. Transformation of variables has other salutary effects, as described in Section 4.1.6.
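A minimal illustration of how a transformation pulls an outlier toward the rest of the distribution (the scores are invented; logarithmic and other transformations are discussed in Section 4.1.6):

import numpy as np
from scipy.stats import skew

scores = np.array([2, 3, 3, 4, 5, 6, 7, 8, 60], dtype=float)  # one extreme positive score
logged = np.log10(scores + 1)          # log transform for positive skewness

print(skew(scores), skew(logged))      # skewness is reduced after transformation

The extreme case is still the largest score after transformation, but it no longer dominates the distribution.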
Cleaning Up Your Act
A second option for univariate outliers is to change the score(s) on the variable(s) for the outlying case(s) so that they are deviant, but not as deviant as they were. For instance, assign the outlying case(s) a raw score on the offending variable that is one unit larger (or smaller) than the next most extreme score in the distribution. Because measurement of variables is sometimes rather arbitrary anyway, this is often an attractive alternative to reduce the impact of a univariate outlier.

Transformation or score alteration may not work for a truly multivariate outlier because the problem is with the combination of scores on two or more variables, not with the score on any one variable. The case is discrepant from the rest in its combinations of scores. Although the number of possible multivariate outliers is often substantially reduced after transformation or alteration of scores on variables, there are sometimes a few cases that are still far away from the others. These cases are usually deleted. If they are allowed to remain, it is with the knowledge that they may distort the results in almost any direction. Any transformations, changes of scores, and deletions are reported in the Results section together with the rationale.

4.1.4.4 OUTLIERS IN A SOLUTION Some cases may not fit well within a solution; the scores predicted for those cases by the selected model are very different from the actual scores for the cases. Such cases are identified after an analysis is completed, not as part of the screening process. To identify and eliminate or change scores for such cases before conducting the major analysis is to make the analysis look better than it should. Therefore, conducting the major analysis and then "retrofitting" is a procedure best limited to exploratory analysis. Chapters that describe techniques for ungrouped data deal with outliers in the solution when discussing the limitations of the technique.
4.1.5 Normality, Linearity, and Homoscedasticity

Underlying some multivariate procedures and most statistical tests of their outcomes is the assumption of multivariate normality. Multivariate normality is the assumption that each variable and all linear combinations of the variables are normally distributed. When the assumption is met, the residuals5 of analysis are also normally distributed and independent. The assumption of multivariate normality is not readily tested because it is impractical to test an infinite number of linear combinations of variables for normality. Those tests that are available are overly sensitive.

The assumption of multivariate normality is made as part of derivation of many significance tests. Although it is tempting to conclude that most inferential statistics are robust6 to violations of the assumption, that conclusion may not be warranted.7 For example, Schmider, Ziegler, Danay, Beyer, and Bühner (2010), using a high quality random number generator for simulations, found the expected number of Type I errors running 5000 ANOVAs with rectangular and exponential distributions with sample sizes of 25 per group. However, Bradley (1982) reported that statistical inference becomes less and less robust as distributions depart from normality, rapidly so under many conditions. And even when the statistics are used purely descriptively, normality, linearity, and homoscedasticity of variables enhance the analysis. The safest strategy, then, is to use transformations of variables to improve their normality unless there is some compelling reason not to.
5 Residuals are leftovers. They are the segments of scores not accounted for by the multivariate analysis. They are also called "errors" between predicted and obtained scores where the analysis provides the predicted scores. Note that the practice of using a dummy DV such as case number to investigate multivariate outliers will not produce meaningful residuals plots.
6 Robust means that the researcher is led to correctly reject the null hypothesis at a given alpha level the right number of times even if the distributions do not meet the assumptions of analysis. Often, Monte Carlo procedures are used where a distribution with some known properties is put into a computer, sampled from repeatedly, and repeatedly analyzed; the researcher studies the rates of retention and rejection of the null hypothesis against the known properties of the distribution in the computer.
7 The univariate F test of mean differences, for example, is frequently said to be robust to violation of assumptions of normality and homogeneity of variance with large and equal samples.
The assumption of multivariate normality applies differently to different multivariate statistics. For analyses when cases are not grouped, the assumption applies to the distributions of the variables themselves or to the residuals of the analyses; for analyses when cases are grouped, the assumption applies to the sampling distributions8 of means of variables.

If there is multivariate normality in ungrouped data, each variable is itself normally distributed and the relationships between pairs of variables, if present, are linear and homoscedastic (i.e., the variance of one variable is the same at all values of the other variable). The assumption of multivariate normality can be partially checked by examining the normality, linearity, and homoscedasticity of individual variables and pairs of them or through examination of residuals in analyses involving prediction.9 The assumption is certainly violated, at least to some extent, if the individual variables (or the residuals) are not normally distributed or do not have pairwise linearity and homoscedasticity.

For grouped data, it is the sampling distributions of the means of variables that are to be normally distributed. The Central Limit Theorem reassures us that, with sufficiently large sample sizes, sampling distributions of means are normally distributed regardless of the distributions of variables. For example, if there are at least 20 degrees of freedom for error in a univariate ANOVA, the F test is said to be robust to violations of normality of variables (provided that there are no outliers). These issues are discussed again in the third sections of Chapters 5 through 17 as they apply directly to one or another of the multivariate procedures. For nonparametric procedures such as multiway frequency analysis (Chapter 16) and logistic regression (Chapter 10), there are no distributional assumptions. Instead, distributions of scores typically are hypothesized, and observed distributions are tested against hypothesized distributions.

4.1.5.1 NORMALITY Screening continuous variables for normality is an important early step in almost every multivariate analysis, particularly when inference is a goal. Although normality of the variables is not always required for analysis, the solution is usually quite a bit better if the variables are all normally distributed. The solution is degraded if the variables are not normally distributed, and particularly if they are nonnormal in very different ways (e.g., some positively and some negatively skewed).

Normality of variables is assessed by either statistical or graphical methods. Two components of normality are skewness and kurtosis. Skewness has to do with the symmetry of the distribution; a skewed variable is a variable whose mean is not in the center of the distribution. Kurtosis has to do with the peakedness of a distribution; a distribution is either too peaked (with short, thick tails) or too flat (with long, thin tails).10 Figure 4.3 shows a normal distribution, distributions with skewness, and distributions with nonnormal kurtosis. A variable can have significant skewness, kurtosis, or both.

When a distribution is normal, the values of skewness and kurtosis are zero. If there is positive skewness, there is a pileup of cases to the left and the right tail is too long; with negative skewness, there is a pileup of cases to the right and the left tail is too long.
Kurtosis values above zero indicate a distribution that is too peaked with short, thick tails, and kurtosis values below zero indicate a distribution that is too flat (also with too many cases in the tails).11 Nonnormal kurtosis produces an underestimate of the variance of a variable.
8 A sampling distribution is a distribution of statistics (not of raw scores) computed from random samples of a given size taken repeatedly from a population. For example, in univariate ANOVA, hypotheses are tested with respect to the sampling distribution of means (Chapter 3).
9 Analysis of residuals to screen for normality, linearity, and homoscedasticity in multiple regression is discussed in Section 5.3.2.4.
10 If you decide that outliers are sampled from the intended population but that there are too many cases in the tails, you are saying that the distribution from which the outliers are sampled has kurtosis that departs from normal.
11 The equation for kurtosis gives a value of 3 when the distribution is normal, but all of the statistical packages subtract 3 before printing kurtosis so that the expected value is zero.
Figure 4.3 Normal distribution, distributions with skewness (positive and negative), and distributions with nonnormal kurtosis (positive and negative).
There are significance tests for both skewness and kurtosis that test the ob tained va lue against null hypotheses of zero. For ins tance, the s tandard e rror for skewness is approximately
s_s = √(6/N)    (4.4)
where N is the number of cases. The obtained skewness value is then compared with zero using the z distribution, where
z = (S − 0)/s_s    (4.5)
and S is the value reported for skewness. The standard error for kurtosis is approximately

s_k = √(24/N)    (4.6)
and the obtained kurtosis value is compared with zero using the z distribution, where
z = (K − 0)/s_k    (4.7)
and K is the value reported for kurtosis. Conventional but conservative (.01 or .001) alpha levels are used to evaluate the significance of skewness and kurtosis with small to moderate samples, but if the sample is large, it is a good idea to look at the shape of the distribution instead of using formal inference tests. Because the standard errors for both skewness and kurtosis decrease with larger N, the null hypothesis is likely to be rejected with large samples when there are only minor deviations from normality.
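Readers who want to verify these computations outside IBM SPSS or SAS can reproduce Equations 4.4 through 4.7 in a few lines. The following is a minimal Python sketch, assuming only NumPy and SciPy are installed; the function and variable names are illustrative, not part of any package:

import numpy as np
from scipy import stats

def normality_z_tests(scores):
    # z tests of skewness and kurtosis against zero (Eqs. 4.4 through 4.7)
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    s = stats.skew(scores)        # S, the skewness value
    k = stats.kurtosis(scores)    # K, with 3 already subtracted (expected value 0)
    return s / np.sqrt(6.0 / n), k / np.sqrt(24.0 / n)

rng = np.random.default_rng(0)
z_skew, z_kurt = normality_z_tests(rng.lognormal(size=465))
print(round(z_skew, 2), round(z_kurt, 2))   # both far beyond 3.29, the p = .001 cutoff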
In a large sample, a variable with statistically significant skewness often does not deviate enough from normality to make a substantive difference in the analysis. In other words, with large samples, the significance level of skewness is not as important as its actual size (worse the farther from zero) and the visual appearance of the distribution. In a large sample, the impact of departure from zero kurtosis also diminishes. For example, underestimates of variance associated with positive kurtosis (distributions with short, thick tails) disappear with samples of 100 or more cases; with negative kurtosis, underestimation of variance disappears with samples of 200 or more (Waternaux, 1976).

Values for skewness and kurtosis are available in several programs. IBM SPSS FREQUENCIES, for instance, prints as options skewness, kurtosis, and their standard errors and, in addition, superimposes a normal distribution over a frequency histogram for a variable if HISTOGRAM NORMAL is specified. DESCRIPTIVES and EXPLORE also print skewness and kurtosis statistics. A histogram or stem-and-leaf plot is also available in SAS UNIVARIATE.12

Frequency histograms are an important graphical device for assessing normality, especially with the normal distribution as an overlay, but even more helpful than frequency histograms are expected normal probability plots and detrended expected normal probability plots. In these plots, the scores are ranked and sorted; then an expected normal value is computed and compared with the actual normal value for each case. The expected normal value is the z score that a case with that rank holds in a normal distribution; the normal value is the z score it has in the actual distribution. If the actual distribution is normal, then the points for the cases fall along the diagonal running from lower left to upper right, with some minor deviations due to random processes. Deviations from normality shift the points away from the diagonal.

Consider the expected normal probability plots for ATTDRUG and TIMEDRS through IBM SPSS PPLOT in Figure 4.4. Syntax indicates that the VARIABLES we are interested in are attdrug and timedrs. The remaining syntax is produced by default by the IBM SPSS Windows menu system. As reported in Section 4.2.1.1, ATTDRUG is reasonably normally distributed (kurtosis = -0.447, skewness = -0.123) and TIMEDRS is too peaked and positively skewed (kurtosis = 13.101, skewness = 3.248, both significantly different from 0). The cases for ATTDRUG line up along the diagonal, whereas those for TIMEDRS do not. At low values of TIMEDRS, there are too many cases above the diagonal, and at high values, there are too many cases below the diagonal, reflecting the patterns of skewness and kurtosis.

Detrended normal probability plots for TIMEDRS and ATTDRUG are also in Figure 4.4. These plots are similar to expected normal probability plots except that deviations from the diagonal are plotted instead of values along the diagonal. In other words, the linear trend from lower left to upper right is removed. If the distribution of a variable is normal, as is ATTDRUG, the cases distribute themselves evenly above and below the horizontal line that intersects the Y-axis at 0.0, the line of zero deviation from expected normal values.
The skewness and kurtosis of TIMEDRS are again apparent from the cluster of points above the line at low values of TIMEDRS and below the line at high values of TIMEDRS. Normal probability plots for variables are also available in SAS UNIVARIATE and IBM SPSS MANOVA. Many of these programs also produce detrended normal plots.

If you are going to perform an analysis with ungrouped data, an alternative to screening the variables prior to analysis is conducting the analysis and then screening the residuals (the differences between the predicted and obtained DV values). If normality is present, the residuals are normally and independently distributed. That is, the differences between predicted and obtained scores (the errors) are symmetrically distributed around a mean value of zero and there are no contingencies among the errors. In multiple regression, residuals are also screened for normality through the expected normal probability plot and the detrended normal probability plot.13
12 In structural equation modeling (Chapter 14), skewness and kurtosis for each variable are available in EQS and Mardia's coefficient (the multivariate kurtosis measure) is available in EQS, PRELIS, and CALIS. In addition, PRELIS can be used to deal with nonnormality through alternative correlation coefficients, such as polyserial or polychoric (cf. Section 14.5.6).
PPLOT
  /VARIABLES = attdrug timedrs
  /NOLOG
  /NOSTANDARDIZE
  /TYPE = P-P
  /FRACTION = BLOM
  /TIES = MEAN
  /DIST = NORMAL.

[Figure 4.4 Expected normal probability plots and detrended normal probability plots for attitudes toward medication (ATTDRUG) and visits to health professionals (TIMEDRS), with observed cum prob on the X-axis. IBM SPSS PPLOT syntax and output.]

IBM SPSS REGRESSION provides this diagnostic technique (and others, as discussed in Chapter 5). If the residuals are normally distributed, the expected normal probability plot and the detrended normal probability plot look just the same as they do if a variable is normally distributed. In regression, if the residuals plot looks normal, there is no reason to screen the individual variables for normality. Although residuals will reveal departures from normality, the analyst has to resist the temptation to look at the rest of the output to avoid "tinkering" with variables and cases to produce an anticipated result. Because screening the variables should lead to the same conclusions as screening residuals, it may be more objective to make one's decisions about transformations, deletion of outliers, and the like, on the basis of screening runs alone rather than screening through the outcome of analysis.14
13 For grouped data, residuals have the same shape as within-group distributions because the predicted value is the mean, and subtracting a constant does not change the shape of the distribution. Many of the programs for grouped data plot the within-group distribution as an option, as discussed in the next few chapters when relevant.
14 We realize that others (e.g., Berry, 1993; Fox, 1991) have very different views about the wisdom of screening from residuals.
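The ranked expected-normal values behind these plots are also easy to compute directly. A minimal sketch in Python, assuming NumPy and SciPy, with Blom's rank fraction to match the /FRACTION = BLOM specification in the PPLOT run (names are illustrative):

import numpy as np
from scipy import stats

def detrended_normal_values(scores):
    scores = np.sort(np.asarray(scores, dtype=float))
    n = len(scores)
    ranks = np.arange(1, n + 1)
    expected_z = stats.norm.ppf((ranks - 0.375) / (n + 0.25))   # Blom fractions
    actual_z = (scores - scores.mean()) / scores.std(ddof=1)    # z in the actual distribution
    return expected_z, actual_z - expected_z

rng = np.random.default_rng(1)
expected_z, deviations = detrended_normal_values(rng.normal(size=465))
print(float(np.abs(deviations).max()))   # deviations hover near 0 for a normal sample

Plotting the deviations against expected_z gives the detrended plot; points scattered evenly about zero correspond to a normally distributed variable.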
With ungrouped data, if nonnormality is found, transformation of variables is considered. Common transformations are described in Section 4.1.6. Unless there are compelling reasons not to transform, it is probably better to do so. However, realize that even if each of the variables is normally distributed, or transformed to normal, there is no guarantee that all linear combinations of the variables are normally distributed. That is, if variables are each univariate normal, they do not necessarily have a multivariate normal distribution. However, it is more likely that the assumption of multivariate normality is met if all the variables are normally distributed.

With grouped data, normality is evaluated separately for each group. Transformation is considered if there are few error df and nonnormality is revealed. The same transformation is applied to all groups even though that transformation may not be the optimal one for every group.

4.1.5.2 LINEARITY

The assumption of linearity is that there is a straight-line relationship between two variables (where one or both of the variables can be combinations of several variables). Linearity is important in a practical sense because Pearson's r captures only the linear relationships among variables; if there are substantial nonlinear relationships among variables, they are ignored.

Nonlinearity is diagnosed either from residuals plots in analyses involving a predicted variable or from bivariate scatterplots between pairs of variables. In plots where standardized residuals are plotted against predicted values, nonlinearity is indicated when most of the residuals are above the zero line on the plot at some predicted values and below the zero line at other predicted values (see Chapter 5). Linearity between two variables is assessed roughly by inspection of bivariate scatterplots. If both variables are normally distributed and linearly related, the scatterplot is oval-shaped. If one of the variables is nonnormal, then the scatterplot between this variable and the other is not oval. Examination of bivariate scatterplots is demonstrated in Section 4.2.1.2, along with transformation of a variable to enhance linearity.

However, sometimes the relationship between variables is simply not linear. Consider, for instance, the number of symptoms and the dosage of drug, as shown in Figure 4.5(a). It seems likely that there are lots of symptoms when the dosage is low, only a few symptoms when the dosage is moderate, and lots of symptoms again when the dosage is high. Number of symptoms and drug dosage are curvilinearly related. One alternative in this case is to use the square of dosage to represent the curvilinear relationship instead of dosage in the analysis. Another alternative is to recode dosage into two dummy variables (high vs. low on one dummy variable and a combination of high and low vs. medium on another dummy variable) and then use the dummy variables in place of dosage in the analysis.15 The dichotomous dummy variables can only have a linear relationship with other variables, if, indeed, there is any relationship at all after recoding.

Often, two variables have a mix of linear and curvilinear relationships, as shown in Figure 4.5(b). One variable generally gets smaller (or larger) as the other gets larger (or smaller), but there is also a curve to the relationship.
For instance, symptoms might drop off with increasing dosage, but only to a point; increasing dosage beyond that point does not result in further reduction or increase in symptoms. In this case, the linear component may be strong enough that not much is lost by ignoring the curvilinear component unless it has important theoretical implications.

Assessing linearity through bivariate scatterplots is reminiscent of reading tea leaves, especially with small samples. And there are many cups of tea if there are several variables and all possible pairs are examined, especially when cases are grouped and the analysis is done separately within each group. If there are only a few variables, screening all possible pairs is not burdensome; if there are numerous variables, you may want to use statistics on skewness to screen only pairs that are likely to depart from linearity. Think, also, about pairs of variables that might have true nonlinearity and examine them through bivariate scatterplots. Bivariate scatterplots are produced by IBM SPSS GRAPH and SAS PLOT, among other programs.

15 A nonlinear analytic strategy is most appropriate here, such as nonlinear regression through SAS NLIN, but such strategies are beyond the scope of this book.
[Figure 4.5 Curvilinear relationship and curvilinear plus linear relationship. Both panels plot number of symptoms against DOSAGE (low, moderate, high): (a) curvilinear; (b) curvilinear + linear.]
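The two remedies for a purely curvilinear relationship (a squared term or dummy coding) can be sketched as follows in Python, assuming NumPy and pandas; the dosage and symptom scores are simulated here, and the cut points are arbitrary illustrations:

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"dosage": rng.uniform(0, 10, size=200)})
df["symptoms"] = (df["dosage"] - 5) ** 2 + rng.normal(scale=2, size=200)  # U-shaped relation

# Option 1: a squared term; centering first also keeps it from being
# collinear with the linear term (see Chapter 5).
df["dosage_c_sq"] = (df["dosage"] - df["dosage"].mean()) ** 2

# Option 2: dummy variables, which can only relate linearly to other variables.
df["hi_vs_lo"] = (df["dosage"] > 5).astype(int)
df["ends_vs_mid"] = ((df["dosage"] < 3.3) | (df["dosage"] > 6.7)).astype(int)

print(round(df["symptoms"].corr(df["dosage"]), 2),       # near 0: linear r misses the curve
      round(df["symptoms"].corr(df["dosage_c_sq"]), 2))  # large: the curve is captured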
4.1.5.3 HOMOSCEDASTICITY, HOMOGENEITY OF VARIANCE, AND HOMOGENEITY OF VARIANCE-COVARIANCE MATRICES

For ungrouped data, the assumption of homoscedasticity is that the variability in scores for one continuous variable is roughly the same at all values of another continuous variable. For grouped data, this is the same as the assumption of homogeneity of variance, where one of the variables is discrete (the grouping variable) and the other is continuous (the DV); the variability in the DV is expected to be about the same at all levels of the grouping variable.

Homoscedasticity is related to the assumption of normality because when the assumption of multivariate normality is met, the relationships between variables are homoscedastic. The bivariate scatterplots between two variables are of roughly the same width all over, with some bulging toward the middle. Homoscedasticity for a bivariate plot is illustrated in Figure 4.6(a).

Heteroscedasticity, the failure of homoscedasticity, is caused either by nonnormality of one of the variables or by the fact that one variable is related to some transformation of the other. Consider, for example, the relationship between age (X1) and income (X2) as depicted in Figure 4.6(b). People start out making about the same salaries, but with increasing age, people spread farther apart on income. The relationship is perfectly lawful, but it is not homoscedastic. In this example, income is likely to be positively skewed, and transformation of income is likely to improve the homoscedasticity of its relationship with age.

Another source of heteroscedasticity is a greater error of measurement at some levels of an IV. For example, people in the age range 25 to 45 might be more concerned about their weight than people who are younger or older. Older and younger people would, as a result, give less reliable estimates of their weight, increasing the variance of weight scores at those ages.
[Figure 4.6 Bivariate scatterplots under conditions of homoscedasticity and heteroscedasticity: (a) homoscedasticity with both variables normally distributed; (b) heteroscedasticity with skewness on X2.]

It should be noted that heteroscedasticity is not fatal to an analysis of ungrouped data. The linear relationship between variables is captured by the analysis, but there is even more predictability if the heteroscedasticity is accounted for. If it is not, the analysis is weakened, but not invalidated.

When data are grouped, homoscedasticity is known as homogeneity of variance. A great deal of research has assessed the robustness (or lack thereof) of ANOVA and ANOVA-like analyses to violation of homogeneity of variance. Recent guidelines have become more stringent than earlier, more cavalier ones. There are formal tests of homogeneity of variance, but most are too strict because they also assess normality. (An exception is Levene's test of homogeneity of variance, which is not typically sensitive to departures from normality.) Instead, once outliers are eliminated, homogeneity of variance is assessed with Fmax in conjunction with sample-size ratios. Fmax is the ratio of the largest cell variance to the smallest. If sample sizes are relatively equal (within a ratio of 4 to 1 or less for largest to smallest cell size), an Fmax as great as 10 is acceptable. As the cell size discrepancy increases (say, goes to 9 to 1 instead of 4 to 1), an Fmax as small as 3 is associated with inflated Type I error if the larger variance is associated with the smaller cell size.

However, heterogeneity of variance also poses a power problem. Although Type I error is controlled when the larger group has the larger variance, Type II error is inflated; that is, power can be quite diminished. Power is even diminished when the larger group has the smaller variance; in that case, both Type I and Type II errors are inflated. In general, power is lower when Type I error is most conservative and higher when Type I error is inflated (Milligan, Wong, and Thompson, 1987).

Violations of homogeneity usually can be corrected by transformation of the DV scores. Interpretation, however, is then limited to the transformed scores. Another option is to use untransformed variables with a more stringent α level (for nominal α, use .025 with moderate violation and .01 with severe violation).

The multivariate analog of homogeneity of variance is homogeneity of variance-covariance matrices. As for univariate homogeneity of variance, an inflated Type I error rate occurs when the greatest dispersion is associated with the smallest sample size. The formal test used by IBM SPSS, Box's M, is too strict with the large sample sizes usually necessary for multivariate applications of ANOVA. Section 9.7.1.5 demonstrates an assessment of homogeneity of variance-covariance matrices through SAS DISCRIM using Bartlett's test. SAS DISCRIM permits a stringent α level for determining heterogeneity and bases the discriminant analysis on separate variance-covariance matrices when the assumption of homogeneity is violated.
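The Fmax guideline is simple enough to script. A minimal sketch, assuming NumPy, with the 4-to-1 sample-size ratio and the 10-versus-3 variance-ratio limits taken from the discussion above (the simulated groups are illustrative):

import numpy as np

def fmax_check(*groups):
    variances = [np.var(g, ddof=1) for g in groups]
    sizes = [len(g) for g in groups]
    fmax = max(variances) / min(variances)             # largest over smallest cell variance
    limit = 10 if max(sizes) / min(sizes) <= 4 else 3  # stricter as cell sizes diverge
    return fmax, fmax <= limit

rng = np.random.default_rng(3)
print(fmax_check(rng.normal(scale=1, size=60), rng.normal(scale=2, size=45)))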
4.1.6 Common Data Transformations

Although data transformations are recommended as a remedy for outliers and for failures of normality, linearity, and homoscedasticity, they are not universally recommended. The reason is that an analysis is interpreted from the variables that are in it, and transformed variables are sometimes harder to interpret. For instance, although IQ scores are widely understood and meaningfully interpreted, the logarithm of IQ scores may be harder to explain. Whether transformation increases difficulty of interpretation often depends on the scale in which the variable is measured. If the scale is meaningful or widely used, transformation often hinders interpretation, but if the scale is somewhat arbitrary anyway (as is often the case), transformation does not notably increase the difficulty of interpretation.

With ungrouped data, it is probably best to transform variables to normality unless interpretation is not feasible with the transformed scores. With grouped data, the assumption of normality is evaluated with respect to the sampling distribution of means (not the distribution of scores), and the Central Limit Theorem predicts normality with decently sized samples. However, transformations may improve the analysis and may have the further advantage of reducing the impact of outliers. Our recommendation, then, is to consider transformation of variables in all situations unless there is some reason not to.

If you decide to transform, it is important to check that the variable is normally or near-normally distributed after transformation. Often you need to try first one transformation and then another until you find the transformation that produces the skewness and kurtosis values nearest zero, the prettiest picture, and/or the fewest outliers. With almost every data set in which we have used transformations, the results of analysis have been substantially improved. This is particularly true when some variables are skewed and others are not, or variables are skewed very differently prior to transformation. However, if all the variables are skewed to about the same moderate extent, improvements of analysis with transformation are often marginal.

With grouped data, the test of mean differences after transformation is a test of differences between medians in the original data. After a distribution is normalized by transformation, the mean is equal to the median. The transformation affects the mean but not the median because the median depends only on rank order of cases. Therefore, conclusions about means of transformed distributions apply to medians of untransformed distributions. Transformation is undertaken because the distribution is skewed and the mean is not a good indicator of the central tendency of the scores in the distribution. For skewed distributions, the median is often a more appropriate measure of central tendency than the mean, anyway, so interpretation of differences in medians is appropriate.

Variables differ in the extent to which they diverge from normal. Figure 4.7 presents several distributions together with the transformations that are likely to render them normal. If the distribution differs moderately from normal, a square root transformation is tried first. If the distribution differs substantially, a log transformation is tried. If the distribution differs severely, the inverse is tried.
According to Bradley (1982), the inverse is the best of several alternatives for J-shaped distributions, but even it may not render the distribution normal. Finally, if the departure from normality is severe and no transformation seems to help, you may want to try dichotomizing the variable at the median or trichotomizing it in a way that the middle category has about half the cases and the remaining categories each have about 25% of the cases.

The direction of the deviation is also considered. When distributions have positive skewness, as discussed earlier, the long tail is to the right. When they have negative skewness, the long tail is to the left. If there is negative skewness, the best strategy is to reflect the variable and then apply the appropriate transformation for positive skewness.16 To reflect a variable, find the largest score in the distribution. Then create a new variable by subtracting each score from the largest score plus 1. In this way, a variable with negative skewness is converted to one with positive skewness prior to transformation.
16 Remember, however, that the interpretation of a reflected variable is just the opposite of what it was; if big numbers meant good things prior to reflecting the variable, big numbers mean bad things afterward.
[Figure 4.7 Original distributions and common transformations to produce normality: square root; reflect and square root; reflect and logarithm; reflect and inverse; and their unreflected counterparts.]
When you interpret a reflected variable, be sure to reverse the direction of the interpretation as well (or consider re-reflecting it after transformation).

Remember to check your transformations after applying them. If a variable is only moderately positively skewed, for instance, a square root transformation may make the variable moderately negatively skewed, and there is no advantage to transformation. Often you have to try several transformations before you find the most helpful one.

Syntax for transforming variables in IBM SPSS and SAS is in Table 4.3.17 Notice that a constant is also added if the distribution contains a value less than one. A constant (to bring the smallest value to at least one) is added to each score to avoid taking the log, square root, or inverse of zero. Different software packages handle missing data differently in various transformations. Be sure to check the manual to ensure that the program is treating missing data the way you want it to in the transformation.

It should be clearly understood that this section merely scratches the surface of the topic of transformations, about which a great deal more is known. The interested reader is referred to classic treatises such as Box and Cox (1964), Mosteller and Tukey (1977), or Yeo and Johnson (2000) for a more flexible and challenging approach to the problem of transformation.
17 Logarithmic (LO) and power (PO) transformations are also available in PRELIS for variables used in structural equation modeling (Chapter 14). A γ (GA) value is specified for power transformations; for example, γ = 1/2 provides a square root transform (PO GA = .5).
Table 4.3 Syntax for Common Data Transformations

                                        IBM SPSS COMPUTE     SAS DATA Procedure
Moderate positive skewness              NEWX=SQRT(X)         NEWX=SQRT(X)
Substantial positive skewness           NEWX=LG10(X)         NEWX=LOG10(X)
  With zero                             NEWX=LG10(X+C)       NEWX=LOG10(X+C)
Severe positive skewness, L-shaped      NEWX=1/X             NEWX=1/X
  With zero                             NEWX=1/(X+C)         NEWX=1/(X+C)
Moderate negative skewness              NEWX=SQRT(K-X)       NEWX=SQRT(K-X)
Substantial negative skewness           NEWX=LG10(K-X)       NEWX=LOG10(K-X)
Severe negative skewness, J-shaped      NEWX=1/(K-X)         NEWX=1/(K-X)

C = a constant added to each score so that the smallest score is 1.
K = a constant from which each score is subtracted so that the smallest score is 1; usually equal to the largest score + 1.
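The same recipes can be written as a single helper outside SPSS or SAS. A hedged Python sketch, assuming NumPy, with C and K defined as in the notes to Table 4.3:

import numpy as np

def transform(x, severity="moderate", negative_skew=False):
    x = np.asarray(x, dtype=float)
    if negative_skew:
        x = (x.max() + 1) - x       # reflect: K = largest score + 1
    if x.min() < 1:
        x = x + (1 - x.min())       # C brings the smallest score up to 1
    if severity == "moderate":
        return np.sqrt(x)
    if severity == "substantial":
        return np.log10(x)
    return 1.0 / x                  # severe (L- or J-shaped): inverse

Re-check skewness and kurtosis after each attempt, and remember that a reflected variable reverses the direction of interpretation.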
4.1.7 Multicollinearity and Singularity

Multicollinearity and singularity are problems with a correlation matrix that occur when variables are too highly correlated. With multicollinearity, the variables are very highly correlated (say, .90 and above); with singularity, the variables are redundant; one of the variables is a combination of two or more of the other variables.

For example, scores on the Wechsler Adult Intelligence Scale (the WAIS) and scores on the Stanford-Binet Intelligence Scale are likely to be multicollinear because they are two similar measures of the same thing. But the total WAIS IQ score is singular with its subscales because the total score is found by combining subscale scores. When variables are multicollinear or singular, they contain redundant information and they are not all needed in the same analysis. In other words, there are fewer variables than it appears, and the correlation matrix is not full rank because there are not really as many variables as columns.

Either bivariate or multivariate correlations can create multicollinearity or singularity. If a bivariate correlation is too high, it shows up in a correlation matrix as a correlation above .90, and, after deletion of one of the two redundant variables, the problem is solved. If it is a multivariate correlation that is too high, diagnosis is slightly more difficult because multivariate statistics are needed to find the offending variable. For example, although the WAIS IQ is a combination of its subscales, the bivariate correlations between total IQ and each of the subscale scores are not all that high. You would not know that there was singularity by examination of the correlation matrix.

Multicollinearity and singularity cause both logical and statistical problems. The logical problem is that unless you are doing analysis of structure (factor analysis, principal components analysis, and structural equation modeling), it is not a good idea to include redundant variables in the same analysis. They are not needed, and because they inflate the size of error terms, they actually weaken an analysis. Unless you are doing analysis of structure or are dealing with repeated measures of the same variable (as in various forms of ANOVA including profile analysis), think carefully before including two variables with a bivariate correlation of, say, .70 or more in the same analysis. You might omit one of the variables, or you might create a composite score from the redundant variables.

The statistical problems created by singularity and multicollinearity occur at much higher correlations (.90 and higher). The problem is that singularity prohibits, and multicollinearity renders unstable, matrix inversion. Matrix inversion is the logical equivalent of division; calculations requiring division (and there are many of them; see the fourth sections of Chapters 5 through 17) cannot be performed on singular matrices because they produce determinants equal to zero that cannot be used as divisors (see Appendix A). Multicollinearity often occurs when you form cross products or powers of variables and include them in the
analysis along with the original variables, unless steps are taken to reduce the multicollinearity (Section 5.6.6). With multicollinearity, the determinant is not exactly zero, but it is zero to several decimal places. Division by a near-zero determinant produces very large and unstable numbers in the inverted matrix. The sizes of numbers in the inverted matrix fluctuate wildly with only minor changes (say, in the second or third decimal place) in the sizes of the correlations in R. The portions of the multivariate solution that flow from an inverted matrix that is unstable are also unstable. In regression, for instance, error terms get so large that none of the coefficients is significant (Berry, 1993). For example, when r is .9, the precision of estimation of regression coefficients is halved (Fox, 1991).

Most programs protect against multicollinearity and singularity by computing SMCs for the variables. SMC is the squared multiple correlation of a variable where it serves as DV with the rest as IVs in multiple correlation (see Chapter 5). If the SMC is high, the variable is highly related to the others in the set and you have multicollinearity. If the SMC is 1, the variable is perfectly related to others in the set and you have singularity. Many programs convert the SMC values for each variable to tolerance (1 - SMC) and deal with tolerance instead of SMC.

Screening for singularity often takes the form of running your main analysis to see if the computer balks. Singularity aborts most runs except those for principal components analysis (see Chapter 13), where matrix inversion is not required. If the run aborts, you need to identify and delete the offending variable. A first step is to think about the variables: "Did you create any of the variables from other variables; for instance, did you create one of them by adding two others?" If so, deletion of one removes singularity.

Screening for multicollinearity that causes statistical instability is also routine with most programs because they have tolerance criteria for inclusion of variables. If the tolerance (1 - SMC) is too low, the variable does not enter the analysis. Default tolerance levels range between .01 and .0001, so SMCs are .99 to .9999 before variables are excluded. You may wish to take control of this process, however, by adjusting the tolerance level (an option with many programs) or deciding yourself which variable(s) to delete instead of letting the program make the decision on purely statistical grounds. For this you need SMCs for each variable. Note that SMCs are not evaluated separately for each group if you are analyzing grouped data. SMCs are available through factor analysis and regression programs in all packages. PRELIS provides SMCs for structural equation modeling.

SAS and IBM SPSS have incorporated collinearity diagnostics proposed by Belsley, Kuh, and Welsch (1980) in which a conditioning index is produced, as well as variance proportions associated with each variable, after standardization, for each root (see Chapters 12 and 13 and Appendix A for a discussion of roots and dimensions). Two or more variables with large variance proportions on the same dimension are those with problems. Condition index is a measure of tightness or dependency of one variable on the others. The condition index is monotonic with SMC, but not linear with it.
A high condition index is associated with variance inflation in the standard error of the parameter estimate for a variable. When its standard error becomes very large, the parameter estimate is highly uncertain. Each root (dimension) accounts for some proportion of the variance of each parameter estimated. A collinearity problem occurs when a root with a high condition index contributes strongly (has a high variance proportion) to the variance of two or more variables. Criteria for multicollinearity suggested by Belsley et al. (1980) are a conditioning index greater than 30 for a given dimension coupled with variance proportions greater than .50 for at least two different variables. Collinearity diagnostics are demonstrated in Section 4.2.1.6.

There are several options for dealing with collinearity if it is detected. First, if the only goal of analysis is prediction, you can ignore it. A second option is to delete the variable with the highest variance proportion. A third option is to sum or average the collinear variables. A fourth option is to compute principal components and use the components as the predictors instead of the original variables (see Chapter 13). A final alternative is to center one or more of the variables, as discussed in Chapters 5 and 15, if multicollinearity is caused by forming interactions or powers of continuous variables.
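For readers working outside the packages, SMC, tolerance, and condition indices can be approximated directly. A sketch assuming NumPy; the data matrix is simulated with one built-in near-redundancy, and the condition indices are computed from a standardized data matrix, a common variant of the Belsley et al. column scaling:

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(465, 4))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=465)   # make the fourth column nearly redundant

R = np.corrcoef(X, rowvar=False)
smc = 1 - 1 / np.diag(np.linalg.inv(R))           # squared multiple correlations
tolerance = 1 - smc                               # what the packages screen on

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
singular_values = np.linalg.svd(Z, compute_uv=False)
condition_indices = singular_values[0] / singular_values   # values > 30 flag trouble
print(np.round(tolerance, 4), np.round(condition_indices, 1))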
Table 4.4 Checklist for Screening Data
1. Inspect univariate descriptive statistics for accuracy of input
   a. Out-of-range values
   b. Plausible means and standard deviations
   c. Univariate outliers
2. Evaluate amount and distribution of missing data; deal with problem
3. Check pairwise plots for nonlinearity and heteroscedasticity
4. Identify and deal with nonnormal variables and univariate outliers
   a. Check skewness and kurtosis, probability plots
   b. Transform variables (if desirable)
   c. Check results of transformation
5. Identify and deal with multivariate outliers
   a. Variables causing multivariate outliers
   b. Description of multivariate outliers
6. Evaluate variables for multicollinearity and singularity
4.1.8 A Checklist and Some Practical Recommendations

Table 4.4 is a checklist for screening data. It is important to consider all the issues prior to the fundamental analysis, lest you be tempted to make some of your decisions based on their influence on the analysis. If you choose to screen through residuals, you cannot avoid doing an analysis at the same time; however, in these cases, you concentrate on the residuals and not on the other features of the analysis while making your screening decisions.

The order in which screening takes place is important because the decisions that you make at one step influence the outcomes of later steps. In a situation where you have both nonnormal variables and potential univariate outliers, a fundamental decision is whether you would prefer to transform variables, delete cases, or change scores on cases. If you transform variables first, you are likely to find fewer outliers. If you delete or modify the outliers first, you are likely to find fewer variables with nonnormality.

Of the two choices, transformation of variables is usually preferable. It typically reduces the number of outliers. It is likely to produce normality, linearity, and homoscedasticity among the variables. It increases the likelihood of multivariate normality to bring the data into conformity with one of the fundamental assumptions of most inferential tests. And on a very practical level, it usually enhances the analysis even if inference is not a goal. On the other hand, transformation may threaten interpretation, in which case all the statistical niceties are of little avail. Or, if the impact of outliers is reduced first, you are less likely to find variables that are skewed because significant skewness is sometimes caused by extreme cases on the tails of the distributions. If you have cases that are univariate outliers because they are not part of the population from which you intended to sample, by all means delete them before checking distributions.

Last, as will become obvious in the rest of this chapter, although the issues are different, the runs on which they are screened are not necessarily different. That is, the same run often provides you with information regarding two or more issues.
4.2 Complete Examples of Data Screening

Evaluation of assumptions is somewhat different for ungrouped and grouped data. That is, if you are going to perform multiple regression, canonical correlation, factor analysis, or structural equation modeling on ungrouped data, there is one approach to screening. If you are going to perform univariate or multivariate analysis of variance (including profile analysis),
discriminant analysis, or multilevel modeling on grouped data, there is another approach to screening.18 Therefore, two complete examples are presented that use the same set of variables taken from the research described in Appendix B.1: number of visits to health professionals (TIMEDRS), attitudes toward drug use (ATTDRUG), attitudes toward housework (ATTHOUSE), INCOME, marital status (MSTATUS), and RACE. The grouping variable used in the analysis of grouped data is current employment status (EMPLMNT).19 Data are in files labeled SCREEN.* (e.g., SCREEN.sav for IBM SPSS or SCREEN.sas7bdat for SAS). Where possible in these examples, and for illustrative purposes, screening for ungrouped data is performed using IBM SPSS, and screening for grouped data is performed using SAS programs.
4.2.1 Screening Ungrouped Data

A flow diagram for screening ungrouped data appears as Figure 4.8. The direction of flow assumes that data transformation is undertaken, as necessary. If transformation is not acceptable, then other procedures for handling outliers are used.
[Figure 4.8 Flow diagram for screening ungrouped data. Searching for plausible range, missing values, normality, and univariate outliers: IBM SPSS FREQUENCIES, DESCRIPTIVES, or EXPLORE; SAS MEANS or UNIVARIATE (check distributions; if not o.k., transform). Searching for pairwise linearity: IBM SPSS PLOT; SAS PLOT (or check residuals: IBM SPSS REGRESSION; SAS REG). Deal with missing data: IBM SPSS MVA; SAS MI. Searching for multivariate outliers: IBM SPSS REGRESSION; SAS REG (describe outliers: IBM SPSS DISCRIMINANT, LIST VARIABLES; SAS MEANS, DATA). Onward.]
18 If you are using multiway frequency analysis or logistic regression, there are far fewer assumptions than with these other analyses.
19 This is a motley collection of variables chosen primarily for their statistical properties.
4.2.1.1 ACCURACY OF INPUT, MISSING DATA, DISTRIBUTIONS, AND UNIVARIATE OUTLIERS

A check on accuracy of data entry, missing data, skewness, and kurtosis for the data set is done through IBM SPSS FREQUENCIES, as shown in Table 4.5. The minimum and maximum values, means, and standard deviations of each of the variables are inspected for plausibility. For instance, the Minimum number of visits to health professionals (TIMEDRS) is 0 and the Maximum is 81, higher than expected but found to be accurate on checking the data sheets.20 The Mean for the variable is 7.901, higher than the national average but not extremely so, and the standard deviation (Std. Deviation) is 10.948. These values are all reasonable, as are the values on the other variables. For instance, the ATTDRUG variable is constructed with a range of 5 to 10, so it is reassuring to find these values as Minimum and Maximum.

TIMEDRS shows no Missing cases but has strong positive Skewness (3.248). The significance of Skewness is evaluated by dividing it by Std. Error of Skewness, as in Equation 4.5,
z = 3.248/.113 = 28.74
to reveal a clear departure from symmetry. The distribution also has significant Kurtosis as evaluated by Equation 4.7:
z = 13.101/.226 = 57.97
The departures from normality are also obvious from inspection of the difference between frequencies expected under the normal distribution (the superimposed curve) and obtained frequencies. Because this variable is a candidate for transformation, evaluation of univariate outliers is deferred.

ATTDRUG, on the other hand, is well behaved. There are no Missing cases, and Skewness and Kurtosis are well within expected values. ATTHOUSE has a single missing value but is otherwise well distributed except for the two extremely low scores. The score of 2 is 4.8 standard deviations below the mean of ATTHOUSE (well beyond the p = .001 criterion of 3.29, two-tailed) and is
Table 4.5 Syntax and IBM SPSS FREQUENCIES Output Showing Descriptive Statistics and Histograms for Ungrouped Data

FREQUENCIES VARIABLES=timedrs attdrug atthouse income mstatus race
  /FORMAT=NOTABLE
  /STATISTICS=STDDEV VARIANCE MINIMUM MAXIMUM MEAN SKEWNESS SESKEW KURTOSIS SEKURT
  /HISTOGRAM NORMAL
  /ORDER=ANALYSIS.
Frequencies

Statistics

                          TIMEDRS   ATTDRUG   ATTHOUSE   INCOME   MSTATUS   RACE
N        Valid            465       465       464        439      465       465
         Missing          0         0         1          26       0         0
Mean                      7.90      7.69      23.54      4.21     1.78      1.09
Std. Deviation            10.948    1.156     4.484      2.419    .416      .284
Variance                  119.870   1.337     20.102     5.851    .173      .081
Skewness                  3.248     -.123     -.457      .582     -1.346    2.914
Std. Error of Skewness    .113      .113      .113       .117     .113      .113
Kurtosis                  13.101    -.447     1.556      -.359    -.190     6.521
Std. Error of Kurtosis    .226      .226      .226       .233     .226      .226
Minimum                   0         5         2          1        1         1
Maximum                   81        10        35         10       2         2

(TIMEDRS = visits to health professionals; ATTDRUG = attitudes toward medication; ATTHOUSE = attitudes toward housework; MSTATUS = whether currently married.)

(Continued)
20 The woman with this number of visits was terminally ill when she was interviewed.
Table 4.5 (Continued)

[Histograms with superimposed normal curves for visits to health professionals, attitudes toward medication, attitudes toward housework, income, whether currently married, and race.]
disconnected from the other cases. It is not clear whether these are recording errors or if these two women actually enjoy housework that much. In any event, the decision is made to delete from further analysis the data from the two women with extremely favorable attitudes toward housework.
Information about these deletions is included in the report of results. The single missing value is replaced with the mean. (Section 10.7.1.1 illustrates a more sophisticated way of dealing with missing data in IBM SPSS when the amount missing is greater than 5%.) On INCOME, however, there are 26 cases with Missing values, more than 5% of the sample. If INCOME is not critical to the hypotheses, we delete it in subsequent analyses. If INCOME is important to the hypotheses, we could replace the missing values.

The two remaining variables are dichotomous and not evenly split. MSTATUS has a 362 to 103 split, roughly a 3.5 to 1 ratio, that is not particularly disturbing. But RACE, with a split greater than 10 to 1, is marginal. For this analysis, we choose to retain the variable, realizing that its association with other variables is deflated because of the uneven split.

Table 4.6 shows the distribution of ATTHOUSE with elimination of the univariate outliers. The mean for ATTHOUSE changes to 23.634, the value used to replace the missing ATTHOUSE score in subsequent analyses. The case with a missing value on ATTHOUSE becomes complete and available for use in all computations. The COMPUTE instructions filter out cases with values equal to or less than 2 on ATTHOUSE (univariate outliers), and the RECODE instruction sets the missing value to 23.63.

At this point, we have investigated the accuracy of data entry and the distributions of all variables, determined the number of missing values, found the mean for replacement of missing data, and found two univariate outliers that, when deleted, result in N = 463.
Table 4.6 Syntax and IBM SPSS FREQUENCIES Output Showing Descriptive Statistics and Histograms for ATTHOUSE with Univariate Outliers Deleted

USE ALL.
COMPUTE filter_$=(atthouse>2).
VARIABLE LABELS filter_$ 'atthouse>2 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
RECODE atthouse (SYSMIS=23.63).
EXECUTE.
FREQUENCIES VARIABLES=atthouse
  /FORMAT=NOTABLE
  /STATISTICS=STDDEV VARIANCE MINIMUM MAXIMUM MEAN SKEWNESS SESKEW KURTOSIS SEKURT
  /HISTOGRAM NORMAL
  /ORDER=ANALYSIS.
[Histogram with superimposed normal curve and descriptive statistics for ATTHOUSE after deletion of the two univariate outliers; Mean = 23.63.]
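The same two clean-up steps can be sketched in Python with pandas: drop the two ATTHOUSE outliers, then mean-replace the single missing value. This assumes the SCREEN file loads into a DataFrame with an ATTHOUSE column (pd.read_spss requires the pyreadstat package; adjust the column name to your file):

import pandas as pd

screen = pd.read_spss("SCREEN.sav", convert_categoricals=False)
keep = screen["ATTHOUSE"].isna() | (screen["ATTHOUSE"] > 2)
screen = screen[keep]                                   # drop only the two outliers
fill = round(screen["ATTHOUSE"].mean(), 2)              # 23.63 in these data
screen["ATTHOUSE"] = screen["ATTHOUSE"].fillna(fill)    # the missing case is now complete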
Table 4.8 [IBM SPSS REGRESSION Outlier Statistics output]

Outlier Statistics

        Case Number    Mahal. Distance Statistic
 1      117            21.837
 2      193            20.650
 3      435            19.968
 4      99             18.499
 5      335            18.469
 6      292            17.518
 7      58             17.373
 8      71             17.172
 9      102            16.942
10      196            16.723

a. Dependent Variable: Subject number
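The Mahalanobis screening behind output like Table 4.8 can be reproduced directly. A minimal sketch, assuming NumPy and SciPy, with X a cases-by-variables array of the screening variables (simulated here; five variables gives the 20.515 criterion used below):

import numpy as np
from scipy import stats

def mahalanobis_outliers(X, alpha=0.001):
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # squared Mahalanobis distances
    crit = stats.chi2.ppf(1 - alpha, df=X.shape[1])      # chi-square criterion at p = .001
    return d2, np.flatnonzero(d2 > crit)                 # distances and flagged rows

rng = np.random.default_rng(5)
d2, flagged = mahalanobis_outliers(rng.normal(size=(463, 5)))
print(round(float(stats.chi2.ppf(0.999, 5)), 3))         # 20.515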
Any case with a Mahalanobis distance greater than χ² = 20.515 (cf. Appendix C, Table C.4), then, is a multivariate outlier. As shown in Table 4.8, cases 117 and 193 are outliers among these variables in this data set.

There are 461 cases remaining if the two multivariate outliers are deleted. Little is lost by deleting the additional two outliers from the sample. It is still necessary to determine why the two cases are multivariate outliers, to better understand how their deletion limits generalizability, and to include that information in the Results section.

4.2.1.5 VARIABLES CAUSING CASES TO BE OUTLIERS

IBM SPSS REGRESSION is used to identify the combination of variables on which case 117 (case number 137 as found in the data editor) and case 193 (case number 262) deviate from the remaining 462 cases. Each outlying case is evaluated in a separate IBM SPSS REGRESSION run after a dummy variable is created to separate the outlying case from the remaining cases. In Table 4.9, the dummy variable for case 137 is created in the COMPUTE instruction with dummy = 0 and if (subno = 137) dummy = 1. With the dummy variable as the DV and the remaining variables as IVs, you can find the variables that distinguish the outlier from the other cases. For the 117th case (subno = 137), RACE, ATTDRUG, and LTIMEDRS show up as significant predictors of the case (Table 4.9). Variables separating case 193 (subno = 262) from the other cases are RACE and LTIMEDRS, as seen in Table 4.10.

The final step in evaluating outlying cases is to determine how their scores on the variables that cause them to be outliers differ from the scores of the remaining sample. The IBM SPSS LIST and DESCRIPTIVES procedures are used, as shown in Table 4.11.
Table 4.9 IBM SPSS REGRESSION Syntax and Partial Output Showing Variables Causing the 117th Case to Be an Outlier

COMPUTE dummy = 0.
EXECUTE.
IF (subno=137) dummy = 1.
EXECUTE.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT dummy
  /METHOD=STEPWISE attdrug atthouse emplmnt mstatus race ltimedrs.

Regression

Coefficients

                                   Unstandardized        Standardized
                                   Coefficients          Coefficients
Model                              B        Std. Error   Beta      t        Sig.
1   (Constant)                     -.024    .008                   -2.881   .004
    race                           .024     .008         .149      3.241    .001
2   (Constant)                     .009     .016                   .577     .564
    race                           .025     .007         .151      3.304    .001
    Attitudes toward medication    -.004    .002         -.111     -2.419   .016
3   (Constant)                     .003     .016                   .169     .866
    race                           .026     .007         .159      3.481    .001
    Attitudes toward medication    -.005    .002         -.123     -2.681   .008
    ltimedrs                       .012     .005         .109      2.360    .019

a. Dependent Variable: dummy
Table 4.10 IBM SPSS REGRESSION Syntax and Partial Output Showing Variables Causing the 193rd Case to Be an Outlier

COMPUTE dummy = 0.
EXECUTE.
IF (subno=262) dummy = 1.
EXECUTE.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT dummy
  /METHOD=STEPWISE attdrug atthouse emplmnt mstatus race ltimedrs.

Regression

Coefficients

                                   Unstandardized        Standardized
                                   Coefficients          Coefficients
Model                              B        Std. Error   Beta      t        Sig.
1   (Constant)                     -.024    .008                   -2.881   .004
    race                           .024     .008         .149      3.241    .001
2   (Constant)                     -.036    .009                   -3.787   .000
    race                           .026     .007         .158      3.436    .001
    ltimedrs                       .014     .005         .121      2.634    .009

a. Dependent Variable: dummy
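The dummy-variable strategy of Tables 4.9 and 4.10 translates straightforwardly to other software. A hedged sketch with statsmodels (ordinary rather than stepwise regression, so large |t| values play the role of the stepwise entries; the file path, column names, and the LTIMEDRS construction are assumptions about your local copy of the data):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

screen = pd.read_spss("SCREEN.sav", convert_categoricals=False)
screen.columns = [c.lower() for c in screen.columns]
screen["ltimedrs"] = np.log10(screen["timedrs"] + 1)     # LG10(X + C) with C = 1
screen["dummy"] = (screen["subno"] == 137).astype(int)   # flag the 117th case

model = smf.ols(
    "dummy ~ attdrug + atthouse + emplmnt + mstatus + race + ltimedrs",
    data=screen,
).fit()
print(model.summary())   # variables with large |t| separate the case from the rest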
Table 4.11 Syntax and IBM SPSS Output Showing Variable Scores for Multivariate Outliers and Descriptive Statistics for All Cases

LIST VARIABLES=subno attdrug atthouse mstatus race ltimedrs
  /CASES FROM 117 TO 117.

subno   attdrug   atthouse   mstatus   race   ltimedrs
  137         5         24         2      2       1.49

LIST VARIABLES=subno attdrug atthouse mstatus race ltimedrs
  /CASES FROM 193 TO 193.

subno   attdrug   atthouse   mstatus   race   ltimedrs
  262         9         31         2      2       1.72

DESCRIPTIVES VARIABLES=subno attdrug atthouse mstatus race ltimedrs.
Descriptives

[Descriptive statistics for subject number, attitudes toward medication, attitudes toward housework, marital status, race, and LTIMEDRS across all cases.]
p > .05, supporting the conclusion of homogeneity of variance-covariance matrices.

7.6.1.6 HOMOGENEITY OF REGRESSION

Because Roy-Bargmann stepdown analysis is planned to assess the importance of DVs after MANOVA, a test of homogeneity of regression is necessary for each step of the stepdown analysis. Table 7.13 shows the IBM SPSS MANOVA syntax for tests of homogeneity of regression, where each DV, in turn, serves as DV on one step and then becomes a covariate on the next and all remaining steps (the SPLIT FILE instruction is first turned off).
Table 7.13 Test for Homogeneity of Regression for MANOVA Stepdown Analysis (Syntax and Selected Output for Last Two Tests from IBM SPSS MANOVA)

SPLIT FILE OFF.
MANOVA ESTEEM, ATTROLE, NEUROTIC, INTEXT, CONTROL, SEL2 BY FEM, MASC(1,2)
  /PRINT=SIGNIF(BRIEF)
  /ANALYSIS=ATTROLE
  /DESIGN=ESTEEM, FEM, MASC, FEM BY MASC, ESTEEM BY FEM,
    ESTEEM BY MASC, ESTEEM BY FEM BY MASC
  /ANALYSIS=NEUROTIC
  /DESIGN=ATTROLE, ESTEEM, FEM, MASC, FEM BY MASC,
    POOL(ATTROLE, ESTEEM) BY FEM + POOL(ATTROLE, ESTEEM) BY MASC
    + POOL(ATTROLE, ESTEEM) BY FEM BY MASC
  /ANALYSIS=INTEXT
  /DESIGN=NEUROTIC, ATTROLE, ESTEEM, FEM, MASC, FEM BY MASC,
    POOL(NEUROTIC, ATTROLE, ESTEEM) BY FEM + POOL(NEUROTIC, ATTROLE, ESTEEM) BY MASC
    + POOL(NEUROTIC, ATTROLE, ESTEEM) BY FEM BY MASC
  /ANALYSIS=CONTROL
  /DESIGN=INTEXT, NEUROTIC, ATTROLE, ESTEEM, FEM, MASC, FEM BY MASC,
    POOL(INTEXT, NEUROTIC, ATTROLE, ESTEEM) BY FEM
    + POOL(INTEXT, NEUROTIC, ATTROLE, ESTEEM) BY MASC
    + POOL(INTEXT, NEUROTIC, ATTROLE, ESTEEM) BY FEM BY MASC
  /ANALYSIS=SEL2
  /DESIGN=CONTROL, INTEXT, NEUROTIC, ATTROLE, ESTEEM, FEM, MASC, FEM BY MASC,
    POOL(CONTROL, INTEXT, NEUROTIC, ATTROLE, ESTEEM) BY FEM
    + POOL(CONTROL, INTEXT, NEUROTIC, ATTROLE, ESTEEM) BY MASC
    + POOL(CONTROL, INTEXT, NEUROTIC, ATTROLE, ESTEEM) BY FEM BY MASC.

Tests of Significance for CONTROL using UNIQUE sums of squares

Source of Variation                           SS      DF    MS      F       Sig of F
WITHIN+RESIDUAL                               442.61  348   1.27
INTEXT                                        2.19    1     2.19    1.72    .190
NEUROTIC                                      42.16   1     42.16   33.15   .000
ATTROLE                                       .67     1     .67     .52     .470
ESTEEM                                        14.52   1     14.52   11.42   .001
FEM                                           2.80    1     2.80    2.20    .139
MASC                                          3.02    1     3.02    2.38    .124
FEM BY MASC                                   .00     1     .00     .00     .995
POOL(INTEXT NEUROTIC ATTROLE ESTEEM) BY FEM
 + POOL(INTEXT NEUROTIC ATTROLE ESTEEM) BY
 MASC + POOL(INTEXT NEUROTIC ATTROLE ESTEEM)
 BY FEM BY MASC                               19.78   12    1.65    1.30    .219

(Continued)
Table 7.13 (Continued)
Tests of Significance for SEL2 using UNIQUE sums of squares

Source of Variation                           SS         DF    MS       F      Sig of F
WITHIN+RESIDUAL                               220340.10  344   640.52
CONTROL                                       1525.23    1     1525.23  2.38   .124
INTEXT                                        .99        1     .99      .00    .969
NEUROTIC                                      262.94     1     262.94   .41    .522
ATTROLE                                       182.98     1     182.98   .29    .593
ESTEEM                                        157.77     1     157.77   .25    .620
FEM                                           1069.23    1     1069.23  1.67   .197
MASC                                          37.34      1     37.34    .06    .809
FEM BY MASC                                   1530.73    1     1530.73  2.39   .123
POOL(CONTROL INTEXT NEUROTIC ATTROLE ESTEEM)
 BY FEM + POOL(CONTROL INTEXT NEUROTIC
 ATTROLE ESTEEM) BY MASC + POOL(CONTROL
 INTEXT NEUROTIC ATTROLE ESTEEM)
 BY FEM BY MASC                               14017.22   15    934.48   1.46   .118
Table 7.13 also contains output for the last two steps, where CONTROL serves as DV with ESTEEM, ATTROLE, NEUROTIC, and INTEXT as covariates, and then SEL2 is the DV with ESTEEM, ATTROLE, NEUROTIC, INTEXT, and CONTROL as covariates. At each step, the relevant effect is the one appearing last in the column labeled Source of Variation, so that for SEL2 the F value for homogeneity of regression is F(15, 344) = 1.46, p > .01. (The more stringent cutoff is used here because robustness is expected.) Homogeneity of regression is established for all steps.

For MANCOVA, an overall test of homogeneity of regression is required, in addition to stepdown tests. Syntax for all tests is shown in Table 7.14. The ANALYSIS sentence with three DVs specifies the overall test, while the ANALYSIS sentences with one DV each are for stepdown analysis. Output for the overall test and the last stepdown test is also shown in Table 7.14. Multivariate output is printed for the overall test because there are three DVs; univariate results are given for the stepdown tests. All runs show sufficient homogeneity of regression for this analysis.
Table 7.14 Tests of Homogeneity of Regression for MANCOVA and Stepdown Analysis (Syntax and Partial Output for Overall Tests and Last Stepdown Test from IBM SPSS MANOVA)

MANOVA ESTEEM, ATTROLE, NEUROTIC, INTEXT, CONTROL, SEL2 BY FEM MASC(1,2)
  /PRINT=SIGNIF(BRIEF)
  /ANALYSIS=ESTEEM, INTEXT, NEUROTIC
  /DESIGN=CONTROL, ATTROLE, SEL2, FEM, MASC, FEM BY MASC,
    POOL(CONTROL, ATTROLE, SEL2) BY FEM + POOL(CONTROL, ATTROLE, SEL2) BY MASC
    + POOL(CONTROL, ATTROLE, SEL2) BY FEM BY MASC
  /ANALYSIS=ESTEEM
  /DESIGN=CONTROL, ATTROLE, SEL2, FEM, MASC, FEM BY MASC,
    POOL(CONTROL, ATTROLE, SEL2) BY FEM + POOL(CONTROL, ATTROLE, SEL2) BY MASC
    + POOL(CONTROL, ATTROLE, SEL2) BY FEM BY MASC
  /ANALYSIS=INTEXT
  /DESIGN=ESTEEM, CONTROL, ATTROLE, SEL2, FEM, MASC, FEM BY MASC,
    POOL(ESTEEM, CONTROL, ATTROLE, SEL2) BY FEM
    + POOL(ESTEEM, CONTROL, ATTROLE, SEL2) BY MASC
    + POOL(ESTEEM, CONTROL, ATTROLE, SEL2) BY FEM BY MASC
  /ANALYSIS=NEUROTIC
  /DESIGN=INTEXT, ESTEEM, CONTROL, ATTROLE, SEL2, FEM, MASC, FEM BY MASC,
    POOL(INTEXT, ESTEEM, CONTROL, ATTROLE, SEL2) BY FEM
    + POOL(INTEXT, ESTEEM, CONTROL, ATTROLE, SEL2) BY MASC
    + POOL(INTEXT, ESTEEM, CONTROL, ATTROLE, SEL2) BY FEM BY MASC.
Table 7.14 (Continued)

* * * * * * Analysis of Variance - design 1 * * * * * *

Multivariate Tests of Significance
Tests using UNIQUE Sums of Squares and WITHIN+RESIDUAL error term

Source of Variation                            Wilks   Approx F   Hyp. DF   Error DF   Sig of F
CONTROL                                        .814    26.656     3.00      350.000    .000
ATTROLE                                        .973    3.221      3.00      350.000    .023
SEL2                                           .999    .105       3.00      350.000    .957
FEM                                            .992    .949       3.00      350.000    .417
MASC                                           .993    .824       3.00      350.000    .481
FEM BY MASC                                    .988    1.414      3.00      350.000    .238
POOL(CONTROL ATTROLE SEL2) BY FEM +
  POOL(CONTROL ATTROLE SEL2) BY MASC +
  POOL(CONTROL ATTROLE SEL2) BY FEM BY MASC    .933    .911       27.00     1022.823   .596

* * * * * * Analysis of Variance - design 1 * * * * * *

Tests of Significance for NEUROTIC using UNIQUE sums of squares

Source of Variation                              SS        DF    MS       F       Sig of F
WITHIN+RESIDUAL                                  6662.67   344   19.37
INTEXT                                           82.19     1     82.19    4.24    .040
ESTEEM                                           308.32    1     308.32   15.92   .000
CONTROL                                          699.04    1     699.04   36.09   .000
ATTROLE                                          1.18      1     1.18     .06     .805
SEL2                                             2.24      1     2.24     .12     .734
FEM                                              .07       1     .07      .00     .952
MASC                                             74.66     1     74.66    3.85    .050
FEM BY MASC                                      1.65      1     1.65     .09     .770
POOL(INTEXT ESTEEM CONTROL ATTROLE SEL2) BY FEM +
  POOL(INTEXT ESTEEM CONTROL ATTROLE SEL2) BY MASC +
  POOL(INTEXT ESTEEM CONTROL ATTROLE SEL2)
  BY FEM BY MASC                                 420.19    15    28.01    1.45    .124
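As an arithmetic cross-check (an editorial sketch, not part of the IBM SPSS output), each F in the stepdown table above is simply an effect mean square divided by the WITHIN+RESIDUAL mean square:

ms_error = 6662.67 / 344             # WITHIN+RESIDUAL mean square, 19.37
ms_pool = 420.19 / 15                # pooled interaction mean square, 28.01
print(round(ms_pool / ms_error, 2))  # 1.45 -> the tabled F(15, 344)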
7.6.1.7 RELIABILITY OF COVARIATES For the stepdown analysis in MANOVA, all DVs except ESTEEM must be reliable because all act as covariates. Based on the nature of scale development and data collection procedures, there is no reason to expect unreliability of a magnitude harmful to covariance analysis for ATTROLE, NEUROTIC, INTEXT, CONTROL, and SEL2. These same variables act as true or stepdown covariates in the MANCOVA analysis.

7.6.1.8 MULTICOLLINEARITY AND SINGULARITY The log-determinant of the pooled within-cells correlation matrix is found (through IBM SPSS MANOVA syntax in Table 7.15) to be -.4336, yielding a determinant of .65. This is sufficiently different from zero that multicollinearity is not judged to be a problem.
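A minimal sketch of that conversion, assuming the log-determinant reported by IBM SPSS MANOVA is a natural logarithm:

import math

log_det = -0.4336                   # log-determinant of pooled within-cells correlation matrix
print(round(math.exp(log_det), 2))  # 0.65, sufficiently far from zero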
7.6.2 Multivariate Analysis of Variance

Syntax and partial output of omnibus MANOVA produced by IBM SPSS MANOVA appear in Table 7.15. The order of IVs listed in the MANOVA statement together with METHOD=SEQUENTIAL sets up the priority for testing FEM before MASC in this unequal-n design. Results are reported for FEM BY MASC, MASC, and FEM, in turn. Tests are reported out in order of adjustment, where FEM BY MASC is adjusted for both MASC and FEM, and MASC is adjusted for FEM.
Table 7.15 Multivariate Analysis of Variance of Composite of DVs (ESTEEM, CONTROL, ATTROLE, SEL2, INTEXT, and NEUROTIC), as a Function of (Top to Bottom) FEMININITY by MASCULINITY Interaction, MASCULINITY, and FEMININITY (Syntax and Selected Output from IBM SPSS MANOVA)
MANOVA ESTEEM, ATTROLE, NEUROTIC, INTEXT, CONTROL, SEL2 BY FEM, MASC(1,2)
  /PRINT=SIGNI...

[The remainder of Table 7.15 and several intervening pages are not reproduced in the source; the Results excerpt resumes below.]

There was a modest association between DVs and covariates, partial η² = .10 with confidence limits from .06 to .29. A somewhat larger association was found between combined DVs and the main effect of masculinity, partial η² = .15 with confidence limits from .08 to .21, but the association between the main effect of femininity and the combined DVs was smaller, partial η² = .04 with confidence limits from .00 to .08. Effect size for the nonsignificant interaction was .00 with confidence limits from .00 to .01. [F is from Table 7.25; partial η² and their confidence limits are found through Smithson's NoncF3.sps for main effects, interaction, and covariates.]

To investigate more specifically the power of the covariates to adjust dependent variables, multiple regressions were run for each DV in turn, with covariates acting as multiple predictors. Two of the three covariates, locus of control and attitudes toward women's role, provided significant adjustment to self-esteem. The B value of .98 (confidence interval from .71 to 1.25) for locus of control was significantly different from zero, t(361) = 7.21, p < .001, as was the B value of .09 (confidence interval from .03 to .14) for attitudes toward women's role, t(361) = 3.21, p < .01. None of the covariates provided adjustment to the introversion-extraversion scale. For neuroticism, only locus of control reached statistical significance, with B = 1.53 (confidence interval from 1.16 to 1.91), t(361) = 8.04, p < .001.
For none of the DVs did socioeconomic status provide significant adjustment. [Information about relationships for individual DVs and CVs is from Table 7.26.]

Effects of masculinity and femininity on the DVs after adjustment for covariates were investigated in univariate and Roy-Bargmann stepdown analysis, in which self-esteem was given the highest priority, introversion-extraversion second priority (so that adjustment was made for self-esteem as well as for the three covariates), and neuroticism third priority (so that adjustment was made for self-esteem and introversion-extraversion as well as for the three covariates). Homogeneity of regression was satisfactory for this analysis, and DVs were judged to be sufficiently reliable to act as covariates. Results of this analysis are summarized in Table 7.29. An experimentwise error rate of 5% for each effect was achieved by apportioning alpha according to the values shown in the last column of the table.

After adjusting for differences on the covariates, self-esteem made a significant contribution to the composite of the DVs that best distinguished between women who were high or low in femininity, stepdown F(1, 361) = 9.99, p < .01, partial η² = .03 with confidence limits from .00 to .08. With self-esteem scored inversely, women with higher femininity scores showed greater self-esteem after adjustment for covariates (M_adjusted = 15.35, SE = 0.22) than those scoring lower on femininity (M_adjusted = 16.72, SE = 0.34). Univariate analysis revealed that a statistically significant difference was also present on the introversion-extraversion measure, with higher-femininity women more extraverted, univariate F(1, 361) = 5.83, a difference already accounted for by covariates and the higher-priority DV. [Adjusted means are from Tables 7.30 and 7.31; partial η² and confidence limits are from Table 7.32.]

Lower- versus higher-masculinity women differed in self-esteem, the highest-priority DV, after adjustment for covariates, stepdown F(1, 361) = 49.60, p < .01, partial η² = .12 with confidence limits from .06 to .20. Greater self-esteem was found
among higher-masculinity women (M_adjusted = 14.22, SE = 0.32) than among lower-masculinity women (M_adjusted = 16.92, SE = 0.23). The measure of introversion and extraversion, adjusted for covariates and self-esteem, was also related to differences in masculinity, stepdown F(1, 360) = 10.93, p < .01.
Any case with a Mahalanobis distance greater than χ²(4) = 18.467 (the critical value at α = .001) is an outlier. Translating this critical value to leverage h_ii for the first group using the variation on Equation 4.3:

h_ii = Mahalanobis distance/(N - 1) + 1/N = 18.467/240 + 1/241 = .081
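A minimal sketch of this conversion from a critical Mahalanobis distance to a leverage cutoff:

def leverage_cutoff(critical_d2: float, n: int) -> float:
    """Variation on Equation 4.3: h = D^2/(N - 1) + 1/N for a group of N cases."""
    return critical_d2 / (n - 1) + 1 / n

# chi-square critical value with 4 df at alpha = .001 is 18.467; first group N = 241
print(round(leverage_cutoff(18.467, 241), 3))  # 0.081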
In Table 9.5, CASESEQ 346 (H = .0941) and CASESEQ 407 (H = .0898) are identified as outliers in the group of WORKING women. No additional outliers were found.

11. Alternative strategies for dealing with missing data are discussed in Chapter 4.
Table 9.5 Identification of Multivariate Outliers (SAS SORT and REG Syntax and Selected Portion of Output File from SAS REG)

proc sort data=Sasuser.Discrim;
  by WORKSTAT;
run;
proc reg data=Sasuser.Discrim;
  by WORKSTAT;
  model CASESEQ = CONTROL ATTMAR ATTROLE ATTHOUSE / selection=RSQUARE COLLIN;
  output out=SASUSER.DISC_OUT H=H;
run;
[Data listing from SASUSER.DISC_OUT for the WORKING group (cases 136-165), showing case sequence, current employment status, locus-of-control, attitude toward current marital status, attitudes toward role of women, attitudes toward housework, and leverage (H). CASESEQ 346 (H = .0941) and CASESEQ 407 (H = .0898) exceed the criterion of .081.]

SOURCE: Created with Base SAS 9.2 Software. Copyright 2008, SAS Institute Inc., Cary, NC, USA. All Rights Reserved. Reproduced with permission of SAS Institute Inc., Cary, NC.
The multivariate outliers are the same cases that have extreme univariate scores on ATTHOUSE. Because transformation is questionable for ATTHOUSE (where it seems unreasonable to transform the predictor for only two cases), it is decided to delete the outliers. Therefore, of the original 465 cases, 7 are lost due to missing values and 2 are both univariate and multivariate outliers, leaving a total of 456 cases for analysis.

9.7.1.5 HOMOGENEITY OF VARIANCE-COVARIANCE MATRICES A SAS DISCRIM run, Table 9.6, deletes the outliers in order to evaluate homogeneity of variance-covariance matrices:

Chi-Square   DF   Pr > ChiSq
50.753826    20   0.0002

Since the Chi-Square value is significant at the 0.001 level, the within covariance matrices will be used in the discriminant function.
Reference: Morrison, D.F. (1976) Multivariate Statistical Methods p251.
Table 9.7 SAS REG Output Showing Collinearity Information for All Groups Combined (Syntax Is in Table 9.5)

Collinearity Diagnostics
                       Condition   ------------- Proportion of Variation -------------
Number   Eigenvalue    Index       Intercept    CONTROL   ATTMAR    ATTROLE   ATTHOUSE
1        4.83193       1.00000     0.00036897   0.00116   0.00508   0.00102   0.00091481
2        0.10975       6.63518     0.00379      0.00761   0.94924   0.02108   0.00531
3        0.03169       12.34795    0.00188      0.25092   0.04175   0.42438   0.10031
4        0.02018       15.47559    0.00266      0.61843   0.00227   0.01676   0.57008
5        0.00645       27.37452    0.99129      0.12189   0.00166   0.53676   0.32339
Most output has been omitted here. The instruction to produce the test of homogeneity of variance-covariance matrices is pool=test. This test shows significant heterogeneity of variance-covariance matrices. The program uses separate matrices in the classification phase of discriminant analysis if pool=test is specified and the test shows significant heterogeneity.

9.7.1.6 MULTICOLLINEARITY AND SINGULARITY Because SAS DISCRIM, used for the major analysis, protects against multicollinearity through checks of tolerance, no formal evaluation is necessary (cf. Chapter 4). However, the SAS REG syntax of Table 9.5 that evaluates multivariate outliers also requests collinearity information, shown in Table 9.7. No problems with multicollinearity are noted.
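Condition indices like those in Table 9.7 follow directly from the eigenvalues; a minimal sketch:

import numpy as np

# Condition index_i = sqrt(largest eigenvalue / eigenvalue_i); values above
# about 30 are a common warning sign for multicollinearity.
eigenvalues = np.array([4.83193, 0.10975, 0.03169, 0.02018, 0.00645])
print(np.round(np.sqrt(eigenvalues[0] / eigenvalues), 2))
# [ 1.    6.64  12.35  15.47  27.37] -- matching Table 9.7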
9.7.2 Direct Discriminant Analysis

Direct DISCRIM is performed through SAS DISCRIM with the four attitudinal predictors all forced into the equation. The program instructions and some of the output appear in Table 9.8. Simple statistics are requested to provide predictor means, helpful in interpretation. The anova and manova instructions request univariate statistics on group differences separately for each of the predictors and a multivariate test for the difference among groups; pcorr requests the pooled within-groups correlation matrix, and crossvalidate requests jackknifed classification. The priors proportional instruction specifies prior probabilities proportional to sample sizes.
Table 9.8 Syntax and Partial Output from SAS DISCRIM Analysis of Four Attitudinal Variables

proc discrim data=SASUSER.DISCRIM simple anova manova pcorr can
    crossvalidate stdmean pool=test;
  class workstat;
  var CONTROL ATTMAR ATTROLE ATTHOUSE;
  priors proportional;
  where CASESEQ^=346 and CASESEQ^=407;
run;
The DISCRIM Procedure
Pooled Within-Class Correlation Coefficients / Pr > |r|
Variables: CONTROL, ATTMAR, ATTROLE, ATTHOUSE
[Correlation matrix values are not legible in the source.]

Dependent Variable: ATTHOUSE Attitudes toward housework
Contrast             DF   Contrast SS   Mean Square   F Value   Pr > F
UNHOUSE VS. OTHERS   1    60.74947570   60.74947570   4.09      0.0430

Dependent Variable: ATTROLE Attitudes toward role of women
Contrast             DF   Contrast SS   Mean Square   F Value   Pr > F
UNHOUSE VS. OTHERS   1    218.5434340   218.5434340   5.45      0.0201

Dependent Variable: ATTMAR Attitude toward current marital status
Contrast             DF   Contrast SS   Mean Square   F Value   Pr > F
UNHOUSE VS. OTHERS   1    615.1203307   615.1203307   9.66      0.0020

Dependent Variable: CONTROL Locus-of-control
Contrast             DF   Contrast SS   Mean Square   F Value   Pr > F
UNHOUSE VS. OTHERS   1    1.18893484    1.18893484    0.78      0.3789
Table 9.14 Effect Sizes and 98.75% Confidence Limits for Contrasts of Each Group with the Two Other Groups Pooled, for Each Predictor Adjusted for the Three Other Predictors

Columns: Contrast; Predictor (Adjusted for All Others): Attitude toward housework, Attitude toward role of women, Attitude toward marriage, Locus-of-control.
[The cell entries of Table 9.14 are not legible in the source.]
Convergence criterion (GCONV=1E-8) satisfied.

Score Test for the Proportional Odds Assumption
Chi-Square   DF   Pr > ChiSq
12.9147      9    0.1665

Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates
AIC         836.352          805.239
SC          851.940          832.519
-2 Log L    828.352          791.239

Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   37.1128      3    <.0001
[Garbled IBM SPSS output follows in the source, showing percentages of present and missing values by group; indicator variables with less than 1% missing are not displayed.]
Table 10.10 (Continued)

[Case-level data listing; the values are not legible in the source.]
[From the direct logistic regression of work status on the four attitudinal predictors (Table 10.13):]

Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   23.2361      4    0.0001
Score              22.7390      4    0.0001
Wald               21.7497      4    0.0002

Analysis of Maximum Likelihood Estimates
                            Standard   Wald
Parameter   DF   Estimate   Error      Chi-Square   Pr > ChiSq
Intercept   1    3.1964     0.9580     11.1321      0.0008
CONTROL     1    -0.0574    0.0781     0.5415       0.4618
ATTMAR      1    0.0162     0.0120     1.8190       0.1774
ATTROLE     1    -0.0681    0.0155     19.2971      <.0001
ATTHOUSE    1    -0.0282    0.0238     1.3996       0.2368

Association of Predicted Probabilities and Observed Responses
Percent Concordant   63.2    Somers' D   0.264
Percent Discordant   36.8    Gamma       0.264
Percent Tied         0.0     Tau-a       0.132
Pairs                53165   c           0.632
Table 10.13 (Continued)

Odds Ratio Estimates and Wald Confidence Intervals
Effect     Unit     Estimate   95% Confidence Limits
CONTROL    1.0000   0.944      0.810    1.100
ATTMAR     1.0000   1.016      0.993    1.041
ATTROLE    1.0000   0.934      0.906    0.963
ATTHOUSE   1.0000   0.972      0.928    1.019
Classification Table

         ---Correct---    --Incorrect--    ---------------Percentages---------------
Prob              Non-             Non-              Sensi-   Speci-   False   False
Level    Event    Event   Event    Event   Correct   tivity   ficity   POS     NEG
0.200    245      0       217      0       53.0      100.0    0.0      47.0    .
0.500    164      103     114      81      57.8      66.9     47.5     41.0    44.0

[The table runs from Prob Level 0.200 to 0.840 in steps of 0.020; only representative rows are reproduced here.]
A statistically significant coefficient is found only for attitude toward the proper role of women. Nonsignificant coefficients are produced for locus of control, attitude toward marital status, and attitude toward housework. The negative coefficient for attitude toward the role of women means that working women (coded 0: see Probability modeled is WORKSTAT = 0) have lower scores on the variable, indicating more liberal attitudes. Adjusted odds ratios are omitted here because they duplicate the nonadjusted ones. (Note that these findings are consistent with the results of the contrast of working women vs. the other groups in Table 9.11 of the discriminant analysis.) Somers' D indicates about 7% (.263² = .07) of shared variance between work status and the set of predictors. Using Soper's (2012) online calculator
software (cf. Figure 5.4), the 95% confidence interval ranges from .02 to .11. Thus, the gain in prediction is unimpressive.

SAS LOGISTIC uses jackknife classification in the Classification Table. At a Prob Level of 0.500, the correct classification rate is 57.8%. Sensitivity is the proportion of cases in the "response" category (working women, coded 0) correctly predicted. Specificity is the proportion of cases in the "reference" category (housewives) correctly predicted. Because of difficulties associated with the Wald test (cf. Section 10.6.1.2), an additional run is prudent to evaluate the predictors in the model. Another SAS LOGISTIC run (Table 10.14)
Table 10.14 Syntax and Selected SAS LOGISTIC Output for Model that Excludes ATTROLE

proc logistic data=SASUSER.LOGREGIN;
  model WORKSTAT = CONTROL ATTMAR ATTHOUSE;
run;
Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates
AIC         640.770          644.002
SC          644.906          660.544
-2 Log L    638.770          636.002
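These fit statistics can be recovered from -2 Log L; a minimal sketch for the intercept-and-covariates column (k estimated parameters, N cases):

import math

neg2LL = 636.002
k, n = 4, 462                              # intercept + 3 predictors; 462 cases
print(round(neg2LL + 2 * k, 3))            # AIC = 644.002
print(round(neg2LL + k * math.log(n), 3))  # SC = 660.544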
Table 10.15 Relative Weight Analysis for Importance of Predictors: Input and Selected Output

[Screen-capture input panel and output from the relative weight analysis software are not reproduced here.]

The model without attitude toward role of women (Table 10.14) differs from the full model of Table 10.13 at p < .01, reinforcing the finding of the Wald test that attitude toward women's role significantly enhances prediction of work status. Note also that Table 10.14 shows no statistically significant difference between the model with the three remaining predictors and the constant-only model, confirming that these predictors are unrelated to work status.

Table 10.15 illustrates a relative weight analysis for this example. The R² = .047 for the entire model was primarily due to prediction by ATTROLE, which contributed .042 to it. That is, 88.5% (rescaled relative weight) of the predictable variance in differences between working women and housewives was attributable to ATTROLE. Clearly, the only weight that was significantly different from zero was that for ATTROLE, which was overwhelmingly more important than any of the other predictors (relative weights ranging from 2% to 6.5%). Further, the last panel shows ATTROLE to be the only weight significantly different from CONTROL. There is no point in evaluating differences among the remaining predictors, considering their poor performance by all other criteria.
Table 10.16 summarizes the statistics for the predictors. Table 10.17 contains a checklist for direct logistic regression with a two-category outcome. A Results section follows that might be appropriate for submission to a journal.
Table 10.16 Logistic Regression Analysis of Work Status as a Function of Attitudinal Variables

                                                       95% Confidence Interval
                                                       for Odds Ratio
                                  Wald         Odds                       Relative
Variables                  B      Chi-Square   Ratio    Lower    Upper    Weight (%)
Locus of control           -0.06  0.54         0.94     0.81     1.10     2.31
Attitude toward marital
  status                   0.02   1.82         1.02     0.99     1.04     6.52
Attitude toward role of
  women                    -0.07  19.30        0.93     0.91     0.96     88.54
Attitude toward housework  -0.03  1.40         0.97     0.93     1.02     2.62
(Constant)                 3.20   11.13

Table 10.17
Checklist for Standard Logistic Regression with Dichotomous Outcome

1. Issues
   a. Ratio of cases to variables and missing data
   b. Adequacy of expected frequencies (if necessary)
   c. Outliers in the solution (if fit inadequate)
   d. Multicollinearity
   e. Linearity in the logit
2. Major analysis
   a. Evaluation of overall fit. If adequate:
      (1) Significance tests for each predictor
      (2) Parameter estimates
   b. Effect size for model
   c. Relative importance analysis
   d. Evaluation of models without predictors
3. Additional analyses
   a. Odds ratios
   b. Classification or prediction success table
Results

A direct logistic regression analysis was performed on work status as outcome and four attitudinal predictors: locus of control, attitude toward current marital status, attitude toward women's role, and attitude toward housework. Analysis was performed using SAS LOGISTIC Version 9.4. Twenty-two cases with missing values on continuous predictors were imputed using the EM algorithm through SPSS MVA after finding no statistically significant deviation from randomness using Little's MCAR test, p = .331. Data from 462 women were available for analysis: 217 housewives and 245 women who work outside the home more than 20 hours a week for pay.

A test of the full model with all four predictors against a constant-only model was statistically significant, χ²(4, N = 440) = 23.24, p < .001, indicating that the predictors, as a set, significantly distinguished between working women and housewives. The variance in work status accounted for is small, however, with Somers' D = .263, with a 95% confidence interval for the effect size of .07 ranging from .02 to .11 using Soper's (2016) online calculator. Classification was unimpressive, with 67% of the working women and 48% of the housewives correctly predicted, for an overall success rate of 58%.

Table 10.16 shows regression coefficients, Wald statistics, odds ratios, and 95% confidence intervals for odds ratios for each of the four predictors. According to the Wald criterion, only attitude toward role of women significantly predicted work status, χ²(1, N = 440) = 19.30, p < .001. A model run with attitude toward role of women omitted was not significantly different from a constant-only model; however, this model was significantly different from the full model, χ²(1, N = 440) = 20.47, p < .001. This confirmed the finding that attitude toward women's role was the only statistically significant predictor of work status among the four attitudinal variables. The rescaled relative weight of 88% underscored the importance of attitude toward women's role relative to the other three predictors in distinguishing the groups. However, the odds ratio of .93 showed little change in the odds of working on the basis of a one-unit change in attitude toward women's role. Thus, attitude toward the proper role of women in society distinguished between women who did and did not work outside the home at least 20 hours per week, but the distinction was not a very strong one.
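The odds ratios and confidence intervals in Table 10.16 can be reproduced directly from the B coefficients and standard errors of Table 10.13; a minimal sketch:

import math

# Odds ratio = exp(B); Wald 95% CI = exp(B +/- 1.96*SE).
predictors = {
    "CONTROL":  (-0.0574, 0.0781),
    "ATTMAR":   (0.0162, 0.0120),
    "ATTROLE":  (-0.0681, 0.0155),
    "ATTHOUSE": (-0.0282, 0.0238),
}
for name, (b, se) in predictors.items():
    lo, hi = math.exp(b - 1.96 * se), math.exp(b + 1.96 * se)
    print(f"{name}: OR = {math.exp(b):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
# e.g., ATTROLE: OR = 0.934, 95% CI [0.906, 0.963]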
10.7.3 Sequential Logistic Regression with Three Categories of Outcome

The sequential analysis is done through two major runs, one with and one without attitudinal predictors.

10.7.3.1 LIMITATIONS OF MULTINOMIAL LOGISTIC REGRESSION

10.7.3.1.1 Adequacy of Expected Frequencies Logistic regression is limited by expected cell frequencies only if a goodness-of-fit criterion is used to compare observed and expected frequencies. As discussed in Section 10.3.2.2, the expected frequencies for all pairs of discrete predictors must meet the usual "chi-square" requirements. (This requirement applies only to this section on sequential analysis because the predictors in the direct analysis are all continuous.)

After filtering out cases with missing data on RELIGION, Table 10.18 shows the results of an IBM SPSS CROSSTABS run to check the adequacy of expected frequencies for all pairs of discrete predictors. (Only the first seven tables are shown; the remaining three are omitted.) Observed (COUNT) and EXPECTED frequencies are requested. The last crosstabs in the table show one expected cell frequency under 5: 2.7 for single, nonwhite women. This is the only expected frequency that is less than 5, so in no two-way table do more than 20% of the cells have expected frequencies less than 5, nor are any expected frequencies less than 1. Therefore, there is no restriction on the goodness-of-fit criteria used to evaluate the model.
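A minimal sketch of the expected-frequency computation, using the counts from the first crosstab of Table 10.18 below:

import numpy as np

# Expected cell frequency = row total * column total / grand total.
observed = np.array([
    [24, 3, 4],       # single
    [168, 127, 64],   # married
    [53, 5, 14],      # broken
])
expected = (observed.sum(axis=1, keepdims=True)
            * observed.sum(axis=0, keepdims=True)) / observed.sum()
print(np.round(expected, 1))  # e.g., 5.5 for single role-satisfied housewives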
Table 10.18 Syntax and Partial Output of IBM SPSS CROSSTABS for Screening All Two-Way Tables for Adequacy of Expected Frequencies

USE ALL.
COMPUTE FILTER$=(RELIGION < 9).
FILTER BY FILTER$.
EXECUTE.
CROSSTABS
  /TABLES=MARITAL CHILDREN RELIGION RACE BY WORKSTAT
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT EXPECTED.
CROSSTABS
  /TABLES=CHILDREN RELIGION RACE BY MARITAL
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT EXPECTED.
Current marital status * Work Status Crosstabulation

                                          Work Status
                               Working   Role-dissatisfied   Role-satisfied   Total
                                         housewives          housewives
Single    Count                24        3                   4                31
          Expected Count       16.4      9.1                 5.5              31.0
Married   Count                168       127                 64               359
          Expected Count       190.4     104.9               63.7             359.0
Broken    Count                53        5                   14               72
          Expected Count       38.2      21.0                12.8             72.0
Total     Count                245       135                 82               462
          Expected Count       245.0     135.0               82.0             462.0

Presence of children * Work Status Crosstabulation

                                          Work Status
                               Working   Role-dissatisfied   Role-satisfied   Total
                                         housewives          housewives
No        Count                57        13                  12               82
          Expected Count       43.5      24.0                14.6             82.0
Yes       Count                188       122                 70               380
          Expected Count       201.5     111.0               67.4             380.0
Total     Count                245       135                 82               462
          Expected Count       245.0     135.0               82.0             462.0
Table 10.18 (Continued)

[Religious affiliation * Work Status crosstabulation (None-or-other, Catholic, Protestant, Jewish) and the remaining two-way tables; all expected cell frequencies exceed 5 except the 2.7 noted in the text.]
Note the distinction between the hazard function and the probability density function. The hazard function is the instantaneous rate of dropping out at a specific time (e.g., 1.8 months, the midpoint of the second interval), among cases who survived to at least the beginning of that time interval (e.g., 1.2 months). The probability density function is the probability of a given case dropping out at a specified time point.
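A minimal sketch of these life-table quantities using standard actuarial formulas (the interval counts are from the control group's second interval of the small-sample example; names are illustrative):

def life_table_interval(at_risk, events, withdrawn, width, cum_surv):
    """Return cumulative survival, probability density, and hazard for one interval."""
    exposed = at_risk - withdrawn / 2.0     # effective number exposed to risk
    q = events / exposed                    # proportion terminating (dropping out)
    new_cum = cum_surv * (1.0 - q)          # cumulative proportion surviving
    density = cum_surv * q / width          # probability density at interval midpoint
    hazard = 2.0 * q / (width * (2.0 - q))  # instantaneous dropout rate
    return new_cum, density, hazard

# Control group, second interval (1.2-2.4 months): 6 at risk, 2 drop out,
# cumulative proportion surviving at interval start = .8571.
print(life_table_interval(6, 2, 0, 1.2, 0.8571))  # (~0.5714, ~0.2381, ~0.3333)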
11.4.4 Plot of Life Tables

Life table plots are simply the cumulative proportion surviving (P_i), plotted as a function of each time interval. Referring to Table 11.2, for example, the first point plotted is the interval beginning at 0 months, and the cumulative proportion surviving is 1.0 for both groups. At the beginning of the second interval, 1.2 months, the cumulative proportion surviving is still 1.0 for dancers in the treatment group but is .8571 for dancers in the control group, and so on. Figure 11.1 shows the resulting survival function.
[Line plot of cumulative proportion surviving (0 to 1.0) against months (0 to 13.2), with separate curves for the treatment and control groups.]

Figure 11.1 Cumulative proportion of treatment and control dancers surviving.
11.4.5 Test for Group Differences

Group differences in survival are tested through χ² with degrees of freedom equal to the number of groups minus 1. Of the several tests that are available, the one demonstrated here is labeled Log-Rank in SAS LIFETEST and IBM SPSS KM. When there are only two groups, the overall test is

χ² = v_j² / V_j    (11.7)

χ² equals the squared value of the observed minus expected frequencies of number of survivors summed over all intervals for one of the groups (v_j) under the null hypothesis of no group differences, divided by V_j, the variance of the group. The degrees of freedom for this test are (number of groups - 1). When there are only two groups, the value in the numerator is the same for both groups, but opposite in sign. The value in the denominator is also the same for both groups. Therefore, computations for either group produce the same χ².

The value, v_j, of observed minus expected frequencies is calculated separately for each interval for one of the groups. For the control group of dancers (the group coded 0):

v_0 = Σ_{i=1}^{k} (d_0i - n_0i d_Ti / n_Ti)    (11.8)

The difference between observed and expected frequencies for a group, v_0, is the summed differences over the intervals between the number of survivors in each interval, d_0i, minus the ratio of the number of cases at risk in the interval (n_0i) times the total number of survivors in that interval summed over all groups (d_Ti), divided by the total number of cases at risk in that interval summed over all groups (n_Ti). For example, using the control group, in the first interval there are 6 survivors (out of a possible 7 at risk), and a total of 11 survivors (out of a possible 12 at risk) for the two groups combined. Therefore,

v_01 = 6 - 7(11)/12 = -0.417

For the second interval, there are 4 survivors (out of a possible 6 at risk) in the control group, and a total of 9 survivors (out of a possible 11 at risk) for the two groups combined. Therefore,

v_02 = 4 - 6(9)/11 = -0.909
and so on. The sum over all the 10 intervals for the control group is -2.854. (For the treated group it is 2.854.) The variance, V, for a group is

V_0 = Σ_{i=1}^{k} [(n_Ti n_0i - n_0i²) d_Ti s_Ti] / [n_Ti² (n_Ti - 1)]    (11.9)

The variance for the control group, V_0, is the sum over all intervals of the difference between the total number of survivors in an interval (n_Ti) times the number of survivors in the control group in the interval (n_0i) minus the squared number of survivors in the control group in the interval; this difference is multiplied by the product of the total number of survivors in the interval (d_Ti) times s_Ti (= n_Ti - d_Ti); all of this is divided by the squared total number of survivors in the interval (n_Ti²) times the total number of survivors in the interval minus 1. In jargon, the total number of cases that have survived to an interval (n_Ti) is called the risk set.

The variance for the control group, V_0, for the first interval is

V_01 = [(12·5 - 25)11(1)] / [144(12 - 1)] = 385/1584 = 0.24306

and for the second interval:

V_02 = [(11·5 - 25)9(2)] / [121(11 - 1)] = 540/1210 = 0.4462
and so on. The value, summed over all 10 intervals, is 2.1736. (This also is the value for the second group when there are only two groups.) Using these values in Equation 11.7:

χ² = (-2.854)² / 2.1736 = 3.747

Table C.4 shows that the critical value of χ² with 1 df at α = .05 is 3.84. Therefore, the groups are not significantly different by the log-rank test. Matrix equations are more convenient to use if there are more than two groups. (The matrix procedure is not at all convenient to use if there are only two groups because the procedure requires inversion of a singular variance-covariance matrix.)
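The computations of Equations 11.7 through 11.9 script easily; a minimal sketch using the first two intervals (the remaining intervals follow the same pattern):

# Per interval: (n0, d0, nT, dT) = control at risk, control survivors,
# total at risk, total survivors.
intervals = [(7, 6, 12, 11), (6, 4, 11, 9)]   # first two intervals only

v = sum(d0 - n0 * dT / nT for n0, d0, nT, dT in intervals)            # Eq. 11.8
V = sum(n0 * (nT - n0) * dT * (nT - dT) / (nT**2 * (nT - 1))
        for n0, d0, nT, dT in intervals)                              # Eq. 11.9
print(round(v, 3), round(V, 4))  # -1.326 and 0.6893 so far

# Summed over all 10 intervals, v = -2.854 and V = 2.1736, giving Eq. 11.7:
print(round((-2.854) ** 2 / 2.1736, 3))  # 3.747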
11.4.6 Computer Analyses of Small-Sample Example

Tables 11.3 and 11.4 show syntax and selected output for computer analyses of the data in Table 11.1 for IBM SPSS SURVIVAL and SAS LIFETEST, respectively. Syntax and output from IBM SPSS SURVIVAL are in Table 11.3. The DV (MONTHS), and the grouping variable (TREATMNT) and its levels, are shown in the TABLE instruction. The dropout variable and the level indicating dropout are indicated by the STATUS=DANCING(1) instruction as per Table 11.1. IBM SPSS requires explicit instruction as to the setup of time intervals and a request for a survival (or hazard) plot. The COMPARE instruction requests a test of equality of survival functions for the two groups.

Table 11.3 Syntax and Output for Small-Sample Example Through IBM SPSS SURVIVAL

SURVIVAL TABLE=MONTHS BY TREATMNT(0 1)
  /INTERVAL=THRU 12 BY 1.2
  /STATUS=DANCING(1)
  /PRINT=TABLE
  /PLOTS (SURVIVAL)=MONTHS BY TREATMNT
  /COMPARE=MONTHS BY TREATMNT.
[Life tables for the control and treated groups: for each 1.2-month interval, the number entering, number withdrawn, number of terminal events, and the proportions terminating and surviving, with cumulative proportion surviving, probability density, and hazard rate.]

(continued)
Table 11.3
(Continued)
Median Survival Time

First-order Controls: Treatment   Med Time
control                           3.0000
treated                           9.9000
First-order Control: TREATMNT

[Survival function plot: cumulative proportion surviving for the control and treated groups.]

(continued)
Table 11.5 (Continued)

[Kaplan-Meier survival table showing, for each observed time, the cumulative proportion surviving at the time (estimate), cumulative events, and number remaining, followed by a plot of cumulative survival against time since beginning to dance (0 to 12 months).]
11.5.2 Prediction of Group Survival Times from Covariates

Prediction of survival (or failure) time from covariates is similar to logistic regression (Chapter 10) but with provision for censored data. This method also differs in analyzing the time between events rather than predicting the occurrence of events. Cox proportional hazards (Cox regression) is the most popular method. Accelerated failure-time models are also available for the more sophisticated user. As in other forms of regression (cf. Chapter 5), analysis of survival can be direct, sequential, or statistical. A treatment IV, if present, is analyzed the same as any other discrete covariate. When there are only two levels of treatment, the treated group is usually coded 1 and the control group 0. If there are more than two levels of treatment, dummy variable coding is used to represent group membership, as described in Section 5.2. Successful prediction on the basis of this IV indicates significant treatment effects.

11.5.2.1 DIRECT, SEQUENTIAL, AND STATISTICAL ANALYSIS The three major analytic strategies in survival analysis with covariates are direct (standard), sequential (hierarchical), and statistical (stepwise or setwise). Differences among the strategies involve what happens to overlapping variability due to correlated covariates (including treatment groups) and who determines the order of entry of covariates into the equation.

In the direct, or simultaneous, model all covariates enter the regression equation at one time and each is assessed as if it entered last. Therefore, each covariate is evaluated as to what it adds to prediction of survival time that is different from the prediction afforded by all the other covariates.

In the sequential (sometimes called hierarchical) model, covariates enter the equation in an order specified by the researcher. Each covariate is assessed by what it adds to the equation at its own point of entry. Covariates are entered one at a time or in blocks. The analysis proceeds in steps, with information about the covariates in the equation given at each step. A typical strategy in survival analysis with an experimental IV is to enter all the nontreatment covariates at the first step, and then enter the covariate(s) representing the treatment variable at the second step. Output after the second step indicates the importance of the treatment variable to prediction of survival after statistical adjustment for the effects of other covariates.

Statistical regression (sometimes generically called stepwise regression) is a controversial procedure in which order of entry of variables is based solely on statistical criteria. The meaning of the variables is not relevant. Decisions about which variables are included in the equation are based solely on statistics computed from the particular sample drawn; minor differences in these statistics can have a profound effect on the apparent importance of a covariate, including the one representing treatment groups. The procedure is typically used during early stages of research, when nontreatment covariates are being assessed for their relationship with survival. Covariates which contribute little to prediction are then dropped from subsequent research into the effects of treatment.
As with logistic regression, data-driven strategies are especially dangerous when important decisions are based on results that may not generalize beyond the sample chosen. Cross-validation is crucial if statistical/stepwise techniques are used for any but the most preliminary investigations. Both of the reviewed programs provide direct analysis. IBM SPSS COXREG also provides both sequential and stepwise analysis. SAS LIFEREG provides only direct analysis, but SAS PHREG provides direct, sequential, stepwise, and setwise analysis (in which models including all possible combinations of covariates are evaluated).

11.5.2.2 COX PROPORTIONAL-HAZARDS MODEL This method models event (failure, death) rates as a log-linear function of predictors, called covariates. Regression coefficients give the relative effect of each covariate on the survivor function. Cox modeling is available through IBM SPSS COXREG and SAS PHREG. Table 11.6 shows the results of a direct Cox regression analysis through SAS PHREG using the small-sample data of Table 11.1. Treatment (a dichotomous variable) and age are considered
Table 11.6 Syntax and Output for Direct Cox Regression Analysis Through SAS PHREG

proc phreg data=SASUSER.SURVIVE;
  model MONTHS*DANCING(0) = AGE TREATMNT;
run;

The PHREG Procedure

Model Information
Data Set             SASUSER.SURVIVE
Dependent Variable   MONTHS
Censoring Variable   DANCING
Censoring Value(s)   0
Ties Handling        BRESLOW

Number of Observations Read   12
Number of Observations Used   12

Summary of the Number of Event and Censored Values
Total   Event   Censored   Percent Censored
12      11      1          8.33

Model Fit Statistics
Criterion   Without Covariates   With Covariates
-2 LOG L    40.740               21.417
AIC         40.740               25.417
SBC         40.740               26.213

Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   19.3233      2    <.0001
Score              14.7061      2    0.0006
Wald               6.6154       2    0.0366

Analysis of Maximum Likelihood Estimates
                 Parameter   Standard                             Hazard
Parameter   DF   Estimate    Error      Chi-Square   Pr > ChiSq   Ratio
AGE         1    -0.22989    0.08945    6.6047       0.0102       0.795
TREATMNT    1    -2.54137    1.54632    2.7011       0.1003       0.079
covariates for purposes of this analysis. This analysis assumes proportionality of hazard (that the shapes of survivor functions are the same for all levels of treatment). Section 11.6.1 shows how to test the assumption and evaluate group differences if it is violated. The time variable (months) and the response variable (dancing, with 0 indicating censored data) are identified in a model instruction and are predicted by two covariates: age and treatment. Model Fit Statistics are useful for comparing models, as in logistic regression analysis (Chapter 10). Three tests are available to evaluate the hypothesis that all regression coefficients are zero. For example, the Likelihood Ratio test indicates that the combination of age and treatment significantly predicts survival time, χ²(2, N = 12) = 19.32, p < .0001. (Note that this is a large-sample test and cannot be taken too seriously with only 12 cases.) Significance tests
for the individual predictors also are shown as Chi-Square tests. Thus, age significantly predicts survival time, after adjusting for differences in treatment, χ²(1, N = 12) = 6.60, p = .01. However, treatment does not predict survival time, after adjusting for differences in age, χ²(1, N = 12) = 2.70, p = .10. Thus, this analysis shows no significant treatment effect. Regression coefficients (Parameter Estimate) for significant effects and Hazard Ratio may be interpreted as per Section 11.6.5.

Sequential Cox regression analysis through IBM SPSS COXREG is shown in Table 11.7. Age enters the prediction equation first, followed by treatment. Block 0 shows the model fit, -2 Log Likelihood, corresponding to -2 Log L in SAS, useful for comparing models (cf. Section 10.6.1.1). Step one (Block 1) shows a significant effect of age alone, by both the Wald test (the squared z test with the coefficient divided by its standard error, p = .006) and the likelihood ratio test [χ²(1, N = 12) = 15.345, p < .001]. However, the results for treatment differ for the two criteria. The Wald test gives the same result as reported for the direct analysis above, in which treatment is adjusted for differences in age and is not statistically significant. The likelihood ratio test, on the other hand, shows a significant change with the addition of treatment as a predictor, χ²(1, N = 12) = 3.98, p < .05. With a sample size this small, it is probably safer to rely on the Wald test indicating no statistically significant treatment effect.

11.5.2.3 ACCELERATED FAILURE-TIME MODELS These models replace the general hazard function of the Cox model with a specific distribution (exponential, normal, or some other). However, greater user sophistication is required to choose the distribution. Accelerated
Table 11.7 Syntax and Output for Sequential Cox Regression Analysis Through IBM SPSS COXREG

COXREG months
  /STATUS=dancing(1)
  /CONTRAST (treatmnt)=Indicator
  /METHOD=ENTER age
  /METHOD=ENTER treatmnt
  /CRITERIA=PIN(.05) POUT(.10) ITERATE(20).

Cox Regression

Case Processing Summary
                                                             N    Percent
Cases available in analysis   Event(a)                       11   91.7%
                              Censored                       1    8.3%
                              Total                          12   100.0%
Cases dropped                 Cases with missing values      0    0.0%
                              Cases with negative time       0    0.0%
                              Censored cases before the
                                earliest event in a stratum  0    0.0%
                              Total                          0    0.0%
Total                                                        12   100.0%
a. Dependent Variable: Time since beginning to dance

Categorical Variable Codings(a)
                            Frequency   (1)(c)
Treatment(b)  .00=control   7           1
              1.00=treated  5           0
a. Category variable: Treatment (TREATMNT)
b. Indicator Parameter Coding
c. The (0,1) variable has been recoded, so its coefficients will not be the same as for indicator (0,1) coding.

(continued)
Table 11.7 (Continued)

Block 0: Beginning Block
Omnibus Tests of Model Coefficients
-2 Log Likelihood   40.740

Block 1: Method = Enter
Omnibus Tests of Model Coefficients(a)
-2 Log Likelihood: 25.395
Overall (score): Chi-square 11.185, df 1, Sig. .001
Change From Previous Step: Chi-square 15.345, df 1, Sig. .000
Change From Previous Block: Chi-square 15.345, df 1, Sig. .000
a. Beginning Block Number 1. Method: Enter

Variables in the Equation
       B      SE     Wald    df   Sig.   Exp(B)
AGE    -.199  .072   7.640   1    .006   .819

Variables not in the Equation(a)
            Score   Sig.
Treatment   3.477   .062
a. Residual Chi Square = 3.477 with 1 df, Sig. = .062

Block 2: Method = Enter
Omnibus Tests of Model Coefficients(a)
-2 Log Likelihood: 21.417
Overall (score): Chi-square 14.706, df 2, Sig. .001
Change From Previous Step: Chi-square 3.978, df 1, Sig. .046
Change From Previous Block: Chi-square 3.978, df 1, Sig. .046
a. Beginning Block Number 2. Method: Enter

Variables in the Equation
            B       SE      Wald    df   Sig.   Exp(B)
AGE         -.230   .089    6.605   1    .010   .795
Treatment   2.542   1.546   2.701   1    .100   12.699

Covariate Means
            Mean
AGE         31.500
Treatment   .583
failure-time models are handled by SAS LIFEREG. IBM SPSS has no program for accelerated failure-time modeling. Table 11.8 shows an accelerated failure-time analysis corresponding to the Cox model of Section 11.5.2.2 through SAS LIFEREG, using the default Weibull distribution. It is clear in this analysis that both age and treatment significantly predict survival (Pr > ChiSq is less than .05), leading to the conclusion that treatment significantly affects survival in belly dance classes, after adjusting for differences in age at which instruction begins. The Type III Analysis of Effects is useful only when a categorical predictor has more than 1 df.
Table 11.8 Syntax and Output for Accelerated Failure-Time Model Through SAS LIFEREG

proc lifereg data=SASUSER.SURVIVE;
  model MONTHS*DANCING(0) = AGE TREATMNT;
run;

The LIFEREG Procedure

Model Information
Data Set                   SASUSER.SURVIVE
Dependent Variable         Log(MONTHS)
Censoring Variable         DANCING
Censoring Value(s)         0
Number of Observations     12
Noncensored Values         11
Right Censored Values      1
Left Censored Values       0
Interval Censored Values   0
Number of Parameters       4
Name of Distribution       Weibull
Log Likelihood             -4.960832804

Number of Observations Read   12
Number of Observations Used   12

Algorithm converged.

Type III Analysis of Effects
                   Wald
Effect   DF   Chi-Square   Pr > ChiSq
AGE      1    15.3318      <.0001

Analysis of Maximum Likelihood Parameter Estimates
                            Standard   95% Confidence
Parameter   DF   Estimate   Error      Limits             Chi-Square   Pr > ChiSq
Intercept   1    0.4355     0.2849     -0.1229   0.9939   2.34         0.1264
AGE         1    0.0358     0.0091     0.0179    0.0537   15.33        <.0001

[The TREATMNT and scale-parameter rows are truncated in the source.]
[Output from a Cox model that adds time-covariate interactions (MONTHAGE = MONTHS by AGE, MONTHTRT = MONTHS by TREATMNT) to evaluate proportionality of hazards; neither interaction approaches significance, supporting the assumption:]

Analysis of Maximum Likelihood Estimates
                         Standard                             Hazard
Parameter    Estimate    Error      Chi-Square   Pr > ChiSq   Ratio
AGE          -0.22563    0.21790    1.0722       0.3004       0.798
TREATMNT     -3.54911    8.57639    0.1713       0.6790       0.029
MONTHAGE     -0.0001975  0.12658    0.0000       0.9988       1.000
MONTHTRT     0.47902     3.97015    0.0146       0.9040       1.614
11.6.2.1 RIGHT-CENSORED DATA Right-censoring is the most common form of censoring and occurs when the event being studied has not occurred by the time data collection ceases. When the term "censoring" is used generically in some texts and computer programs, it refers to right-censoring. Sometimes right-censoring is under the control of the researcher. For example, the researcher decides to monitor cases until some predetermined number has failed, or until every case has been followed for three years. Cases are censored, then, because the researcher terminates data collection before the event occurs for some cases. Other times the researcher has no control over right-censoring. For example, a case might be lost because a participant refuses to continue to the end of the study or dies for some reason other than the disease under study. Or, survival time may be unknown because the entry time of a case is not under the control of the researcher. For example, cases are monitored until a predetermined time, but the time of entry into the study (e.g., the time of surgery) varies randomly among cases, so that total survival time is unknown. That is, all you know about the time of occurrence of an event (failure, recovery) is that it occurred after some particular time; that is, it is greater than some value (Allison, 2010).

Most methods of survival analysis do not distinguish among types of right-censoring, but cases that are lost from the study may pose problems because it is assumed that there are no systematic differences between them and the cases that remain (Section 11.3.2.4). This assumption is likely to be violated when cases voluntarily leave the study. For example, students who drop out of a graduate program are unlikely to have graduated (had they stayed) as soon as students who continued. Instead, those who drop out are probably among those who would have taken longer to graduate. About the only solution to the problem is to try to include covariates that are related to this form of censoring. All of the programs reviewed here deal with right-censored data, but none distinguishes among the various types of right-censoring. Therefore, results are misleading if assumptions about censoring are violated.

11.6.2.2 OTHER FORMS OF CENSORING A case is left-censored if the event of interest occurred before an observed time, so that you know only that survival time is less than the total observation time. Left-censoring is unlikely to occur in an experiment, because random assignment to conditions is normally made only for intact cases. However, left-censoring can occur in a nonexperimental study. For example, if you are studying the failure time of a component, some components may have failed before you start your observations, so you don't know their total survival time. With interval censoring, you know the interval within which the event occurred, but not the exact time within the interval. Interval censoring is likely to occur when events are monitored infrequently. Allison (2010) provides as an example annual testing for HIV infection, in which a person who tested negative at year 2 but positive at year 3 is interval-censored between 2 and 3. SAS LIFEREG handles right-, left-, and interval-censoring by requiring two time variables, upper time and lower time. For right-censored cases, the upper time value is missing; for left-censored cases, the lower time value is missing. Interval-censoring is indicated by different values for the two variables.
11.6.3 Effect Size and Power

Cox and Snell (1989) provide a measure of effect size for logistic regression that is demonstrated for survival analysis by Allison (2010, p. 282). It is based on G², a likelihood-ratio chi-square statistic (Section 11.6.4.2), that can be calculated from SAS PHREG and LIFEREG and IBM SPSS COXREG. Models are fit both with and without covariates, and a difference G² is found by:

G² = [(-2 log-likelihood for smaller model) - (-2 log-likelihood for larger model)]    (11.10)
Then, R² is found by

R² = 1 - e^(-G²/n)    (11.11)

When applied to experiments, the R² of greatest interest is the association between survival and treatment, after adjustment for other covariates. Therefore, the smaller model is the one that includes covariates but not treatment, and the larger model is the one that includes covariates and treatment. For the example of Table 11.7 (the sequential analysis in which treatment is adjusted for age), -2 log-likelihood with age alone is 25.395 and -2 log-likelihood with age and treatment is 21.417, so that G² = 25.395 - 21.417 = 3.978 for treatment. (Note that this value is also provided by IBM SPSS COXREG, as Change from Previous Block.) Applying Equation 11.11:

R² = 1 - e^(-3.978/12) = 1 - .7178 = .28

Soper's (2016) online software may be used to find confidence limits around this value (cf. Figure 5.4). The number of predictor variables is 2 (treatment and covariate), with N = 12. The software provides a 95% confidence limit ranging from .01 to .61. Allison (2010) points out that this R² is not the proportion of variance in survival that is explained by the covariates, but merely represents relative association between survival and the covariates tested, in this case treatment after adjustment for age.

Power in survival analysis is, as usual, enhanced by larger sample sizes and covariates with stronger effects. Amount of censoring and patterns of entry of cases into the study also affect power, as does the relative size of treatment groups. Unequal sample sizes reduce power while equal sample sizes increase it. Estimating sample sizes and power for survival analysis is not included in the software discussed in this book. Two stand-alone programs provide tests for a large number of survival procedures: NCSS PASS (Hintze, 2017) and nQuery Advisor 7 (Elashoff, 2007). There are also several calculators available online.
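A minimal sketch of Equations 11.10 and 11.11 with the Table 11.7 values:

import math

G2 = 25.395 - 21.417               # Eq. 11.10: age-only vs. age + treatment
R2 = 1.0 - math.exp(-G2 / 12)      # Eq. 11.11 with n = 12 cases
print(round(G2, 3), round(R2, 2))  # 3.978 0.28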
11.6.4 Statistical Criteria

Numerous statistical tests are available for evaluating group differences due to treatment effects from an actuarial life table or product-limit analysis, as discussed in Section 11.6.4.1. Tests for evaluating the relationships among survival time and various covariates (including treatment) are discussed in Section 11.6.4.2.

11.6.4.1 TEST STATISTICS FOR GROUP DIFFERENCES IN SURVIVAL FUNCTIONS Several statistical tests are available for evaluating group differences, and there is inconsistent labeling among programs. The tests differ primarily in how cases are weighted, with weighting based on the time that groups begin to diverge during the course of survival. For example, if the groups begin to diverge right away (untreated cases fail quickly but treated cases do not), statistics based on heavier weighting of cases that fail quickly show greater group differences than statistics for which all cases are weighted equally.

Table 11.11 summarizes statistics for differences among groups that are available in the programs. SAS LIFETEST provides three tests: the Log-Rank and Wilcoxon statistics and the likelihood-ratio test, labeled -2Log(LR), which assumes an exponential distribution of failures in each of the groups. IBM SPSS KM offers three statistics as part of the Kaplan-Meier analysis: the Log Rank test, the Tarone-Ware statistic, and the Breslow statistic, which is equivalent to the Wilcoxon statistic of SAS. IBM SPSS SURVIVAL provides an alternative form of the Wilcoxon test, the Gehan statistic, which appears to use weights intermediate between Breslow (Wilcoxon) and Tarone-Ware.
Table 11.11 Tests for Differences Among Groups in Actuarial and Product-Limit Methods

        -------------------- Nomenclature --------------------
Test    SAS LIFETEST(a)   IBM SPSS SURVIVAL   IBM SPSS KM    Comments
1       Log-Rank          N.A.                Log Rank       Equal weight to all observations
2       Tarone            N.A.                Tarone-Ware    Slightly greater weight to early observations, between test 1 and test 3
3       Wilcoxon          N.A.                Breslow        Greater weight to early observations
4       N.A.              Wilcoxon (Gehan)    N.A.           Differs slightly from test 3
5       -2Log(LR)         N.A.                N.A.           Assumes an exponential distribution of failures in each group

a. Additional SAS tests are listed in Table 11.24.
11.6.4.2 TEST STATISTICS FOR PREDICTION FROM COVARIATES Log-likelihood chi-square tests (G², as described in Section 11.6.3) are used both to test the hypothesis that all regression coefficients for covariates are zero in a Cox proportional-hazards model and to evaluate differences in models with and without a particular set of covariates, as illustrated in Section 11.6.3. The latter application, using Equation 11.10, most often evaluates the effects of treatment after adjustment for other covariates. All of these likelihood-ratio statistics are large-sample tests and are not to be taken seriously with small samples such as the example of Section 11.4.

Statistics are also available to test regression coefficients separately for each covariate. These Wald tests are z tests where the coefficient is divided by its standard error. When the test is applied to the treatment covariate, it is another test of the effect of treatment after adjustment for all other covariates. IBM SPSS COXREG provides all of the required information in a sequential run, as illustrated in Table 11.7. The last step (in which treatment is included) shows Chi-Square for Change (-2 Log Likelihood) from Previous Block as the likelihood ratio test of treatment, as well as Wald tests for both treatment and age, the covariates. SAS PHREG provides Model Chi-Square, which is overall G² with and without all covariates. A likelihood-ratio test for models with and without treatment (in which other covariates are included in both models) requires a sequential run followed by application of Equation 11.10 to the models with and without treatment. SAS LIFEREG, on the other hand, provides no overall chi-square likelihood-ratio test but does provide chi-square tests for each covariate, adjusted for all others, based on the squares of coefficients divided by their standard errors. A log-likelihood value for the whole model is also provided, so that two runs, one with and the other without treatment, provide the statistics necessary for Equation 11.10.
11.6.5 Predicting Survival Rate

11.6.5.1 REGRESSION COEFFICIENTS (PARAMETER ESTIMATES) Statistics for predicting survival from covariates require calculating regression coefficients for each covariate, where one or more of the "covariates" may represent treatment. The regression coefficients give the relative effect of each covariate on the survival function, but the size depends on the scale of the covariate. These coefficients may be used to develop a regression equation for risk as a DV. An example of this is given in Section 11.7.2.2.

11.6.5.2 HAZARD RATIOS Because survival analysis is based on a linear combination of effects in the exponent (like logistic regression, Chapter 10) rather than a simple linear combination of effects (like multiple regression, Chapter 5), effects are most often interpreted as hazard or risk ratios. How does a covariate change the risk of failure? For example, how does a one-year increase in age change the risk of dropping out of dance classes? Hazard ratios are found from a regression coefficient (B) as e^B. However, for correct interpretation, you also have to consider the direction of coding for survival. In the small-sample example (Table 11.1), dropping out is coded 1 and "surviving" (still dancing) is coded 0. Therefore, a positive regression coefficient means that an increase in age increases the likelihood of dropping out while a
negative regression coefficient means that an increase in age decreases the likelihood of dropping out. Treatment is also coded 1, 0, where 1 is used for the group that had a preinstruction night out on the town and 0 for the control group. For this variable, a change in the value of the treatment covariate from 0 to 1 means that the dancer is more likely to drop out following a night out if the regression coefficient is positive, and less likely to drop out following a night out if the regression coefficient is negative. This is because a positive regression coefficient leads to a hazard (risk) ratio greater than one, while a negative coefficient leads to a hazard (risk) ratio less than one.

Programs for Cox proportional-hazards models show both the regression coefficients and hazard ratios (see Tables 11.6 and 11.7). Regression coefficients are labeled B or Estimate. Hazard ratios are labeled Exp(B) or Hazard Ratio. Table 11.7 shows that age is significantly related to survival as a belly dancer. The negative regression coefficient (and hazard ratio less than 1) indicates that older dancers are less likely to drop out. Recall that e^B = 0.79; this indicates that the risk of dropping out is decreased by about 21% [(1 - .79)(100)] with each year of increasing age. The hazard of dropping out for a 25-year-old, for instance, is only 79% of that for a 24-year-old. (If the hazard ratio were .5 for age, it would indicate that the risk of dropping out is halved with each year of increasing age.) In some tests, the treatment covariate fails to reach statistical significance, but if we attribute this to lack of power with such a small sample, rather than a lack of treatment effectiveness, we can interpret the hazard ratio for illustrative purposes. The hazard ratio of .08 (e^-2.542) for treatment indicates that treatment decreases the risk of dropping out by 92%.

11.6.5.3 EXPECTED SURVIVAL RATES More complex methods are required for predicting expected survival rates at various time periods for particular values of covariates, as described using SAS procedures by Allison (2010, pp. 186-188). For example, what is the survivor function for 25-year-olds in the control group? This requires creating a data set with the particular covariate values of interest, for example, 0 for treatment and 25 for age. The model is run with the original data set, and then a print procedure is applied to the newly created data set. Table 11.12 shows syntax and partial output for prediction runs for two cases: a 25-year-old dancer in the control group and a 30-year-old dancer in the treated group. The likelihood of survival, column s, for a 25-year-old in the control condition drops quickly after the first month and is very low by the fifth month; on the other hand, the likelihood of survival for a 30-year-old in the treated condition stays pretty steady through the fifth month.
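A minimal sketch of these conversions, using rounded coefficients from the SAS analysis of Table 11.6:

import math

# Hazard ratio = exp(B); percent change in risk = (ratio - 1) * 100.
for name, b in [("AGE", -0.230), ("TREATMNT", -2.542)]:
    ratio = math.exp(b)
    print(f"{name}: hazard ratio {ratio:.2f}, {(ratio - 1) * 100:+.0f}% risk of dropout")
# AGE: 0.79 -> each added year lowers the risk of dropping out by about 21%
# TREATMNT: 0.08 -> treatment lowers the risk of dropping out by about 92%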
Table 11.12 Predicted Survivor Functions for 25-Year-Old Control Dancers and 30-Year-Old Treated Dancers (Syntax and Partial Output Using SAS DATA and PHREG)

data surv;
   set SASUSER.SURVIVE;
data covals;
   input TREATMNT AGE;
   datalines;
0 25
1 30
run;
proc phreg data=SASUSER.SURVIVE;
   model MONTHS*DANCING(0) = AGE TREATMNT;
   baseline out=predict covariates=covals survival=s
            lower=lcl upper=ucl / nomean;
run;
proc print data=predict;
run;
Model Fit Statistics

Criterion    Without Covariates    With Covariates
-2 LOG L     40.740                21.417
AIC          40.740                25.417
SBC          40.740                26.213
Table 11.12 (Continued)

Testing Global Null Hypothesis: BETA=0

Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    19.3233       2     <.0001

Analysis of Maximum Likelihood Estimates

Parameter    DF    Parameter Estimate    Standard Error    Chi-Square    Pr > ChiSq    Hazard Ratio
AGE          1     -0.22989              0.08945           6.6047        0.0102        0.795
TREATMNT     1     -2.54137              1.54632           2.7011        0.1003        0.079
Obs    TREATMNT    AGE    MONTHS    s          lcl        ucl
  1    0           25        0      1.00000    .          .
  2    0           25        1
  3    0           25        2      0.78242    0.55207    0.94707
  4    0           25        3      0.61716    0.33466    0.83814
  5    0           25        4      0.46688    0.19552
  6    0           25        5      0.31702    0.09016
  7    0           25        7      0.00000    0.00000
  8    0           25        8      0.00000    0.00000
  9    0           25       10      0.00000    0.00000
 10    0           25       11      0.00000    0.00000
 11    1           30        0      1.00000    .          .
 12    1           30        1      0.99864    0.99201
 13    1           30        2      0.99390    0.96734
 14    1           30        3      0.98803    0.93913
 15    1           30        4      0.98117    0.90688
 16    1           30        5      0.97174    0.86293
 17    1           30        7      0.69716    0.35039
 18    1           30        8      0.09552    0.00078
 19    1           30       10      0.00003    0.00000
 20    1           30       11      0.00000    0.00000
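The hazard-ratio arithmetic of Section 11.6.5.2 is easy to verify by hand. The following minimal SAS DATA step (the data set name hazard and its variable names are ours, not part of the Table 11.12 run) applies e^B to the two coefficients in the output above:

data hazard;
   input covariate $ b;
   hr  = exp(b);           /* hazard ratio = e**B                 */
   pct = (1 - hr) * 100;   /* percent decrease in risk when B < 0 */
   datalines;
AGE      -0.22989
TREATMNT -2.54137
;

proc print data=hazard;
run;

PROC PRINT shows hazard ratios of 0.795 for AGE and 0.079 for TREATMNT, matching the Hazard Ratio column, with risk decreases of about 21% and 92%, respectively.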
11.7 Complete Example of Survival Analysis

These experimental data are from a clinical trial of a new drug (D-penicillamine) versus a placebo for treatment of primary biliary cirrhosis (PBC) conducted at the Mayo Clinic between 1974 and 1984. The data were copied to the Internet from Appendix D of Fleming and Harrington (1991), who describe the data set as follows:

A total of 424 PBC patients, referred to Mayo Clinic during that 10-year interval, met eligibility criteria for the randomized placebo-controlled trial of the drug D-penicillamine. The first 312 cases in the data set participated in the randomized trial and contain largely complete data. (p. 359)
Thus, differences in survival time following treatment with either the experimental drug or the placebo are examined in the 312 cases, with nearly complete data, who participated in the trial. Coding for drug is 1 = D-penicillamine and 2 = placebo. Additional covariates are those in the Mayo model for "assessing survival in relation to the natural history of primary biliary cirrhosis" (Markus et al., 1989, p. 1710). These include age (in days), serum bilirubin in mg/dl, serum albumin in gm/dl, prothrombin time in seconds, and presence of edema. Edema has three levels treated as continuous: (1) no edema and no diuretic therapy for edema, coded 0.00; (2) edema present without diuretics or edema resolved by diuretics, coded 0.50; and (3) edema despite diuretic therapy, coded 1.00. Codes for STATUS are 0 = censored, 1 = liver transplant, and 2 = event (nonsurvival). Remaining variables in the data set are sex, presence versus absence of ascites, presence or absence of hepatomegaly, presence or absence of spiders, serum cholesterol in mg/dl, urine copper in ug/day, alkaline phosphatase in U/liter, SGOT in U/ml, triglyceride in mg/dl, platelets per cubic ml/1000, and histologic stage of disease. These variables were not used in the present analysis.

The primary goal of the clinical trial is to assess the effect of the experimental drug on survival time after statistically adjusting for the other covariates. A secondary goal is to assess the effects of the other covariates on survival time. Data files are SURVIVAL.*.
11.7.1 Evaluation of Assumptions

11.7.1.1 ACCURACY OF INPUT, ADEQUACY OF SAMPLE SIZE, MISSING DATA, AND DISTRIBUTIONS IBM SPSS DESCRIPTIVES is used for a preliminary look at the data, as seen in Table 11.13. The SAVE request produces standard scores for each covariate for each case, used to assess univariate outliers. The values for most of the covariates appear reasonable; for example, the average age is about 50. The sample size of 312 is adequate for survival analysis, and cases are evenly split between experimental and placebo groups (mean = 1.49 with coding of 1 and 2 for the groups). None of the covariates have missing data. However, except for age and drug (the treatment), all of the covariates are seriously skewed, with z-scores for skewness ranging from (-.582)/0.138 = -4.22 for serum albumin to (2.85)/0.138 = 20.64 for bilirubin. Kurtosis values listed in Table 11.13 pose no problem in this large sample (cf. Section 11.3.2.2). Decisions about transformation are postponed until outliers are assessed.

11.7.1.2 OUTLIERS Univariate outliers are assessed by finding z = (Y − Ȳ)/S for each covariate's lowest and highest scores. The /SAVE instruction in the IBM SPSS DESCRIPTIVES run of Table 11.13 adds a column to the data file of z-scores for each case on each covariate. An IBM SPSS DESCRIPTIVES run on these standard scores shows minimum and maximum values (Table 11.14).
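The skewness z-scores in Section 11.7.1.1 are simply skewness divided by its standard error (0.138 here, approximately the square root of 6/N for N = 312). A minimal SAS check of the two quoted values, assuming the Table 11.13 estimates:

data _null_;
   se_skew   = 0.138;               /* standard error of skewness, N = 312 */
   z_albumin = -0.582 / se_skew;    /* about -4.22 for serum albumin       */
   z_bilirub =  2.850 / se_skew;    /* about 20.6 for bilirubin            */
   put z_albumin= 6.2 z_bilirub= 6.2;
run;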
Table 11.13 Description of Covariates Through IBM SPSS DESCRIPTIVES (Syntax and Output)

DESCRIPTIVES VARIABLES=AGE ALBUMIN BILIRUBI DRUG EDEMA PROTHOM
  /SAVE
  /STATISTICS=MEAN STDDEV MIN MAX KURTOSIS SKEWNESS.
(Output: Descriptive Statistics table showing N, Minimum, Maximum, Mean, Std. Deviation, Skewness, and Kurtosis for each covariate.)
Table 12.4 (Continued)

Standardized Canonical Coefficients for the VAR Variables
      V1         V2
TS    -0.6253    0.7972
TC     0.6861    0.7456

Standardized Canonical Coefficients for the WITH Variables
      W1         W2
BS    -0.4823    0.8775
BC     0.9010    0.4368

Canonical Structure

Correlations Between the VAR Variables and Their Canonical Variables
      V1         V2
TS    -0.7358    0.6772
TC     0.7868    0.6172

Correlations Between the WITH Variables and Their Canonical Variables
      W1         W2
BS    -0.4363    0.8998
BC     0.8764    0.4816

Correlations Between the VAR Variables and the Canonical Variables of the WITH Variables
      W1         W2
TS    -0.6727    0.5163
TC     0.7193    0.4706

Correlations Between the WITH Variables and the Canonical Variables of the VAR Variables
      V1         V2
BS    -0.3988    0.6861
BC     0.8011    0.3672
In SAS CANCORR (Table 12.4), one set of variables (DVs) is listed in the statement that begins var, the other set (IVs) in the statement that begins with. Redundancy analysis also is available. The first segment of output contains the canonical correlations for each of the canonical variates (labeled 1 and 2), including adjusted and squared correlations as well as standard errors for the correlations. The next part of the table shows the eigenvalues, the difference between eigenvalues, the proportion, and the cumulative proportion of variance in the solution accounted
Table 12.5 Syntax and Selected IBM SPSS CANCORR Output for Canonical Correlation Analysis on Sample Data in Table 12.1

INCLUDE 'Canonical correlation.sps'.
CANCORR SET1 = ts, tc /
        SET2 = bs, bc/.
Run MATRIX procedure:

Correlations for Set-1
      TS         TC
TS    1.0000     -.1611
TC    -.1611     1.0000

Correlations for Set-2
      BS         BC
BS    1.0000     .0511
BC    .0511      1.0000

Correlations Between Set-1 and Set-2
      BS         BC
TS    .7580      -.3408
TC    .1096      .8570

Canonical Correlations
1    .914
2    .762

Test that remaining correlations are zero:
     Wilk's    Chi-SQ    DF       Sig.
1    .069      12.045    4.000    .017
2    .419      3.918     1.000    .048

Standardized Canonical Coefficients for Set-1
      1        2
TS    -.625    .797
TC    .686     .746

Raw Canonical Coefficients for Set-1
      1        2
TS    -.230    .293
TC    .249     .270

Standardized Canonical Coefficients for Set-2
      1        2
BS    -.482    .878
BC    .901     .437

Raw Canonical Coefficients for Set-2
      1        2
BS    -.170    .309
BC    .372     .180

Canonical Loadings for Set-1
      1        2
TS    -.736    .677
TC    .787     .617

Cross Loadings for Set-1
      1        2
TS    -.673    .516
TC    .719     .471

Canonical Loadings for Set-2
      1        2
BS    -.436    .900
BC    .876     .482

Cross Loadings for Set-2
      1        2
BS    -.399    .686
BC    .801     .367

Redundancy Analysis:

Proportion of Variance of Set-1 Explained by Its Own Can. Var.
         Prop Var
CV1-1    .580
CV1-2    .420

Proportion of Variance of Set-1 Explained by Opposite Can. Var.
         Prop Var
CV2-1    .485
CV2-2    .244

Proportion of Variance of Set-2 Explained by Its Own Can. Var.
         Prop Var
CV2-1    .479
CV2-2    .521

Proportion of Variance of Set-2 Explained by Opposite Can. Var.
         Prop Var
CV1-1    .400
CV1-2    .303

-END MATRIX-
for by each canonical variate pair. The Test of H0: ... table shows "peel off" significance tests for canonical variate pairs evaluated through F, followed in the next table by several multivariate significance tests. Matrices of raw and standardized canonical coefficients for each canonical variate, labeled 'var' and 'with' in the syntax, follow; loading matrices are labeled Correlations Between the ... Variables and Their Canonical Variables. The portion labeled Canonical Structure is part of the redundancy analysis and shows another type of loading matrix: the correlations between each set of variables and the canonical variates of the other set.

Table 12.5 shows the canonical correlation analysis as run through IBM SPSS CANCORR, a macro available through syntax. (IBM SPSS MANOVA may also be used through syntax for a canonical analysis, but the output is much more difficult to interpret.) The INCLUDE instruction invokes the IBM SPSS CANCORR macro by running the syntax file Canonical correlation.sps.6 The rather compact output begins with correlation matrices for both sets of variables individually and together. Canonical Correlations are then given, followed by their peel-down tests. Standardized and raw canonical coefficients and loadings are then shown, in the same format as SAS. Correlations between one set of variables and the canonical variates of the other set are labeled Cross Loadings. A redundancy analysis is produced by default, showing for each set the proportion of variance associated with its own and the other set. Compare these values with results of Equations 12.11 through 12.13. The program writes canonical scores to the data file and writes a scoring program to another file.

6 A copy of this syntax file is included with the SPSS data files for this book online.
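The full SAS syntax for Table 12.4 appears with the earlier portion of the table; for orientation, a minimal call has the shape below (the data set name is hypothetical), with one set in the var statement, the other in the with statement, and red requesting the redundancy analysis:

proc cancorr data=work.ssdata red;
   var  ts tc;
   with bs bc;
run;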
12.5 Some Important Issues

12.5.1 Importance of Canonical Variates

As in most statistical procedures, establishing significance is usually the first step in evaluating a solution. Conventional statistical procedures apply to significance tests for the number of canonical variate pairs. The results of Equations 12.13 and 12.14, or a corresponding F test, are available in all programs reviewed in Section 12.7. But the number of statistically significant pairs of canonical variates is often larger than the number of interpretable pairs if N is at all sizable.

The only potential source of confusion is the meaning of the chain of significance tests. The first test is for all pairs taken together and is a test of independence between the two sets of variables. The second test is for all pairs of variates with the first and most important pair of canonical variates removed; the third is done with the first two pairs removed, and so forth. If the first test, but not the second, reaches significance, then only the first pair of canonical variates is interpreted.7 If the first and second tests are significant but the third is not, then the first two pairs of variates are interpreted, and so on. Because canonical correlations are reported in descending order of importance, usually only the first few pairs of variates are interpreted.

Once significance is established, the amount of variance accounted for is of critical importance. Because there are two sets of variables, several assessments of variance are relevant. First, there is variance overlap between variates in a pair; second is the variance overlap between a variate and its own set of variables; and third is the variance overlap between a variate and the other set of variables.

The first, and easiest, is the variance overlap between each significant pair of canonical variates. As indicated in Equation 12.2, the squared canonical correlation is the overlapping variance between a pair of canonical variates. Most researchers do not interpret pairs with a canonical correlation lower than .30, even if statistically significant,8 because r_c values of .30 or lower represent, squared, less than a 10% overlap in variance.

The next consideration is the variance a canonical variate extracts from its own set of variables. A pair of canonical variates may extract very different amounts of variance from their respective sets of variables. Equations 12.11 and 12.12 indicate that the variance extracted, pv, is the sum of squared loadings on a variate divided by the number of variables in the set.9 Because canonical variates are independent of one another (orthogonal), pvs are summed across all significant variates to arrive at the total variance extracted from the variables by all the variates of the set.

The last consideration is the variance a variate from one set extracts from the variables in the other set, called redundancy (Stewart and Love, 1968; Miller and Farr, 1971). Equation 12.12 shows that redundancy is the proportion of variance extracted by a canonical variate times the squared canonical correlation for the pair. A canonical variate from the IVs may be strongly correlated with the IVs, but weakly correlated with the DVs (and vice versa). Therefore, the redundancies for a pair of canonical variates are usually not equal. Because canonical variates are orthogonal, redundancies for a set of variables are also added across canonical variates to get a total for the DVs relative to the IVs, and vice versa.

7 It is possible that the first canonical variate pair is not, by itself, significant, but rather achieves significance only in combination with the remaining canonical variate pairs. To date, there is no significance test for each pair by itself.
8 Significance depends, to a large extent, on N.
9 This calculation is identical to the one used in factor analysis for the same purpose, as shown in Table 13.4.
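All three quantities can be checked against the small-sample output in Table 12.5. In the minimal sketch below, Wilks' lambda for each peel-off test is the product of (1 − r_c^2) over the pairs not yet removed, pv averages the squared loadings (Equation 12.11), and redundancy multiplies pv by the squared canonical correlation (Equation 12.12):

data _null_;
   rc1 = .914;   rc2 = .762;              /* canonical correlations          */
   lambda1 = (1 - rc1**2) * (1 - rc2**2); /* .069, test of all pairs         */
   lambda2 = (1 - rc2**2);                /* .419, first pair removed        */
   a1 = -.736;   a2 = .787;               /* Set-1 loadings on first variate */
   pv = (a1**2 + a2**2) / 2;              /* .580, Equation 12.11            */
   rd = pv * rc1**2;                      /* .485, Equation 12.12            */
   put lambda1= 5.3 lambda2= 5.3 pv= 5.3 rd= 5.3;
run;

The printed values match the Wilk's column and the redundancy analysis in Table 12.5.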
12.5.2 Interpretation of Canonical Variates

Canonical correlation creates linear combinations of variables, canonical variates, that represent mathematically viable combinations of variables. However, although mathematically viable, they are not necessarily interpretable. A major task for the researcher is to discern, if possible, the meaning of pairs of canonical variates.

Interpretation of significant pairs of canonical variates is based on the loading matrices, A_x and A_y (Equations 12.9 and 12.10, respectively). Each pair of canonical variates is interpreted as a pair, with a variate from one set of variables interpreted vis-a-vis the variate from the other set. A variate is interpreted by considering the pattern of variables highly correlated (loaded) with it. Because the loading matrices contain correlations, and because squared correlations measure overlapping variance, variables with correlations of .30 (9% of variance) and above are usually interpreted as part of the variate, and variables with loadings below .30 are not. Deciding on a cutoff for interpreting loadings is, however, somewhat a matter of taste, although guidelines are presented in Section 13.6.5.
12.6 Complete Example of Canonical Correlation

For an example of canonical correlation, variables are selected from among those made available by research described in Appendix B, Section B.1. The goal of analysis is to discover the dimensions, if any, along which certain attitudinal variables are related to certain health characteristics. Files are CANON.*.

Selected attitudinal variables (Set 1) include attitudes toward the role of women (ATTROLE), toward locus of control (CONTROL), toward current marital status (ATTMAR), and toward self (ESTEEM). Larger numbers indicate increasingly conservative attitudes about the proper role of women, increasing feelings of powerlessness to control one's fate (external as opposed to internal locus of control), increasing dissatisfaction with current marital status, and increasingly poor self-esteem.

Selected health variables (Set 2) include mental health (MENHEAL), physical health (PHYHEAL), number of visits to health professionals (TIMEDRS), attitude toward the use of medication (ATTDRUG), and a frequency-duration measure of the use of psychotropic drugs (DRUGUSE). Larger numbers reflect poorer mental and physical health, more visits, greater willingness to use drugs, and more use of them.
12.6.1 Evaluation of Assumptions

12.6.1.1 MISSING DATA A screening run through SAS MEANS, illustrated in Table 12.6, finds missing data for 6 of the 465 cases. One woman lacks a score on CONTROL, and five lack scores on ATTMAR. With deletion of these cases (less than 2%), remaining N = 459.

12.6.1.2 NORMALITY, LINEARITY, AND HOMOSCEDASTICITY SAS provides a particularly flexible scheme for assessing normality, linearity, and homoscedasticity between pairs of canonical variates. Canonical variate scores are saved to a data file, and then PROC PLOT permits a scatterplot of them. Figure 12.3 shows two scatterplots produced by PROC PLOT for the example using default size values for the plots. The CANCORR syntax runs a preliminary canonical correlation analysis and saves the canonical variate scores (as well as the original data) to a file labeled LSSCORES. The four canonical variates for the first set are labeled V1 through V4; the canonical variates for the second set are labeled W1 through W4. Thus, the proc plot syntax requests plots pairing each canonical variate from the first set with the corresponding variate from the second set.
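A sketch of such a preliminary run follows (untransformed variable names from the screening run; the plot requests are illustrative). The out= data set receives the original variables plus scores on V1 through V4 and W1 through W4, which PROC PLOT then pairs:

proc cancorr data=SASUSER.CANON out=work.lsscores;
   var  ESTEEM CONTROL ATTMAR ATTROLE;
   with TIMEDRS ATTDRUG PHYHEAL MENHEAL DRUGUSE;
run;

proc plot data=work.lsscores;
   plot W1*V1 W2*V2;   /* scatterplots for the first two variate pairs */
run;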
Table 12.6 Syntax and Selected SAS MEANS Output for Initial Screening of Canonical Correlation Data Set

proc means data=SASUSER.CANON vardef=DF
     N NMISS MIN MAX MEAN VAR STD SKEWNESS KURTOSIS;
   var TIMEDRS ATTDRUG PHYHEAL MENHEAL ESTEEM CONTROL
       ATTMAR DRUGUSE ATTROLE;
run;
(Output: label, N, N Miss, Minimum, Maximum, Mean, Variance, Std Dev, Skewness, and Kurtosis for each of the nine variables; N = 465.)
SOURCE: Created with Base SAS 9.2 Software. Copyright 2008, SAS Institute Inc., Cary, NC, USA. All Rights Reserved. Reproduced with permission of SAS Institute Inc., Cary, NC.
Minimum and maximum standard scores are within a range of ±3.29 with the exception of a large score on ESTEEM (z = 3.34), not disturbing in a sample of over 400 cases. SAS REG is used to screen for multivariate outliers by requesting that leverage values be saved in a new data file. Table 12.9 shows syntax to run the regression analysis on the first set of variables and save the H (leverage) values into a data file labeled CANLEV. Leverage values for the first few cases are shown in the table. The critical value of Mahalanobis distance with four variables at α = .001 is 18.467. Use Equation 4.3 to convert this to a critical leverage value:

h_ii = 18.467/(459 − 1) + 1/459 = 0.0425
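The same conversion as a one-line SAS computation:

data _null_;
   chisq_crit = 18.467;                 /* critical Mahalanobis distance: 4 variables, alpha = .001 */
   n = 459;
   h_crit = chisq_crit/(n - 1) + 1/n;   /* Equation 4.3 */
   put h_crit= 6.4;                     /* 0.0425       */
run;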
There are no outliers in the segment of the data set shown in Table 12.9, or in any other segment, in either set of variables.

12.6.1.4 MULTICOLLINEARITY AND SINGULARITY SAS CANCORR protects against multicollinearity and singularity by setting a value for tolerance (sing) in the main analysis. It is not necessary to further check multicollinearity unless there is reason to expect large SMCs among variables in either set and there is a desire to eliminate logically redundant variables.
12.6.2 Canonical Correlation

The number and importance of canonical variates are determined using procedures from Section 12.5.1 (Table 12.10). RED requests redundancy statistics.

Significance of the relationships between the sets of variables is reported directly by SAS CANCORR, as shown in Table 12.10. With all four canonical correlations included, F(20, 1493.4) = 5.58, p < .001. With the first and second canonical correlations removed, F values are not significant; F(6, 904) = 0.60, p = .66. Therefore, only the first two pairs of canonical variates account for significant relationships, and these are interpreted.

Canonical correlations (r_c) and eigenvalues (r_c^2) are also in Table 12.10. The first canonical correlation is .38 (.36 adjusted), representing 14% overlapping variance for the first pair of canonical
Table 12.10 Syntax and Selected Portion of SAS CANCORR Output Showing Canonical Correlations and Significance Levels for Canonical Correlations

proc cancorr data=SASUSER.CANONT RED
     out=SASUSER.LSSCORNW sing=1E-8;
   var ESTEEM CONTROL LATTMAR ATTROLE;
   with LTIMEDRS ATTDRUG LPHYHEAL MENHEAL LDRUGUSE;
run;
(Output: canonical correlations with adjusted values, approximate standard errors, and squared canonical correlations; eigenvalues with differences, proportions, and cumulative proportions; and tests of significance of the canonical correlations with F approximations. The values are discussed in the text.)
variates (see Equation 12.2). The second canonical correlation is .27 (.26 adjusted), representing 7% overlapping variance for the second pair of canonical variates. Although highly significant, neither of these two canonical correlations represents a substantial relationship between pairs of canonical variates. Interpretation of the second canonical correlation and its corresponding pair of canonical variates is marginal.

Loading matrices between canonical variates and original variables are in Table 12.11. Interpretation of the two significant pairs of canonical variates from loadings follows procedures mentioned in Section 12.5.2. Correlations between variables and variates (loadings) in excess of .3 are interpreted. Both the direction of correlations in the loading matrices and the direction of scales of measurement are considered when interpreting the canonical variates. The first pair of canonical variates has high loadings on ESTEEM, CONTROL, and LATTMAR (.596, .784, and .730, respectively) on the attitudinal set and on LPHYHEAL and MENHEAL (.408 and .968) on the health side. Thus, low self-esteem, external locus of control, and dissatisfaction with marital status are related to poor physical and mental health.
Table 12.11 Selected SAS CANCORR Output of Loading Matrices for the Two Sets of Variables. Syntax Is in Table 12.10

Canonical Structure

Correlations Between the VAR Variables and Their Canonical Variables
                                                 V1         V2         V3         V4
ESTEEM    Self-esteem                            0.5958     0.6005     -0.2862    -0.4500
CONTROL   Locus of control                       0.7836     0.1478     -0.1771
LATTMAR   log( ATTMAR )                          0.7302     -0.3166
ATTROLE   Attitudes toward role of women         -0.0937    0.7829

Correlations Between the WITH Variables and Their Canonical Variables
                                                 W1         W2         W3         W4
LTIMEDRS  log( TIMEDRS + 1 )                     0.1229     -0.3589    -0.8601    0.2490
ATTDRUG   Attitude toward use of medication      0.0765     0.5593     -0.0332    0.4050
LPHYHEAL  log( PHYHEAL )                         0.4082     -0.0479    -0.6397    -0.5047
MENHEAL   Mental health symptoms                 0.9677     -0.1434    -0.1887    0.0655
LDRUGUSE  log( DRUGUSE + 1 )                     0.2764     -0.5479    0.0165     -0.0051
The second pair of canonical variates has high loadings on ESTEEM, LATTMAR, and ATTROLE (.601, -.317, and .783) on the attitudinal side and LTIMEDRS, ATTDRUG, and LDRUGUSE (-.359, .559, and -.548) on the health side. Big numbers on ESTEEM, little numbers on LATTMAR, and big numbers on ATTROLE go with little numbers on LTIMEDRS, big numbers on ATTDRUG, and little numbers on LDRUGUSE. That is, low self-esteem, satisfaction with marital status, and conservative attitudes toward the proper role of women in society go with few visits to physicians, favorable attitudes toward the use of drugs, and little actual use of them. (Figure that one out!)

Loadings are converted to pv values by application of Equations 12.11 and 12.12. These values are shown in the output in sections labeled Standardized Variance of the ... Variables Explained by Their Own Canonical Variables (Table 12.13). The values for the first pair of canonical variates are .38 for the first set of variables and .24 for the second set of variables. That is, the first canonical variate pair extracts 38% of variance from the attitudinal variables and 24% of variance from the health variables. The values for the second pair of canonical variates are .27 for the first set of variables and .15 for the second set; the second canonical variate pair extracts 27% of variance from the attitudinal variables and 15% of variance from the health variables. Together, the two canonical variates account for 65% of variance (38% plus 27%) in the attitudinal set, and 39% of variance (24% plus 15%) in the health set.

Redundancies for the canonical variates are found in SAS CANCORR in the sections labeled Variance of the ... Variables Explained by The Opposite Canonical Variables (Table 12.13). The first health variate accounts for 5% of the variance in the attitudinal variables, and the second health variate accounts for 2% of the variance. Together, the two health variates "explain" 7% of the variance in attitudinal variables. The first attitudinal variate accounts for 3% and the second 1% of the variance in the health set. Together, the two attitudinal variates overlap 4% of the variance in the health set.
Table 12.12 Selected SAS CANCORR Output of Redundancy Analysis (Raw Variance) for the Two Sets of Variables. Syntax Is in Table 12.10

Canonical Redundancy Analysis

Raw Variance of the VAR Variables Explained by
                 Their Own Canonical Variables                   The Opposite Canonical Variables
Canonical                     Cumulative      Canonical                        Cumulative
Variable Number  Proportion   Proportion      R-Square      Proportion        Proportion
1                0.1105       0.1105          0.1436        0.0159            0.0159
2                0.5349       0.6454          0.0720        0.0385            0.0544
3                0.2864       0.9318          0.0079        0.0023            0.0566
4                0.0682       1.0000          0.0012        0.0001            0.0567

Raw Variance of the WITH Variables Explained by
                 Their Own Canonical Variables                   The Opposite Canonical Variables
Canonical                     Cumulative      Canonical                        Cumulative
Variable Number  Proportion   Proportion      R-Square      Proportion        Proportion
1                0.8501       0.8501          0.1436        0.1221            0.1221
2                0.0456       0.8957          0.0720        0.0033            0.1253
3                0.0400       0.9357          0.0079        0.0003            0.1257
4                0.0166       0.9522          0.0012        0.0000            0.1257
Table 12.13 Selected SAS CANCORR Output of Unstandardized and Standardized Canonical Variate Coefficients. Syntax Is in Table 12.10

Canonical Redundancy Analysis

Standardized Variance of the VAR Variables Explained by
                 Their Own Canonical Variables                   The Opposite Canonical Variables
Canonical                     Cumulative      Canonical                        Cumulative
Variable Number  Proportion   Proportion      R-Square      Proportion        Proportion
1                0.3777       0.3777          0.1436        0.0542            0.0542
2                0.2739       0.6517          0.0720        0.0197            0.0740
3                0.1668       0.8184          0.0079        0.0013            0.0753
4                0.1816       1.0000          0.0012        0.0002            0.0755

Standardized Variance of the WITH Variables Explained by
                 Their Own Canonical Variables                   The Opposite Canonical Variables
Canonical                     Cumulative      Canonical                        Cumulative
Variable Number  Proportion   Proportion      R-Square      Proportion        Proportion
1                0.2401       0.2401          0.1436        0.0345            0.0345
2                0.1529       0.3930          0.0720        0.0110            0.0455
3                0.2372       0.6302          0.0079        0.0019            0.0474
4                0.0970       0.7272          0.0012        0.0001            0.0475

Canonical Correlation Analysis

Standardized Canonical Coefficients for the VAR Variables
                                                 V1         V2         V3         V4
ESTEEM    Self-esteem                            0.2446     0.6125     -0.6276    -0.6818
CONTROL   Locus of control                       0.5914     0.0272     -0.1102    0.8894
LATTMAR   log( ATTMAR )                          0.5241     -0.4493    0.7328     -0.3718
ATTROLE   Attitudes toward role of women         -0.0873    0.6206     0.7986     0.2047

Standardized Canonical Coefficients for the WITH Variables
                                                 W1         W2         W3         W4
LTIMEDRS  log( TIMEDRS + 1 )                     -0.2681    -0.3857    -0.8548    0.7809
ATTDRUG   Attitude toward use of medication      0.0458     0.7772     -0.0453    0.4480
LPHYHEAL  log( PHYHEAL )                         0.0430     0.4464     -0.4434    -1.1868
MENHEAL   Mental health symptoms                 1.0627     0.0356     0.1529     0.3802
LDRUGUSE  log( DRUGUSE + 1 )                     -0.0596    -0.8274    0.5124     -0.0501
If a goal of analysis is production of scores on canonical variates, coefficients for them are readily available. Table 12.13 shows both standardized and unstandardized coefficients for production of canonical variates. Scores on the variates themselves for each case are also produced by SAS CANCORR if an output file is requested (see syntax in Table 12.10).

A summary table of information appropriate for inclusion in a journal article appears in Table 12.14. A checklist for canonical correlation appears in Table 12.15. An example of a Results section, in journal format, follows for the complete analysis described in Section 12.6.
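As an illustration of how the standardized coefficients in Table 12.13 produce a score, the first attitudinal variate is a weighted sum of the four standardized variables; the z values below are hypothetical:

data _null_;
   /* hypothetical standardized scores for one case */
   z_esteem = 0.5;   z_control = -0.2;   z_lattmar = 1.1;   z_attrole = 0.0;
   /* standardized canonical coefficients for V1 (Table 12.13) */
   V1 = 0.2446*z_esteem + 0.5914*z_control + 0.5241*z_lattmar
        - 0.0873*z_attrole;
   put V1= 6.3;
run;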
Table 12.14 Correlations, Standardized Canonical Coefficients, Canonical Correlations, Proportions of Variance, and Redundancies Between Attitudinal and Health Variables and Their Corresponding Canonical Variates

                                                First Canonical Variate      Second Canonical Variate
                                                Correlation  Coefficient     Correlation  Coefficient
Attitudinal set
  Locus of control                              .78          .59             .15          .03
  Attitude toward current marital
    status (logarithm)                          .73          .52             -.32         -.45
  Self-esteem                                   .60          .25             .60          .61
  Attitude toward role of women                 -.09         -.09            .78          .62
  Proportion of variance                        .38                          .27          Total = .65
  Redundancy                                    .05                          .02          Total = .07
Health set
  Mental health                                 .97          1.06            -.14         .04
  Physical health (logarithm)                   .41          .04             -.05         .45
  Visits to health professionals (logarithm)    .12          -.27            -.36         -.39
  Attitude toward use of medication             .08          .05             .56          .78
  Use of psychotropic drugs (logarithm)         .28          -.06            -.55         -.83
  Proportion of variance                        .24                          .15          Total = .39
  Redundancy                                    .03                          .01          Total = .04
Canonical correlation                           .38                          .27
Table 12.15 Checklist for Canonical Correlation

1. Issues
   a. Missing data
   b. Normality, linearity, homoscedasticity
   c. Outliers
   d. Multicollinearity and singularity
2. Major analyses
   a. Significance of canonical correlations
   b. Correlations of variables and variates
   c. Variance accounted for
      (1) By canonical correlations
      (2) By same-set canonical variates
      (3) By other-set canonical variates (redundancy)
3. Additional analyses
   a. Canonical coefficients
   b. Canonical variate scores
   c. Relative weight analysis if one set of variables serves as predictors
Results

Canonical correlation was performed between a set of attitudinal variables and a set of health variables using SAS CANCORR Version 9.4. The attitudinal set included attitudes toward the role of women, toward locus of control, toward current marital status, and toward self-worth. The health set measured mental health, physical health, visits to health professionals, attitude toward use
of medication, and use of psychotropic drugs. Increasingly large numbers reflected more conservative attitudes toward women's role, external locus of control, dissatisfaction with marital status, low self-esteem, poor mental health, poor physical health, more numerous health visits, favorable attitudes toward drug use, and more drug use.

To improve linearity of relationship between variables and normality of their distributions, logarithmic transformations were applied to attitude toward marital status, visits to health professionals, physical health, and drug use. No within-set multivariate outliers were identified at p < .001, although six cases were found to be missing data on locus of control or attitude toward marital status and were deleted, leaving N = 459. Assumptions regarding within-set multicollinearity were met.

The first canonical correlation was .38 (14% overlapping variance); the second was .27 (7% overlapping variance). The remaining two canonical correlations were effectively zero. With all four canonical correlations included, F(20, 1493.4) = 5.58, p < .001, and with the first canonical correlation removed, F(12, 1193.5) = 3.20, p < .001. Subsequent F tests were not statistically significant. The first two pairs of canonical variates, therefore, accounted for the significant relationships between the two sets of variables.

Data on the first two pairs of canonical variates appear in Table 12.14. Shown in the table are correlations between the variables and the canonical variates, standardized canonical variate coefficients, within-set variance accounted for by the canonical variates (proportion of variance), redundancies, and canonical correlations. Total proportion of variance and total redundancy indicated that the first pair of canonical variates was moderately related, but the second pair was only minimally related; interpretation of the second pair was questionable.

With a cutoff correlation of .3, the variables in the attitudinal set that were correlated with the first canonical variate were locus of control, (log of) attitude toward marital status, and self-esteem. Among the health variables, mental health
and (log of) physical health correlated with the first canonical variate. The first pair of canonical variates indicated that those with external locus of control (.78), feelings of dissatisfaction toward marital status (.73), and lower self-esteem (.60) were associated with more numerous mental health symptoms (.97) and more numerous physical health symptoms (.41).

The second canonical variate in the attitudinal set was composed of attitude toward role of women, self-esteem, and negative of (log of) attitude toward marital status, and the corresponding canonical variate from the health set was composed of negative of (log of) drug use, attitude toward drugs, and negative of (log of) visits to health professionals. Taken as a pair, these variates suggested that a combination of more conservative attitudes toward the role of women (.78), lower self-esteem (.60), but relative satisfaction with marital status (-.32) was associated with a combination of more favorable attitudes toward use of drugs (.56), but lower psychotropic drug use (-.55), and fewer visits to health professionals (-.36). That is, women who had conservative attitudes toward the role of women and were happy with their marital status but had lower self-esteem were likely to have had more favorable attitudes toward drug use but fewer visits to health professionals and lower use of psychotropic drugs.
12.7 Comparison of Programs

One program is available in the SAS package for canonical analyses. IBM SPSS has two programs that may be used for canonical analysis. Table 12.16 provides a comparison of important features of the programs. If available, the program of choice is SAS CANCORR and, with limitations, the IBM SPSS CANCORR macro.
12.7.1 SAS System

SAS CANCORR is a flexible program with abundant features and ease of interpretation. Along with the basics, you can specify easily interpretable labels for canonical variates, and the program accepts several types of input matrices. Multivariate output is quite detailed, with several test criteria and voluminous redundancy analyses. Univariate output is minimal, however, and if plots are desired, case statistics such as canonical scores are written to a file to be analyzed by the SAS PLOT procedure. If requested, the program does separate multiple regressions with each variable predicted from the other set. You can also do separate canonical correlation analyses for different groups.
Table 12.16 Comparison of IBM SPSS, SAS, and SYSTAT Programs for Canonical Correlation

Feature                                  IBM SPSS MANOVA(a)  IBM SPSS CANCORR  SAS CANCORR  SYSTAT SETCOR
Input
  Correlation matrix                     Yes                 Yes               Yes          Yes
  Covariance matrix                      No                  No                Yes          No
  SSCP matrix                            No                  No                Yes          No
  Number of canonical variates           No                  Yes               Yes          No
  Tolerance                              No                  No                Yes          No
  Specify alpha                          No                  No                No           No
  Minimum canonical correlation
  Labels for canonical variates          No                  No                Yes
  Error df if residuals input            No                  No                Yes          No
  Specify partialing covariates          No                  No                Yes          Yes
Output
Univariate:
  Means                                  Yes                 No                Yes          No
  Standard deviations                    Yes                 No                Yes          No
  Confidence intervals                   Yes                 No                No           No
  Normal plots                           Yes                 No                No           No
Multivariate:
  Canonical correlations                 Yes                 Yes               Yes          Yes
  Eigenvalues                            Yes                 No                Yes          No
  Significance test                      F                   χ2                F            χ2
  Lambda                                 Yes                 No                Yes          RAO
  Additional test criteria               Yes                 No                Yes          Yes
  Correlation matrix                     Yes                 Yes               Yes          Yes
  Covariance matrix                      Yes                 No                No           No
  Loading matrix                         Yes                 Yes               Yes          Yes
  Loading matrix for opposite set        No                  Yes               Yes          No
  Raw canonical coefficients             Yes                 Yes               Yes          Yes
  Standardized canonical coefficients    Yes                 Yes               Yes          Yes
  Canonical variate scores               No                  Data file         Data file    No
  Proportion of variance                 Yes                 Yes               Yes          No
  Redundancies                           Yes                 Yes               Yes          Yes
  Stewart-Love Redundancy Index          No                  No                No           Yes
  Between-sets SMCs                      No                  No                Yes          No
  Multiple-regression analyses           DVs only            No                Yes          DVs only
  Separate analyses by groups            No                  No                Yes          No

a Additional features are listed in Sections 6.7 and 7.7.
12.7.2 IBM SPSS Package

IBM SPSS has two programs for canonical analysis, both available only through syntax: IBM SPSS MANOVA and a CANCORR macro (see Table 12.5). A complete canonical analysis is available through SPSS MANOVA, which provides loadings, proportions of variance, redundancy, and much more. But problems arise with reading the results, because MANOVA is not designed specifically for canonical analysis and some of the labels are confusing. Canonical analysis is requested through MANOVA by calling one set of variables DVs and the other set covariates; no IVs are listed. Although IBM SPSS MANOVA provides a rather complete canonical analysis, it does not calculate canonical variate scores, nor does it offer multivariate plots. Tabachnick and Fidell (1996) show an IBM SPSS MANOVA analysis of the small-sample example of Table 12.1.
Syntax and output for the IBM SPSS CANCORR macro are much simpler and easier to interpret. All of the critical information is available, however, with the peel-down tests, and a full set of correlations, canonical coefficients, and loadings. A redundancy analysis is included by default, and canonical variate scores are written to the original data set for plotting.
12.7.3 SYSTAT System

Currently, canonical analysis is most readily done through SETCOR (SYSTAT Software Inc., 2002). The program provides all of the basics of canonical correlation and several others. There is a test of overall association between the two sets of variables, as well as tests of prediction of each DV from the set of IVs. The program also provides analyses in which one set is partialed from the other set, useful for statistical adjustment of irrelevant sources of variance (as per covariates in ANCOVA) as well as for representation of curvilinear relationships and interactions. These features are well explained in the manual. The Stewart-Love canonical redundancy index also is provided. Canonical factors may be rotated.

Canonical analysis may also be done through the multivariate general linear model (GLM) program in SYSTAT. But to get all the output, the analysis must be done twice, once with the first set of variables defined as the DVs, and a second time with the other set of variables defined as DVs. The advantages over SETCOR are that canonical variate scores may be saved in a data file, and that standardized canonical coefficients are provided.
Chapter 13

Principal Components and Factor Analysis

Learning Objectives
13.1 Describe the goals of principal components analysis (PCA) and factor analysis (FA)
13.2 Identify the types of research questions addressed by PCA and FA
13.3 Describe the theoretical and practical limitations of PCA and FA
13.4 Analyze data using fundamental equations for factor analysis (optional)
13.5 Summarize extraction procedures used in the major types of factor analyses
13.6 Summarize rotation procedures used in the major types of factor analysis
13.7 Interpret results of factor analysis for a given data set
13.8 Compare the features of programs used to handle PCA and FA
13.1 General Purpose and Description

Principal components analysis (PCA) and factor analysis (FA) are statistical techniques applied to a single set of variables when the researcher is interested in discovering which variables in the set form coherent subsets that are relatively independent of one another. Variables that are correlated with one another but largely independent of other subsets of variables are combined into factors.1 Factors are thought to reflect underlying processes that have created the correlations among variables.

Suppose, for instance, a researcher is interested in studying characteristics of graduate students. The researcher measures a large sample of graduate students on personality characteristics, motivation, intellectual ability, scholastic history, familial history, health and physical characteristics, etc. Each of these areas is assessed by numerous variables; the variables all enter the analysis individually at one time, and correlations among them are studied. The analysis reveals patterns of correlation among the variables that are thought to reflect underlying processes affecting the behavior of graduate students. For instance, several individual variables from the personality measures combine with some variables from the motivation and scholastic history measures to form a factor measuring the degree to which a person prefers to work independently: an independence factor. Several variables from the intellectual ability measures combine with some others from scholastic history to suggest an intelligence factor.

1 PCA produces components while FA produces factors, but it is less confusing in this section to call the results of both analyses factors.
A major use of PCA and FA in psychology is in development of objective tests for measurement of personality and intelligence and the like. The researcher starts out with a very large number of items reflecting a first guess about the items that may eventually prove useful. The items are given to randomly selected research participants, and factors are derived. As a result of the first factor analysis, items are added and deleted, a second test is devised, and that test is given to other randomly selected participants. The process continues until the researcher has a test with numerous items forming several factors that represent the area to be measured. The validity of the factors is tested in research where predictions are made regarding differences in the behavior of persons who score high or low on a factor.

The specific goals of PCA or FA are to summarize patterns of correlations among observed variables, to reduce a large number of observed variables to a smaller number of factors, to provide an operational definition (a regression equation) for an underlying process by using observed variables, or to test a theory about the nature of underlying processes. Some or all of these goals may be the focus of a particular research project.

PCA and FA have considerable utility in reducing numerous variables down to a few factors. Mathematically, PCA and FA produce several linear combinations of observed variables, where each linear combination is a factor. The factors summarize the patterns of correlations in the observed correlation matrix and can be used, with varying degrees of success, to reproduce the observed correlation matrix. But since the number of factors is usually far fewer than the number of observed variables, there is considerable parsimony in using factor analysis. Further, when scores on factors are estimated for each participant, they are often more reliable than scores on individual observed variables.

Steps in PCA or FA include selecting and measuring a set of variables, preparing the correlation matrix (to perform either PCA or FA), extracting a set of factors from the correlation matrix, determining the number of factors, (probably) rotating the factors to increase interpretability, and, finally, interpreting the results. Although there are relevant statistical considerations to most of these steps, an important test of the analysis is its interpretability. A good PCA or FA "makes sense"; a bad one does not.

Interpretation and naming of factors depend on the meaning of the particular combination of observed variables that correlate highly with each factor. A factor is more easily interpreted when several observed variables correlate highly with it and those variables do not correlate with other factors. Once interpretability is adequate, the last, and very large, step is to verify the factor structure by establishing the construct validity of the factors. The researcher seeks to demonstrate that scores on the latent variables (factors) covary with scores on other variables, or that scores on latent variables change with experimental conditions as predicted by theory.

One of the problems with PCA and FA is that there are no readily available criteria against which to test the solution. In regression analysis, for instance, the DV is a criterion and the correlation between observed and predicted DV scores serves as a test of the solution; similarly for the two sets of variables in canonical correlation.
In discriminant function analysis, logistic regression, profile analysis, and multivariate analysis of variance, the solution is judged by how well it predicts group membership. But in PCA and FA, there is no external criterion such as group membership against which to test the solution.

A second problem with FA or PCA is that, after extraction, there is an infinite number of rotations available, all accounting for the same amount of variance in the original data, but with the factors defined slightly differently. The final choice among alternatives depends on the researcher's assessment of its interpretability and scientific utility. In the presence of an infinite number of mathematically identical solutions, researchers are bound to differ regarding which is best. Because the differences cannot be resolved by appeal to objective criteria, arguments over the best solution sometimes become vociferous. However, those who expect a certain amount of ambiguity with respect to choice of the best FA solution will not be surprised when other researchers choose a different one. Nor will they be surprised when results are not replicated exactly, if different decisions are made at one, or more, of the steps in performing FA.
A third problem is that FA is frequently used in an attempt to "save" poorly conceived research. If no other statistical procedure is applicable, at least data can usually be factor analyzed. Thus, in the minds of many, the various forms of FA are associated with sloppy research. The very power of PCA and FA to create apparent order from real chaos contributes to their somewhat tarnished reputations as scientific tools.

There are two major types of FA: exploratory and confirmatory. In exploratory FA, one seeks to describe and summarize data by grouping together variables that are correlated. The variables themselves may or may not have been chosen with potential underlying processes in mind. Exploratory FA is usually performed in the early stages of research, when it provides a tool for consolidating variables and for generating hypotheses about underlying processes. Confirmatory FA is a much more sophisticated technique used in the advanced stages of the research process to test a theory about latent processes. Variables are carefully and specifically chosen to reveal underlying processes. Currently, confirmatory FA is most often performed through structural equation modeling (Chapter 14).

Before we go on, it is helpful to define a few terms. The first terms involve correlation matrices. The correlation matrix produced by the observed variables is called the observed correlation matrix. The correlation matrix produced from factors, that is, the correlation matrix implied by the factor solution, is called the reproduced correlation matrix. The difference between observed and reproduced correlation matrices is the residual correlation matrix. In a good FA, correlations in the residual matrix are small, indicating a close fit between the observed and reproduced matrices.

A second set of terms refers to matrices produced and interpreted as part of the solution. Rotation of factors is a process by which the solution is made more interpretable without changing its underlying mathematical properties. There are two general classes of rotation: orthogonal and oblique. If rotation is orthogonal (so that all the factors are uncorrelated with each other), a loading matrix is produced. The loading matrix is a matrix of correlations between observed variables and factors. The sizes of the loadings reflect the extent of relationship between each observed variable and each factor. Orthogonal FA is interpreted from the loading matrix by looking at which observed variables correlate with each factor.

If rotation is oblique (so that the factors themselves are correlated), several additional matrices are produced. The factor correlation matrix contains the correlations among the factors. The loading matrix from orthogonal rotation splits into two matrices for oblique rotation: a structure matrix of correlations between factors and variables, and a pattern matrix of unique relationships (uncontaminated by overlap among factors) between each factor and each observed variable. Following oblique rotation, the meaning of factors is ascertained from the pattern matrix. Lastly, for both types of rotations, there is a factor-score coefficients matrix, a matrix of coefficients used in several regression-like equations to predict scores on factors from scores on observed variables for each individual.

FA produces factors, while PCA produces components.
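In SAS PROC FACTOR, the matrices just defined can all be requested directly; a minimal sketch with hypothetical data set and variable names appears below. The residuals option prints the reproduced and residual correlation matrices, and an oblique rotation such as promax prints the inter-factor correlation matrix along with the pattern and structure matrices:

proc factor data=work.items nfactors=2 residuals rotate=promax;
   var item1-item10;   /* observed variables */
run;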
However, the processes are similar except in preparation of the observed correlation matrix for extraction and in the underlying theory. Mathematically, the difference between PCA and FA is in the variance that is analyzed. In PCA, all the variances in the observed variables are analyzed. In FA, only shared variance is analyzed; attempts are made to estimate and eliminate variance due to error and variance that is unique to each variable. The term factor is used here to refer to both components and factors unless the distinction is critical, in which case the appropriate term is used.

Theoretically, the difference between FA and PCA lies in the reason that variables are associated with a factor or component. Factors are thought to "cause" variables: the underlying construct (the factor) is what produces scores on the variables. Thus, exploratory FA is associated with theory development and confirmatory FA is associated with theory testing. The question in exploratory FA is: What are the underlying processes that could have produced correlations among these variables? The question in confirmatory FA is: Are the correlations among variables consistent with a hypothesized factor structure? Components are simply aggregates of
correlated variables. In that sense, the variables "cause," or produce, the component. There is no underlying theory about which variables should be associated with which factors; they are simply empirically associated. It is understood that any labels applied to derived components are merely convenient descriptions of the combination of variables associated with them, and do not necessarily reflect some underlying process.

Tiffin, Kaplan, and Place (2011) gathered responses from 673 adolescents (age range 12-18) for an exploratory FA of a 75-item test of perceptions of family functioning. Items were derived from previous work that seemed to indicate a five-factor structure and from feedback from both professionals and other adolescents. The exploratory FA with varimax rotation revealed three significant factors accounting for 73% of the variance. However, the first factor seemed to be a composite of three themes; the 75 items were pruned and a five-factor structure was accepted with between five and seven items per factor. The five factors were labeled Nurture, Problem Solving, Expressed Emotion, Behavioral Boundaries, and Responsibility.

LaVeist, Isaac, and Williams (2009) used principal components analysis to reduce a 17-item Medical Mistrust Index to 7 items. They used a nicely constructed telephone survey of 401 persons to identify the first principal component and then a follow-up interview of 327 of them three weeks later to investigate utilization of health services. Those who scored higher on the 7-item Medical Mistrust Index were more likely to fail to take medical advice, fail to keep a follow-up appointment, postpone receiving needed care, and fail to fill a prescription; they were not, however, more likely to fail to get needed medical care.

Kinoshita and Miyashita (2011) used maximum likelihood extraction and promax rotation to study difficulties felt by ICU nurses in providing end-of-life care. A total of 224 ICU nurses from the Kanto region of Japan participated in the first part of the study. The researchers hypothesized that the nurses would have difficulties in nine different areas, and generated 75 items to assess those areas. However, the FA revealed only five factors with a total of 28 items. Although promax rotation might have revealed correlated factors, the highest actual correlation between two factors was .51. The final five factors were called "the purpose of the ICU is recovery and survival"; "nursing system and model nurse for end-of-life care"; "building confidence in end-of-life care"; "caring for patients and families at end-of-life"; and "converting from curative care to end-of-life care".
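In SAS PROC FACTOR, the mathematical difference between PCA and FA described above reduces to what is placed on the diagonal of the correlation matrix; the sketch below uses hypothetical data set and variable names. With priors=one, all the variance is analyzed (PCA); with priors=smc, squared multiple correlations replace the ones so that only shared variance is analyzed; the last call corresponds to maximum likelihood extraction with promax rotation, the combination used in the Kinoshita and Miyashita study:

proc factor data=work.grad method=principal priors=one;   /* PCA: all variance        */
   var var1-var6;
run;

proc factor data=work.grad method=principal priors=smc;   /* FA: shared variance only */
   var var1-var6;
run;

proc factor data=work.grad method=ml nfactors=2 rotate=promax;  /* ML extraction, oblique rotation */
   var var1-var6;
run;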
13.2 Kinds of Research Questions

The goal of research using PCA or FA is to reduce a large number of variables to a smaller number of factors, to concisely describe (and perhaps understand) the relationships among observed variables, or to test theory about underlying processes. Some of the specific questions that are frequently asked are presented in Sections 13.2.1 through 13.2.5.
13.2.1 Number of Factors

How many reliable and interpretable factors are there in the data set? How many factors are needed to summarize the pattern of correlations in the correlation matrix? In the graduate student example, two factors are discussed (independence and intellectual ability): Are these both reliable? Are there any more factors that are reliable? Strategies for choosing an appropriate number of factors and for assessing the correspondence between observed and reproduced correlation matrices are discussed in Section 13.6.2.
13.2.2 Nature of Factors

What is the meaning of the factors? How are the factors to be interpreted? Factors are interpreted by the variables that correlate with them. Rotation to improve interpretability is discussed in Section 13.6.3; interpretation itself is discussed in Section 13.6.5.
13.2.3 Importance of Solutions and Factors

How much variance in a data set is accounted for by the factors? Which factors account for the most variance? In the graduate student example, does the independence or intellectual ability factor account for more of the variance in the measured variables? How much variance does each account for? In a good factor analysis, a high percentage of the variance in the observed variables is accounted for by the first few factors. And, because factors are computed in descending order of magnitude, the first factor accounts for the most variance, with later factors accounting for less and less of the variance until they are no longer reliable. Methods for assessing the importance of solutions and factors are in Section 13.6.4.
13.2.4 Testing Theory in FA

How well does the obtained factor solution fit an expected factor solution? If the researcher had generated hypotheses regarding both the number and the nature of the factors expected of graduate students, comparisons between the hypothesized factors and the factor solution provide a test of the hypotheses. Tests of theory in FA are addressed, in preliminary form, in Sections 13.6.2 and 13.6.7.

More highly developed techniques are available for testing theory in complex data sets in the form of structural equation modeling, which can also be used to test theory regarding factor structure. These techniques are sometimes known by the names of the most popular programs for doing them: EQS and LISREL. Structural equation modeling is the focus of Chapter 14. Confirmatory FA is demonstrated in Section 14.7.
13.2.5 Estimating Scores on Factors

Had factors been measured directly, what scores would participants have received on each of them? For instance, if each graduate student were measured directly on independence and intelligence, what scores would each student receive for each of them? Estimation of factor scores is the topic of Section 13.6.6.
13.3 Limitations

13.3.1 Theoretical Issues

Most applications of PCA or FA are exploratory in nature; FA is used primarily as a tool for reducing the number of variables or examining patterns of correlations among variables. Under these circumstances, both the theoretical and the practical limitations to FA are relaxed in favor of a frank exploration of the data. Decisions about number of factors and rotational scheme are based on pragmatic rather than theoretical criteria.

The research project that is designed specifically to be factor analyzed, however, differs from other projects in several important respects. Among the best detailed discussions of the differences is the one found in Comrey and Lee (1992), from which some of the following discussion is taken.

The first task of the researcher is to generate hypotheses about factors believed to underlie the domain of interest. Statistically, it is important to make the research inquiry broad enough to include five or six hypothesized factors so that the solution is stable. Logically, in order to reveal the processes underlying a research area, all relevant factors have to be included. Failure to measure some important factor may distort the apparent relationships among measured factors. Inclusion of all relevant factors poses a logical, but not statistical, problem to the researcher.

Next, one selects variables to observe. For each hypothesized factor, five or six variables, each thought to be a relatively pure measure of the factor, are included. Pure measures are called marker variables. Marker variables are highly correlated with one and only one factor and load on it regardless of extraction or rotation technique. Marker variables are useful
because they define clearly the nature of a factor; adding potential variables to a factor to round it out is much more meaningful if the factor is unambiguously defined by marker variables to begin with.

The complexity of the variables is also considered. Complexity is indicated by the number of factors with which a variable correlates. A pure variable, which is preferred, is correlated with only one factor, whereas a complex variable is correlated with several. If variables differing in complexity are all included in an analysis, those with similar complexity levels may "catch" each other in factors that have little to do with underlying processes. Variables with similar complexity may correlate with each other because of their complexity and not because they relate to the same factor. Estimating (or avoiding) the complexity of variables is part of generating hypotheses about factors and selecting variables to measure them.

Several other considerations are required of the researcher planning a factor analytic study. It is important, for instance, that the sample chosen exhibits spread in scores with respect to the variables and the factors they measure. If all participants achieve about the same score on some factor, correlations among the observed variables are low and the factor may not emerge in analysis. Selection of participants expected to differ on the observed variables and underlying factors is an important design consideration.

One should also be wary of pooling the results of several samples, or the same sample with measures repeated in time, for factor analytic purposes. First, samples that are known to be different with respect to some criterion (e.g., socioeconomic status) may also have different factors. Examination of group differences is often quite revealing. Second, underlying factor structure may shift in time for the same participants with learning or with experience in an experimental setting, and these differences may also be quite revealing. Pooling results from diverse groups in FA may obscure differences rather than illuminate them. On the other hand, if different samples do produce the same factors, pooling them is desirable because of the increase in sample size. For example, if men and women produce the same factors, the samples should be combined and the results of the single FA reported.
13.3.2 Practical Issues

Because FA and PCA are exquisitely sensitive to the sizes of correlations, it is critical that honest correlations be employed. Sensitivity to outlying cases, problems created by missing data, and degradation of correlations between poorly distributed variables all plague FA and PCA. A review of these issues in Chapter 4 is important to FA and PCA. Thoughtful solutions to some of the problems, including variable transformations, may markedly enhance FA, whether performed for exploratory or confirmatory purposes. However, the limitations apply with greater force to confirmatory FA if done through FA rather than SEM programs.

13.3.2.1 SAMPLE SIZE AND MISSING DATA Correlation coefficients tend to be less reliable when estimated from small samples. Therefore, it is important that sample size be large enough that correlations are reliably estimated. The required sample size also depends on magnitude of population correlations and number of factors: if there are strong correlations and a few, distinct factors, a smaller sample size is adequate. MacCallum, Widaman, Zhang, and Hong (1999) show that samples in the range of 100-200 are acceptable with well-determined factors (i.e., most factors defined by many indicators, i.e., marker variables with loadings > .80) and communalities (squared multiple correlations among variables) in the range of .5 or greater. At least 300 cases are needed with low communalities, a small number of factors, and just three or four indicators for each factor. Sample sizes well over 500 are required under the worst conditions of low communalities and a larger number of weakly determined factors. Impact of sample size is reduced with consistently high communalities (all greater than .6) and well-determined factors. In such cases, samples well below 100 are acceptable, although such small samples run the computational risk of failure of the solution to converge. MacCallum et al. recommend designing studies in which variables are selected to provide as high a level of communalities as possible (a mean level of at least .7) with a small range of
variation, a small to moderate number of factors (say, three to five), and several indicators per factor (say, five to seven).

If cases have missing data, either the missing values are estimated or the cases deleted. Consult Chapter 4 for methods of finding and estimating missing values. Consider the distribution of missing values (is it random?) and remaining sample size when deciding between estimation and deletion. If cases are missing values in a nonrandom pattern or if sample size becomes too small, estimation is in order. However, beware of using estimation procedures (such as regression) that are likely to overfit the data and cause correlations to be too high. These procedures may "create" factors.

13.3.2.2 NORMALITY As long as PCA and FA are used descriptively as convenient ways to summarize the relationships in a large set of observed variables, assumptions regarding the distributions of variables are relaxed. If variables are normally distributed, the solution is enhanced. To the extent that normality fails, the solution is degraded but may still be worthwhile. However, multivariate normality is assumed when statistical inference is used to determine the number of factors. Multivariate normality is the assumption that all variables, and all linear combinations of variables, are normally distributed. Although tests of multivariate normality are overly sensitive, normality among single variables is assessed by skewness and kurtosis (see Chapter 4 and Section 13.7.1.2). If a variable has substantial skewness and kurtosis, variable transformation is considered. Some SEM (Chapter 14) programs (e.g., Mplus and EQS) permit PCA/FA with nonnormal variables.

13.3.2.3 LINEARITY Multivariate normality also implies that relationships among pairs of variables are linear. The analysis is degraded when linearity fails, because correlation measures linear relationship and does not reflect nonlinear relationship. Linearity among pairs of variables is assessed through inspection of scatterplots. Consult Chapter 4 and Section 13.7.1.3 for methods of screening for linearity. If nonlinearity is found, transformation of variables is considered.

13.3.2.4 ABSENCE OF OUTLIERS AMONG CASES As in all multivariate techniques, cases may be outliers either on individual variables (univariate) or on combinations of variables (multivariate). Such cases have more influence on the factor solution than other cases. Consult Chapter 4 and Section 13.7.1.4 for methods of detecting and reducing the influence of both univariate and multivariate outliers.

13.3.2.5 ABSENCE OF MULTICOLLINEARITY AND SINGULARITY In PCA, multicollinearity is not a problem because there is no need to invert a matrix. For most forms of FA and for estimation of factor scores in any form of FA, singularity or extreme multicollinearity is a problem. For FA, if the determinant of R and eigenvalues associated with some factors approach 0, multicollinearity or singularity may be present.
To investigate further, look at the SMCs for each variable where it serves as DV with all other variables as IVs. If any of the SMCs is one, singularity is present; if any of the SMCs is very large (near one), multicollinearity is present. Delete the variable with multicollinearity or singularity. Chapter 4 and Section 13.7.1.5 provide examples of screening for and dealing with multicollinearity and singularity.

13.3.2.6 FACTORABILITY OF R A matrix that is factorable should include several sizable correlations. The expected size depends, to some extent, on N (larger sample sizes tend to produce smaller correlations), but if no correlation exceeds .30, use of FA is questionable because there is probably nothing to factor analyze. Inspect R for correlations in excess of .30, and, if none is found,
reconsider use of FA. High bivariate correlations, however, are not ironclad proof that the correlation matrix contains factors. It is possible that the correlations are between only two variables and do not reflect the underlying processes that are simultaneously affecting several variables. For this reason, it is helpful to examine matrices of partial correlations where pairwise correlations are adjusted for effects of all other variables. If there are factors present, high bivariate correlations become very low partial correlations. IBM SPSS and SAS produce partial correlation matrices.
Bartlett's (1954) test of sphericity is a notoriously sensitive test of the hypothesis that the correlations in a correlation matrix are zero. The test is available in IBM SPSS FACTOR, but because of its sensitivity and its dependence on N, the test is likely to be significant with samples of substantial size even if correlations are very low. Therefore, use of the test is recommended only if there are fewer than, say, five cases per variable.

Several more sophisticated tests of the factorability of R are available through IBM SPSS and SAS. Both programs give significance tests of correlations, the anti-image correlation matrix, and Kaiser's (1970, 1974) measure of sampling adequacy. Significance tests of correlations in the correlation matrix provide an indication of the reliability of the relationships between pairs of variables. If R is factorable, numerous pairs are significant. The anti-image correlation matrix contains the negatives of partial correlations between pairs of variables with effects of other variables removed. If R is factorable, there are mostly small values among the off-diagonal elements of the anti-image matrix. Finally, Kaiser's measure of sampling adequacy is a ratio of the sum of squared correlations to the sum of squared correlations plus the sum of squared partial correlations. The value approaches 1 if partial correlations are small. Values of .6 and above are required for good FA. (A computational sketch of this measure appears at the end of this section.)

13.3.2.7 ABSENCE OF OUTLIERS AMONG VARIABLES After FA, in both exploratory and confirmatory FA, variables that are unrelated to others in the set are identified. These variables are usually not correlated with the first few factors, although they often correlate with factors extracted later. These factors are usually unreliable, both because they account for very little variance and because factors that are defined by just one or two variables are not stable. Therefore, one never knows whether these factors are "real." Suggestions for determining reliability of factors defined by one or two variables are in Section 13.6.2. If the variance accounted for by a factor defined by only one or two variables is high enough, the factor is interpreted with great caution or is ignored, as pragmatic considerations dictate. In confirmatory FA done through FA rather than SEM programs, the factor represents either a promising lead for future work or (probably) error variance, but its interpretation awaits clarification by more research.
A variable with a low squared multiple correlation with all other variables and low correlations with all important factors is an outlier among the variables. The variable is usually ignored in the current FA and either deleted or given friends in future research. Screening for outliers among variables is illustrated in Section 13.7.1.7.
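Kaiser's measure of sampling adequacy, described above, can be computed directly from R. The following minimal sketch is ours (in Python with NumPy, not an SPSS or SAS routine; the helper name kmo is hypothetical) and uses the standard identity that partial correlations with all other variables removed are obtainable from the inverse of R:

```python
import numpy as np

def kmo(R):
    """Kaiser's measure of sampling adequacy for a correlation matrix R.

    Ratio of the sum of squared correlations to the sum of squared
    correlations plus the sum of squared partial correlations, taken
    over the off-diagonal elements.
    """
    R_inv = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(R_inv), np.diag(R_inv)))
    partials = -R_inv / d                  # partial (anti-image) correlations
    off = ~np.eye(R.shape[0], dtype=bool)  # mask for off-diagonal elements
    r2 = (R[off] ** 2).sum()
    p2 = (partials[off] ** 2).sum()
    return r2 / (r2 + p2)                  # approaches 1 when partials are small
```

Values of .6 or above from such a computation support factorability by Kaiser's criterion.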
13.4 Fundamental Equations for Factor Analysis

Because of the variety and complexity of the calculations involved in preparing the correlation matrix, extracting factors, and rotating them, and because, in our judgment, little insight is produced by demonstrations of some of these procedures, this section does not show them all. Instead, the relationships between some of the more important matrices are shown, with an assist from IBM SPSS FACTOR for underlying calculations.

Table 13.1 lists many of the important matrices in FA and PCA. Although the list is lengthy, it is composed mostly of matrices of correlations (between variables, between factors, and between variables and factors), matrices of standard scores (on variables and on factors), matrices of regression weights (for producing scores on factors from scores on variables), and the pattern matrix of unique relationships between factors and variables after oblique rotation. Also in the table are the matrix of eigenvalues and the matrix of their corresponding eigenvectors. Eigenvalues and eigenvectors are discussed here and in Appendix A, albeit scantily, because of their importance in factor extraction, the frequency with which one encounters the terminology, and the close association between eigenvalues and variance in statistical applications.
Table 13.1 Commonly Encountered Matrices in Factor Analyses

Label  Name                               Rotation                     Size^a   Description
R      Correlation matrix                 Both orthogonal and oblique  p x p    Matrix of correlations between variables
Z      Variable matrix                    Both orthogonal and oblique  N x p    Matrix of standardized observed variable scores
F      Factor-score matrix                Both orthogonal and oblique  N x m    Matrix of standardized scores on factors or components
A      Factor loading matrix (orthogonal);
       pattern matrix (oblique)           Orthogonal; oblique          p x m    Matrix of regression-like weights used to estimate the unique contribution of each factor to the variance in a variable. If orthogonal, also correlations between variables and factors
B      Factor-score coefficients matrix   Both orthogonal and oblique  p x m    Matrix of regression-like weights used to generate factor scores from variables
C      Structure matrix^b                 Oblique                      p x m    Matrix of correlations between variables and (correlated) factors
Φ      Factor correlation matrix          Oblique                      m x m    Matrix of correlations among factors
L      Eigenvalue matrix^c                Both orthogonal and oblique  m x m    Diagonal matrix of eigenvalues, one per factor^e
V      Eigenvector matrix^d               Both orthogonal and oblique  p x m    Matrix of eigenvectors, one vector per eigenvalue

^a Row by column dimensions, where p = number of variables, N = number of participants, and m = number of factors or components.
^b In most textbooks, the structure matrix is labeled S. However, we have used S to represent the sum-of-squares and cross-products matrix elsewhere and will use C for the structure matrix here.
^c Also called characteristic roots or latent roots.
^d Also called characteristic vectors.
^e If the matrix is of full rank, there are actually p rather than m eigenvalues and eigenvectors. Only m are of interest, however, so the remaining p - m are not displayed.
A data set appropriate for FA consists of numerous participants each measured on several variables. A grossly inadequate data set appropriate for FA is in Table 13.2. Five skiers who were trying on ski boots late on a Friday night in January were asked about the importance of each of four variables to their selection of a ski resort. The variables were cost of ski ticket (COST), speed of ski lift (LIFT), depth of snow (DEPTH), and moisture of snow (POWDER). Larger numbers indicate greater importance. The researcher wanted to investigate the pattern of relationships among the variables in an effort to understand better the dimensions underlying choice of ski area.
Table 13.2 Small Sample of Hypothetical Data for Illustration of Factor Analysis

Variables

Skiers   COST   LIFT   DEPTH   POWDER
S1       32     64     65      67
S2       61     37     62      65
S3       59     40     45      43
S4       36     62     34      35
S5       62     46     43      40

Correlation Matrix

          COST    LIFT    DEPTH   POWDER
COST     1.000   -.953   -.055   -.130
LIFT     -.953   1.000   -.091   -.036
DEPTH    -.055   -.091   1.000    .990
POWDER   -.130   -.036    .990   1.000
Notice the pattern of correlations in the correlation matrix as set off by the vertical and horizontal lines. The strong correlations in the upper left and lower right quadrants show that scores on COST and LIFT are related, as are scores on DEPTH and POWDER. The other two quadrants show that scores on DEPTH and LIFT are unrelated, as are scores on POWDER and LIFT, and so on. With luck, FA will find this pattern of correlations, easy to see in a small correlation matrix but not in a very large one.
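The quadrant pattern is easy to verify numerically. The following minimal sketch (in Python with NumPy; our own check, not part of the SPSS or SAS runs shown later) recomputes the correlation matrix from the raw ratings in Table 13.2:

```python
import numpy as np

# Ratings from Table 13.2: rows are the five skiers;
# columns are COST, LIFT, DEPTH, POWDER.
X = np.array([
    [32, 64, 65, 67],
    [61, 37, 62, 65],
    [59, 40, 45, 43],
    [36, 62, 34, 35],
    [62, 46, 43, 40],
], dtype=float)

R = np.corrcoef(X, rowvar=False)  # 4 x 4 correlation matrix
print(R.round(3))
# COST-LIFT (about -.95) and DEPTH-POWDER (about .99) are strong;
# the cross-quadrant correlations are near zero.
```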
13.4.1 Extraction

An important theorem from matrix algebra indicates that, under certain conditions, matrices can be diagonalized. Correlation and covariance matrices are among those that often can be diagonalized. When a matrix is diagonalized, it is transformed into a matrix with numbers in the positive diagonal² and zeros everywhere else. In this application, the numbers in the positive diagonal represent variance from the correlation matrix that has been repackaged as follows:

$$\mathbf{L} = \mathbf{V}'\mathbf{R}\mathbf{V} \qquad (13.1)$$
Diagonalization of R is accomplished by post- and pre-multiplying it by the matrix V and its transpose. The columns in V are called eigenvectors, and the values in the main diagonal of L are called eigenvalues. The first eigenvector corresponds to the first eigenvalue, and so forth. Because there are four variables in the example, there are four eigenvalues with their corresponding eigenvectors. However, because the goal of FA is to summarize a pattern of correlations with as few factors as possible, and because each eigenvalue corresponds to a different potential factor, usually only factors with large eigenvalues are retained. In a good FA, these few factors almost duplicate the correlation matrix. In this example, when no limit is placed on the number of factors, eigenvalues of 2.02, 1.94, .04, and .00 are computed for each of the four possible factors. Only the first two factors, with values over 1.00, are large enough to be retained in subsequent analyses. FA is rerun specifying extraction of just the first two factors; they have eigenvalues of 2.00 and 1.91, respectively, as indicated in Table 13.3. Using Equation 13.1 and inserting the values from the example, we obtain
$$\mathbf{L} = \begin{bmatrix} -.283 & .177 & .658 & .675 \\ .651 & -.685 & .252 & .207 \end{bmatrix} \begin{bmatrix} 1.000 & -.953 & -.055 & -.130 \\ -.953 & 1.000 & -.091 & -.036 \\ -.055 & -.091 & 1.000 & .990 \\ -.130 & -.036 & .990 & 1.000 \end{bmatrix} \begin{bmatrix} -.283 & .651 \\ .177 & -.685 \\ .658 & .252 \\ .675 & .207 \end{bmatrix} = \begin{bmatrix} 2.00 & .00 \\ .00 & 1.91 \end{bmatrix}$$
Table 13.3 Eigenvectors and Corresponding Eigenvalues for the Example

Eigenvector 1   Eigenvector 2
 -.283            .651
  .177           -.685
  .658            .252
  .675            .207

Eigenvalue 1: 2.00   Eigenvalue 2: 1.91

² The positive diagonal runs from upper left to lower right in a matrix.
(All values agree with computer output. Hand calculation may produce discrepancies due to rounding error.) The matrix of eigenvectors pre-multiplied by its transpose produces the identity matrix with ones in the positive diagonal and zeros elsewhere. Therefore, pre- and post-multiplying the correlation matrix by eigenvectors does not change it so much as repackage it.

$$\mathbf{V}'\mathbf{V} = \mathbf{I} \qquad (13.2)$$

For the example:

$$\begin{bmatrix} -.283 & .177 & .658 & .675 \\ .651 & -.685 & .252 & .207 \end{bmatrix} \begin{bmatrix} -.283 & .651 \\ .177 & -.685 \\ .658 & .252 \\ .675 & .207 \end{bmatrix} = \begin{bmatrix} 1.000 & .000 \\ .000 & 1.000 \end{bmatrix}$$
The important point is that because correlation matrices often meet requirements for diagonalizability, it is possible to use on them the matrix algebra of eigenvectors and eigenvalues with FA as the result. When a matrix is diagonalized, the information contained in it is repackaged. In FA, the variance in the correlation matrix is condensed into eigenvalues. The factor with the largest eigenvalue has the most variance and so on, down to factors with small or negative eigenvalues that are usually omitted from solutions. Calculations for eigenvectors and eigenvalues are extremely laborious and not particularly enlightening (although they are illustrated in Appendix A for a small matrix). They require solving p equations in p unknowns with additional side constraints and are rarely performed by hand. Once the eigenvalues and eigenvectors are known, however, the rest of FA (or PCA) more or less "falls out," as is seen from Equations 13.3 to 13.6. Equation 13.1 can be reorganized as follows:

$$\mathbf{R} = \mathbf{V}\mathbf{L}\mathbf{V}' \qquad (13.3)$$
The correlation matrix can be considered a product of three matrices: the matrices of eigenvalues and corresponding eigenvectors. After reorganization, the square root is taken of the matrix of eigenvalues.

$$\mathbf{R} = \mathbf{V}\sqrt{\mathbf{L}}\sqrt{\mathbf{L}}\mathbf{V}' \qquad (13.4)$$

or

$$\mathbf{R} = (\mathbf{V}\sqrt{\mathbf{L}})(\sqrt{\mathbf{L}}\mathbf{V}')$$

If $\mathbf{V}\sqrt{\mathbf{L}}$ is called A and $\sqrt{\mathbf{L}}\mathbf{V}'$ is A', then

$$\mathbf{R} = \mathbf{A}\mathbf{A}' \qquad (13.5)$$

The correlation matrix can also be considered a product of two matrices, each a combination of eigenvectors and the square root of eigenvalues. Equation 13.5 is frequently called the fundamental equation for FA.³ It represents the assertion that the correlation matrix is a product of the factor loading matrix, A, and its transpose. Equations 13.4 and 13.5 also reveal that the major work of FA (and PCA) is calculation of eigenvalues and eigenvectors. Once they are known, the (unrotated) factor loading matrix is found by straightforward matrix multiplication, as follows:

$$\mathbf{A} = \mathbf{V}\sqrt{\mathbf{L}} \qquad (13.6)$$
³ In order to reproduce the correlation matrix exactly, as indicated in Equations 13.4 and 13.5, all eigenvalues and eigenvectors are necessary, not just the first few of them.
For the example:

$$\mathbf{A} = \begin{bmatrix} -.283 & .651 \\ .177 & -.685 \\ .658 & .252 \\ .675 & .207 \end{bmatrix} \begin{bmatrix} \sqrt{2.00} & 0 \\ 0 & \sqrt{1.91} \end{bmatrix} = \begin{bmatrix} -.400 & .900 \\ .251 & -.947 \\ .932 & .348 \\ .956 & .286 \end{bmatrix}$$
The factor loading matrix is a matrix of correlations between factors and variables. The first column is correlations between the first factor and each variable in turn: COST (-.400), LIFT (.251), DEPTH (.932), and POWDER (.956). The second column is correlations between the second factor and each variable in turn: COST (.900), LIFT (-.947), DEPTH (.348), and POWDER (.286). A factor is interpreted from the variables that are highly correlated with it, that is, that have high loadings on it. Thus, the first factor is primarily a snow conditions factor (DEPTH and POWDER), while the second reflects resort conditions (COST and LIFT). Skiers who score high on the resort conditions factor (Equation 13.11) tend to assign high value to COST and low value to LIFT (the negative correlation); skiers who score low on the resort conditions factor value LIFT more than COST. Notice, however, that all the variables are correlated with both factors to a considerable extent. Interpretation is fairly clear for this hypothetical example, but most likely would not be for real data. Usually, a factor is most interpretable when a few variables are highly correlated with it and the rest are not.
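Extraction is easy to mimic numerically. A minimal NumPy sketch follows; note that it decomposes R itself (so it reproduces the PCA-style eigenvalues 2.02 and 1.94 rather than the 2.00 and 1.91 obtained from the rerun with communalities in the diagonal), and that eigenvector signs are arbitrary and may be flipped relative to the text:

```python
import numpy as np

# Correlation matrix from Table 13.2 (COST, LIFT, DEPTH, POWDER).
R = np.array([
    [1.000, -.953, -.055, -.130],
    [-.953, 1.000, -.091, -.036],
    [-.055, -.091, 1.000,  .990],
    [-.130, -.036,  .990, 1.000],
])

# np.linalg.eigh returns eigenvalues in ascending order for a
# symmetric matrix, so reverse to put the largest factor first.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
L = eigvals[order][:2]        # the two large eigenvalues
V = eigvecs[:, order][:, :2]  # their eigenvectors

A = V * np.sqrt(L)            # Equation 13.6: unrotated loadings
print(L.round(2))
print(A.round(3))
```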
13.4.2 Orthogonal Rotation

Rotation is ordinarily used after extraction to maximize high correlations between factors and variables and minimize low ones. Numerous methods of rotation are available (see Section 13.5.2), but the most commonly used, and the one illustrated here, is varimax. Varimax is a variance-maximizing procedure. The goal of varimax rotation is to maximize the variance of factor loadings by making high loadings higher and low ones lower for each factor. This goal is accomplished by means of a transformation matrix Λ (as defined in Equation 13.8), where

$$\mathbf{A}_{\text{unrotated}}\boldsymbol{\Lambda} = \mathbf{A}_{\text{rotated}} \qquad (13.7)$$

The unrotated factor loading matrix is multiplied by the transformation matrix to produce the rotated loading matrix. For the example:
$$\mathbf{A}_{\text{rotated}} = \begin{bmatrix} -.400 & .900 \\ .251 & -.947 \\ .932 & .348 \\ .956 & .286 \end{bmatrix} \begin{bmatrix} .946 & -.325 \\ .325 & .946 \end{bmatrix} = \begin{bmatrix} -.086 & .981 \\ -.071 & -.977 \\ .994 & .026 \\ .997 & -.040 \end{bmatrix}$$
Compare the rotated and unrotated loading matrices. Notice that in the rotated matrix the low correlations are lower and the high ones are higher than in the unrotated loading matrix. Emphasizing differences in loadings facilitates interpretation of a factor by making unambiguous the variables that correlate with it.

The numbers in the transformation matrix have a spatial interpretation:

$$\boldsymbol{\Lambda} = \begin{bmatrix} \cos\Psi & -\sin\Psi \\ \sin\Psi & \cos\Psi \end{bmatrix} \qquad (13.8)$$
The transformation matrix is a matrix of sines and cosines of an angle Ψ. For the example, the angle is approximately 19°. That is, cos 19° ≈ .946 and sin 19° ≈ .325. Geometrically, this corresponds to a 19° swivel of the factor axes about the origin. Greater detail regarding the geometric meaning of rotation is in Section 13.5.2.3.
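Applying the transformation is a single matrix multiplication. A minimal NumPy sketch (our own check; a full varimax implementation would search for the angle rather than assume it):

```python
import numpy as np

# Unrotated loadings from Equation 13.6 (rows: COST, LIFT, DEPTH, POWDER).
A_unrot = np.array([
    [-.400,  .900],
    [ .251, -.947],
    [ .932,  .348],
    [ .956,  .286],
])

# Equation 13.8: rotation through an angle psi (about 19 degrees here).
psi = np.deg2rad(19)
Lam = np.array([[np.cos(psi), -np.sin(psi)],
                [np.sin(psi),  np.cos(psi)]])

A_rot = A_unrot @ Lam   # Equation 13.7
print(A_rot.round(3))   # high loadings become higher, low loadings lower
```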
13.4.3 Communalities, Variance, and Covariance

Once the rotated loading matrix is available, other relationships are found, as in Table 13.4. The communality for a variable is the variance accounted for by the factors. It is the squared multiple correlation of the variable as predicted from the factors. Communality is the sum of squared loadings (SSL) for a variable across factors. In Table 13.4, the communality for COST is (-.086)² + .981² = .970. That is, 97% of the variance in COST is accounted for by Factor 1 plus Factor 2.

The proportion of variance in the set of variables accounted for by a factor is the SSL for the factor divided by the number of variables (if rotation is orthogonal).⁴ For the first factor, the proportion of variance is [(-.086)² + (-.071)² + .994² + .997²]/4 = 1.994/4 = .50. Fifty percent of the variance in the variables is accounted for by the first factor. The second factor accounts for 48% of the variance in the variables and, because rotation is orthogonal, the two factors together account for 98% of the variance in the variables.

The proportion of variance in the solution accounted for by a factor (the proportion of covariance) is the SSL for the factor divided by the sum of communalities (or, equivalently, the sum of the SSLs). The first factor accounts for 51% of the variance in the solution (1.994/3.915) while the second factor accounts for 49% of the variance in the solution (1.919/3.915). The two factors together account for all of the covariance.

The reproduced correlation matrix for the example is generated using Equation 13.5:
$$\hat{\mathbf{R}} = \begin{bmatrix} -.086 & .981 \\ -.071 & -.977 \\ .994 & .026 \\ .997 & -.040 \end{bmatrix} \begin{bmatrix} -.086 & -.071 & .994 & .997 \\ .981 & -.977 & .026 & -.040 \end{bmatrix} = \begin{bmatrix} .970 & -.953 & -.059 & -.125 \\ -.953 & .960 & -.098 & -.033 \\ -.059 & -.098 & .989 & .990 \\ -.125 & -.033 & .990 & .996 \end{bmatrix}$$

Notice that the reproduced correlation matrix differs slightly from the original correlation matrix. The difference between the original and reproduced correlation matrices is the residual correlation matrix:

$$\mathbf{R}_{\text{res}} = \mathbf{R} - \hat{\mathbf{R}} \qquad (13.9)$$

The residual correlation matrix is the difference between the observed correlation matrix and the reproduced correlation matrix.
Table 13.4 Relationships Among Loadings, Communalities, SSLs, Variance, and Covariance of Orthogonally Rotated Factors

                           Factor 1   Factor 2   Communalities (h²)
COST                        -.086       .981        .970
LIFT                        -.071      -.977        .960
DEPTH                        .994       .026        .989
POWDER                       .997      -.040        .996
SSLs                        1.994      1.919       3.915
Proportion of variance        .50        .48         .98
Proportion of covariance      .51        .49

⁴ For unrotated factors only, the sum of the squared loadings for a factor is equal to the eigenvalue. Once loadings are rotated, the sum of squared loadings is called SSL and is no longer equal to the eigenvalue.
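The quantities in Table 13.4 follow directly from the rotated loadings. A minimal NumPy check:

```python
import numpy as np

# Rotated loadings from Table 13.4 (rows: COST, LIFT, DEPTH, POWDER).
A = np.array([
    [-.086,  .981],
    [-.071, -.977],
    [ .994,  .026],
    [ .997, -.040],
])

h2 = (A ** 2).sum(axis=1)    # communalities: row sums of squared loadings
ssl = (A ** 2).sum(axis=0)   # SSLs: column sums of squared loadings
prop_var = ssl / A.shape[0]  # proportion of variance (orthogonal rotation)
prop_cov = ssl / h2.sum()    # proportion of covariance

print(h2.round(3))        # [.970, .960, .989, .996]
print(prop_var.round(2))  # [.50, .48]
print(prop_cov.round(2))  # [.51, .49]
```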
For the example, with communalities inserted in the positive diagonal of R:

$$\mathbf{R}_{\text{res}} = \begin{bmatrix} .970 & -.953 & -.055 & -.130 \\ -.953 & .960 & -.091 & -.036 \\ -.055 & -.091 & .989 & .990 \\ -.130 & -.036 & .990 & .996 \end{bmatrix} - \begin{bmatrix} .970 & -.953 & -.059 & -.125 \\ -.953 & .960 & -.098 & -.033 \\ -.059 & -.098 & .989 & .990 \\ -.125 & -.033 & .990 & .996 \end{bmatrix} = \begin{bmatrix} .000 & .000 & .004 & -.005 \\ .000 & .000 & .007 & -.003 \\ .004 & .007 & .000 & .000 \\ -.005 & -.003 & .000 & .000 \end{bmatrix}$$
In a "good" FA, the numbers in the residual correlation matrix are small because there is little difference between the original correlation matrix and the correlation matrix generated from factor loadings.
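Both matrices are one line of algebra each. A minimal NumPy sketch of Equations 13.5 and 13.9:

```python
import numpy as np

# Rotated loadings (Table 13.4) and observed R with communalities
# inserted in the positive diagonal.
A = np.array([
    [-.086,  .981],
    [-.071, -.977],
    [ .994,  .026],
    [ .997, -.040],
])
R = np.array([
    [ .970, -.953, -.055, -.130],
    [-.953,  .960, -.091, -.036],
    [-.055, -.091,  .989,  .990],
    [-.130, -.036,  .990,  .996],
])

R_hat = A @ A.T        # Equation 13.5: reproduced correlation matrix
R_res = R - R_hat      # Equation 13.9: residual correlation matrix
print(R_res.round(3))  # near zero everywhere in a "good" FA
```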
13.4.4 Factor Scores

Scores on factors can be predicted for each case once the loading matrix is available. Regression-like coefficients are computed for weighting variable scores to produce factor scores. Because R⁻¹ is the inverse of the matrix of correlations among variables and A is the matrix of correlations between factors and variables, Equation 13.10 for factor score coefficients is similar to Equation 5.6 for regression coefficients in multiple regression:

$$\mathbf{B} = \mathbf{R}^{-1}\mathbf{A} \qquad (13.10)$$

Factor score coefficients for estimating factor scores from variable scores are a product of the inverse of the correlation matrix and the factor loading matrix. For the example:⁵

$$\mathbf{B} = \begin{bmatrix} 25.485 & 22.689 & -31.655 & 35.479 \\ 22.689 & 21.386 & -24.831 & 28.312 \\ -31.655 & -24.831 & 99.917 & -103.950 \\ 35.479 & 28.312 & -103.950 & 109.567 \end{bmatrix} \begin{bmatrix} -.087 & .981 \\ -.072 & -.978 \\ .994 & .027 \\ .997 & -.040 \end{bmatrix} = \begin{bmatrix} 0.082 & -0.537 \\ 0.054 & 0.461 \\ 0.190 & 0.087 \\ 0.822 & -0.074 \end{bmatrix}$$
To estimate a skier's score for the first factor, all of the skier's scores on variables are standardized; then the standardized score on COST is weighted by 0.082, LIFT by 0.054, DEPTH by 0.190, and POWDER by 0.822, and the results are added. In matrix form,

$$\mathbf{F} = \mathbf{Z}\mathbf{B} \qquad (13.11)$$
Factor scores are a product of s tandardized scores on variables and factor score coefficients. For the example:
F=
5
l-122
0.75 0.61 - 0.95 0.82
1.14 - 1.02 - 0.78 0.98 - 0.30
1.15 0.92 - 0.36 - 1.20 - 0.51
1 12 l.01 - o.45 - l.08 - 0.60
1141['"' Q~'l l . -1161
1.01 - 0.47 - l.01 - 0.67
0.054 0.190 0.822
- 0.461 0.087 = - 0.074
0.88 0.69 - 0.99 0.58
⁵ The numbers in B are different from the factor score coefficients generated by computer for the small data set. The difference is due to rounding error following inversion of a multicollinear correlation matrix. Note also that the A matrix contains considerable rounding error.
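Equations 13.10 and 13.11 translate directly into code. A minimal NumPy sketch (using np.linalg.solve rather than an explicit inverse; as footnote 5 warns, R here is nearly singular, so results can differ noticeably from the hand-worked values):

```python
import numpy as np

R = np.array([       # observed correlation matrix
    [1.000, -.953, -.055, -.130],
    [-.953, 1.000, -.091, -.036],
    [-.055, -.091, 1.000,  .990],
    [-.130, -.036,  .990, 1.000],
])
A = np.array([       # rotated loadings (Table 13.4)
    [-.086,  .981],
    [-.071, -.977],
    [ .994,  .026],
    [ .997, -.040],
])
Z = np.array([       # standardized scores on the variables
    [-1.22,  1.14,  1.15,  1.14],
    [ 0.75, -1.02,  0.92,  1.01],
    [ 0.61, -0.78, -0.36, -0.47],
    [-0.95,  0.98, -1.20, -1.01],
    [ 0.82, -0.30, -0.51, -0.67],
])

B = np.linalg.solve(R, A)  # Equation 13.10: B = R^{-1} A
F = Z @ B                  # Equation 13.11: factor scores
print(F.round(2))
```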
The first skier has an estimated standard score of 1.12 on the first factor and -1.16 on the second factor, and so on for the other four skiers. The first skier strongly values both the snow factor and the resort factor, one positive and the other negative (indicating stronger importance assigned to speed of LIFT). The second skier values both the snow factor and the resort factor (with more value placed on COST than LIFT); the third skier places more value on resort conditions (particularly COST) and less value on snow conditions, and so forth. The sum of standardized factor scores across skiers for a single factor is zero.

Predicting scores on variables from scores on factors is also possible. The equation for doing so is
$$\mathbf{Z} = \mathbf{F}\mathbf{A}' \qquad (13.12)$$

Predicted standardized scores on variables are a product of scores on factors weighted by factor loadings. For example:
$$\hat{\mathbf{Z}} = \begin{bmatrix} 1.12 & -1.16 \\ 1.01 & 0.88 \\ -0.45 & 0.69 \\ -1.08 & -0.99 \\ -0.60 & 0.58 \end{bmatrix} \begin{bmatrix} -.086 & -.072 & .994 & .997 \\ .981 & -.978 & .027 & -.040 \end{bmatrix} = \begin{bmatrix} -1.23 & 1.05 & 1.08 & 1.16 \\ 0.78 & -0.93 & 1.03 & 0.97 \\ 0.72 & -0.64 & -0.43 & -0.48 \\ -0.88 & 1.05 & -1.10 & -1.04 \\ 0.62 & -0.52 & -0.58 & -0.62 \end{bmatrix}$$
That is, the first skier (the first row of Z) is predicted to have a standardized score of -1.23 on COST, 1.05 on LIFT, 1.08 on DEPTH, and 1.16 on POWDER. Like the reproduced correlation matrix, these values are similar to the observed values if the FA captures the relationship among the variables. It is helpful to see these values written out because they provide an insight into how scores on variables are conceptualized in factor analysis. For example, for the first skier:

-1.23 = -.086(1.12) + .981(-1.16)
 1.05 = -.072(1.12) - .978(-1.16)
 1.08 =  .994(1.12) + .027(-1.16)
 1.16 =  .997(1.12) - .040(-1.16)

Or, in algebraic form (following the pattern of the numeric equations above), $z_{\text{COST}} = a_{11}F_1 + a_{12}F_2$, and similarly for each of the other variables.
A score on an observed variable is conceptualized as a properly weighted and summed combination of the scores on factors that underlie it. The researcher believes that each participant has the same latent factor structure, but different scores on the factors themselves. A particular participant's score on an observed variable is produced as a weighted combination of that participant's scores on the underlying factors.
13.4.5 Oblique Rotation

All the relationships mentioned thus far are for orthogonal rotation. Most of the complexities of orthogonal rotation remain and several others are added when oblique (correlated) rotation is used. Consult Table 13.1 for a listing of additional matrices and a hint of the discussion to follow. IBM SPSS FACTOR is run on the data from Table 13.2 using the default option for oblique rotation (cf. Section 13.5.2.2) to get values for the pattern matrix, A, and factor-score coefficients, B.

In oblique rotation, the loading matrix becomes the pattern matrix. Values in the pattern matrix, when squared, represent the unique contribution of each factor to the variance of each variable but do not include segments of variance that come from overlap between correlated factors. For the example, the pattern matrix following oblique rotation is
$$\mathbf{A} = \begin{bmatrix} -.079 & .981 \\ -.078 & -.978 \\ .994 & .033 \\ .997 & -.033 \end{bmatrix}$$

The first factor makes a unique contribution of (-.079)² to the variance in COST, (-.078)² to LIFT, .994² to DEPTH, and .997² to POWDER.

Factor-score coefficients following oblique rotation are also found:
$$\mathbf{B} = \begin{bmatrix} 0.104 & 0.584 \\ 0.081 & -0.421 \\ 0.159 & -0.020 \\ 0.856 & 0.034 \end{bmatrix}$$
Applying Equation 13.11 to produce factor scores results in the following values:

$$\mathbf{F} = \begin{bmatrix} -1.22 & 1.14 & 1.15 & 1.14 \\ 0.75 & -1.02 & 0.92 & 1.01 \\ 0.61 & -0.78 & -0.36 & -0.47 \\ -0.95 & 0.98 & -1.20 & -1.01 \\ 0.82 & -0.30 & -0.51 & -0.67 \end{bmatrix} \begin{bmatrix} 0.104 & 0.584 \\ 0.081 & -0.421 \\ 0.159 & -0.020 \\ 0.856 & 0.034 \end{bmatrix} = \begin{bmatrix} 1.11 & -1.18 \\ 1.01 & 0.88 \\ -0.46 & 0.68 \\ -1.07 & -0.98 \\ -0.59 & 0.59 \end{bmatrix}$$
Once the factor scores are determined, correlations among factors can be obtained. Among the equations used for this purpose is

$$\boldsymbol{\Phi} = \left(\frac{1}{N-1}\right)\mathbf{F}'\mathbf{F} \qquad (13.13)$$

One way to compute correlations among factors is from cross-products of standardized factor scores divided by the number of cases minus one.
The factor correlation matrix is a standard part of computer output following oblique rotation. For the example:
$$\boldsymbol{\Phi} = \frac{1}{4} \begin{bmatrix} 1.11 & 1.01 & -0.46 & -1.07 & -0.59 \\ -1.18 & 0.88 & 0.68 & -0.98 & 0.59 \end{bmatrix} \begin{bmatrix} 1.11 & -1.18 \\ 1.01 & 0.88 \\ -0.46 & 0.68 \\ -1.07 & -0.98 \\ -0.59 & 0.59 \end{bmatrix} = \begin{bmatrix} 1.00 & -.01 \\ -.01 & 1.00 \end{bmatrix}$$
The correlation between the first and second factor is quite low, -.01. For this example, there is almost no relationship between the two factors, although considerable correlation could have been produced had it been warranted. Usually, one uses orthogonal rotation in a case like this because the complexities introduced by oblique rotation are not warranted by such a low correlation among factors.

However, if oblique rotation is used, the structure matrix, C, is the correlation between variables and factors. These correlations assess the unique relationship between the variable and the factor (in the pattern matrix) plus the relationship between the variable and the overlapping variance among the factors. The equation for the structure matrix is

$$\mathbf{C} = \mathbf{A}\boldsymbol{\Phi} \qquad (13.14)$$
The structure matrix is a product of the pattern matrix and the factor correlation matrix. For example:

$$\mathbf{C} = \begin{bmatrix} -.079 & .981 \\ -.078 & -.978 \\ .994 & .033 \\ .997 & -.033 \end{bmatrix} \begin{bmatrix} 1.00 & -.01 \\ -.01 & 1.00 \end{bmatrix} = \begin{bmatrix} -.069 & .982 \\ -.088 & -.977 \\ .994 & .023 \\ .997 & -.043 \end{bmatrix}$$
COST, LIFT, DEPTH, and POWDER correlate -.069, -.088, .994, and .997 with the first factor and .982, -.977, .023, and -.043 with the second factor, respectively.

There is some debate as to whether one should interpret the pattern matrix or the structure matrix following oblique rotation. The structure matrix is appealing because it is readily understood. However, the correlations between variables and factors are inflated by any overlap between factors. The problem becomes more severe as the correlations among factors increase, and it may be hard to determine which variables are related to a factor. On the other hand, the pattern matrix contains values representing the unique contributions of each factor to the variance in the variables. Shared variance is omitted (as it is with standard multiple regression), but the set of variables that composes a factor is usually easier to see. If factors are very highly correlated, it may appear that no variables are related to them because there is almost no unique variance once overlap is omitted. Most researchers interpret and report the pattern matrix rather than the structure matrix. However, if the researcher reports either the structure or the pattern matrix and also Φ, then the interested reader can generate the other using Equation 13.14 as desired.

In oblique rotation, R̂ is produced as follows:
$$\hat{\mathbf{R}} = \mathbf{C}\mathbf{A}' \qquad (13.15)$$

The reproduced correlation matrix is a product of the structure matrix and the transpose of the pattern matrix. Once the reproduced correlation matrix is available, Equation 13.9 is used to generate the residual correlation matrix to diagnose adequacy of fit in FA.
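The oblique bookkeeping is equally compact in code. A minimal NumPy sketch of Equations 13.14 and 13.15:

```python
import numpy as np

# Pattern matrix and factor correlation matrix after oblique rotation.
A = np.array([
    [-.079,  .981],
    [-.078, -.978],
    [ .994,  .033],
    [ .997, -.033],
])
Phi = np.array([[1.00, -.01],
                [-.01, 1.00]])

C = A @ Phi          # Equation 13.14: structure matrix
R_hat = C @ A.T      # Equation 13.15: reproduced correlation matrix
print(C.round(3))
print(R_hat.round(3))
```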
13.4.6 Computer Analyses of Small-Sample Example

A two-factor principal factor analysis (PFA) with varimax rotation using the example is shown for IBM SPSS FACTOR and SAS FACTOR in Tables 13.5 and 13.6.
Table 13.5 Syntax and IBM SPSS FACTOR Output for Factor Analysis on Sample Data of Table 13.2

FACTOR
  /VARIABLES COST LIFT DEPTH POWDER
  /MISSING LISTWISE
  /ANALYSIS COST LIFT DEPTH POWDER
  /PRINT INITIAL EXTRACTION ROTATION
  /CRITERIA MINEIGEN(1) ITERATE(25)
  /EXTRACTION PAF
  /CRITERIA ITERATE(25)
  /ROTATION VARIMAX
  /METHOD=CORRELATION.

Communalities
            Initial   Extraction
COST        .961      .970
LIFT        .953      .960
DEPTH       .990      .989
POWDER      .991      .996
Extraction Method: Principal Axis Factoring.

Total Variance Explained (Initial Eigenvalues)
Figure 13.5 Scatterplot of factor scores with pairs of factors (2 and 3) as axes following oblique rotation.
The solution that is evaluated, interpreted, and reported is the run with principal factors extraction, varimax rotation, and five factors. In other words, after "trying out" oblique rotation, the decision is made to interpret the earlier run with orthogonal rotation. Syntax for this run is in Table 13.14.

Communalities are inspected to see if the variables are well defined by the solution. Communalities indicate the percent of variance in a variable that overlaps variance in the factors. As seen in Table 13.16, communality values for a number of variables are quite low (e.g., FOULLANG). Seven of the variables have communality values lower than .2, indicating considerable heterogeneity among the variables. It should be recalled, however, that factorial purity was not a consideration in development of the BSRI.
Table 13.16 Communality Values (Five Factors). Selected Output from SAS FACTOR PFA (See Table 13.14 for Syntax)

Final Communality Estimates: Total = 16.670574

sensitiv    undstand    compass     leaderab    soothe      risk        decide      selfsuff
0.42057011  0.55491827  0.67643224  0.58944801  0.39532766  0.32659884  0.39618000  0.66277946
Variances and covariances of the IVs are in Φ (phi), an r × r matrix. Therefore, the parameter matrices of the model are B, γ, and Φ. Unknown parameters in these matrices need to be estimated. The vectors of dependent variables, η, and independent variables, ξ, are not estimated. The diagram for the example is translated into the Bentler-Weeks model, with r = 7 and q = 5, as below.
$$\boldsymbol{\eta} = \mathbf{B}\boldsymbol{\eta} + \boldsymbol{\gamma}\boldsymbol{\xi}$$

$$\begin{bmatrix} \text{V1 or } \eta_1 \\ \text{V2 or } \eta_2 \\ \text{V3 or } \eta_3 \\ \text{V4 or } \eta_4 \\ \text{F2 or } \eta_5 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & * \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \text{V1 or } \eta_1 \\ \text{V2 or } \eta_2 \\ \text{V3 or } \eta_3 \\ \text{V4 or } \eta_4 \\ \text{F2 or } \eta_5 \end{bmatrix} + \begin{bmatrix} 0 & * & 1 & 0 & 0 & 0 & 0 \\ 0 & * & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ * & * & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \text{V5 or } \xi_1 \\ \text{F1 or } \xi_2 \\ \text{E1 or } \xi_3 \\ \text{E2 or } \xi_4 \\ \text{E3 or } \xi_5 \\ \text{E4 or } \xi_6 \\ \text{D2 or } \xi_7 \end{bmatrix}$$
Notice that η is on both sides of the equation. This is because DVs can predict one another in SEM. The diagram and matrix equations are identical. Notice that the asterisks in Figure 14.4 directly correspond to the asterisks in the matrices, and these matrix equations directly correspond to simple regression equations. In the matrix equations, the number 1 indicates that we have "fixed" the parameter, either a variance or a path coefficient, to the specific value of 1. Parameters are generally fixed for identification purposes. Identification will be discussed in more detail in Section 14.5.1. Parameters can be fixed to any number; most often, however, parameters are fixed to 1 or 0. The parameters that are fixed to 0 are also included in the path diagram but are easily overlooked because the 0 parameters are represented by the absence of a line in the diagram.

Carefully compare the model in Figure 14.4 with this matrix equation. The 5 × 1 vector of values to the left of the equal sign, the eta (η) vector, is a vector of DVs listed in the order indicated: NUMYRS (V1), DAYSKI (V2), SNOWSAT (V3), FOODSAT (V4), and SKISAT (F2). The next matrix, just to the right of the equal sign, is a 5 × 5 matrix of regression coefficients among the DVs. The DVs are in the same order as above. The matrix contains 23 zeros, one 1, and one *. Remember that matrix multiplication involves cross multiplying and then summing the elements in the first row of the beta (B) matrix with the first column in the eta (η) matrix, and so forth (consult Appendix A as necessary). The zeros in the first, second, and fifth rows of the beta matrix indicate that no regression coefficients are to be estimated between DVs for V1, V2, and F2. The 1 at the end of the third row is the regression coefficient between F2 and SNOWSAT that was fixed to 1. The * at the end of the fourth row is the regression coefficient between F2 and V4 that is to be estimated.

Now look to the right of the plus sign. The 5 × 7 gamma matrix contains the regression coefficients that are used to predict the DVs from the IVs. The five DVs that are associated with the rows of this matrix are in the same order as above. The seven IVs that identify the columns are, in the order indicated, SENSEEK (V5), LOVESKI (F1), the four Es (errors) for V1 to V4, and the D (disturbance) of F2. The 7 × 1 vector of IVs is in the same order. The first row of the γ (gamma) matrix times the ξ (xi) vector produces the equation for NUMYRS. The * is the regression coefficient for predicting NUMYRS from LOVESKI (F1), and the 1 is the fixed regression coefficient for the relationship between NUMYRS and its E1. For example, consider the equation for NUMYRS (V1), reading from the first row in the matrices:
$$\eta_1 = 0\eta_1 + 0\eta_2 + 0\eta_3 + 0\eta_4 + 0\eta_5 + 0\xi_1 + {*}\xi_2 + 1\xi_3 + 0\xi_4 + 0\xi_5 + 0\xi_6 + 0\xi_7$$

or, by dropping the zero-weighted products and using the diagram's notation,

$$\text{V1} = {*}\text{F1} + \text{E1}$$
Continue in this fashion for the next four rows, to be sure you understand their relationship to the diagrammed model. In the Bentler-Weeks model, only IVs have variances and covariances, and these are in Φ (phi), an r × r matrix. For the example, with seven IVs:
(rows and columns ordered V5 or ξ1, F1 or ξ2, E1 or ξ3, E2 or ξ4, E3 or ξ5, E4 or ξ6, and D2 or ξ7)

$$\boldsymbol{\Phi} = \begin{bmatrix} * & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & * & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & * & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & * & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & * & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & * \end{bmatrix}$$
This 7 × 7 phi matrix contains the variances and covariances that are to be estimated for the IVs. The *s on the diagonal indicate the variances to be estimated for SENSEEK (V5), E1, E2, E3, E4, and D2. The 1 in the second row corresponds to the variance of LOVESKI (F1) that was set to 1. There are no covariances among IVs to be estimated, as indicated by the zeros in all the off-diagonal positions.
14.4.4 Model Estimation

The modeling process begins with initial guesses (start values) for the parameters to be estimated (indicated with an *). Given the capabilities of current software programs, it is generally perfectly reasonable to allow the computer algorithm to select initial start values by simply specifying an "*". Once in a great while an estimation procedure may struggle to converge. If that happens, it is fine to supply a start value; ideally this value is a close approximation to the final estimated value. However, with most typical models it is rarely necessary to supply specific start values. Computer program-generated start values are indicated with asterisks in the diagrams and in each of the three parameter matrices in the Bentler-Weeks model, B̂, γ̂, and Φ̂, that follow. The ^ (hat) over the matrices indicates that these are matrices of estimated parameters.

The B̂ (beta hat) matrix is the matrix of regression coefficients between DVs where start values have been substituted for * (the parameters to be estimated). For the example:
$$\hat{\mathbf{B}} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1.00 \\ 0 & 0 & 0 & 0 & .83 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$
The matrix containing the start values for regression coefficients between DVs and IVs is γ̂ (gamma hat). For the example:

$$\hat{\boldsymbol{\gamma}} = \begin{bmatrix} 0 & .80 & 1 & 0 & 0 & 0 & 0 \\ 0 & .89 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ .39 & .51 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$
Finally, the matrix containing the start values for the variances and covariances of the IVs is Φ̂ (phi hat). For the example:

$$\hat{\boldsymbol{\Phi}} = \begin{bmatrix} 1.00 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1.00 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & .39 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 10.68 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & .42 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & .73 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1.35 \end{bmatrix}$$
To calculate the estimated population covariance matrix implied by the parameter estimates, selection matrices are first used to pull the measured variables out of the full parameter matrices. (Remember, the parameter matrices have both measured and latent variables as components.) The resulting vector is labeled Y,

$$\mathbf{Y} = \mathbf{G}_y\boldsymbol{\eta} \qquad (14.11)$$
where Y is our name for those measured variables that are dependent. The independent measured variables are selected in a similar manner,
$$\mathbf{X} = \mathbf{G}_x\boldsymbol{\xi} \qquad (14.12)$$

where X is our name for the independent measured variables; for the example, X = V5. Computation of the estimated population covariance matrix proceeds by rewriting the basic structural modeling Equation 14.10 as:³

$$\boldsymbol{\eta} = (\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\gamma}\boldsymbol{\xi} \qquad (14.13)$$

where I is simply an identity matrix the same size as B. This equation expresses the DVs as a linear combination of the IVs. At this point, the estimated population covariance matrix for the DVs, Σ̂_yy, is estimated using:

$$\hat{\boldsymbol{\Sigma}}_{yy} = \mathbf{G}_y(\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\gamma}\hat{\boldsymbol{\Phi}}\boldsymbol{\gamma}'[(\mathbf{I} - \mathbf{B})^{-1}]'\mathbf{G}_y' \qquad (14.14)$$

For the example:

$$\hat{\boldsymbol{\Sigma}}_{yy} = \begin{bmatrix} 1.04 & .72 & .41 & .34 \\ .72 & 11.48 & .45 & .38 \\ .41 & .45 & 2.18 & 1.46 \\ .34 & .38 & 1.46 & 1.95 \end{bmatrix}$$
The estimated population cova riance matrix between IVs and DVs is obtained similarly by: (14.15) 3
This rewritten equation is often called the "reduced form."
For the example:

$$\hat{\boldsymbol{\Sigma}}_{yx} = \begin{bmatrix} 0 \\ 0 \\ .39 \\ .33 \end{bmatrix}$$
Finally (phew!), the estimated population covariance matrix between IVs is estimated:

$$\hat{\boldsymbol{\Sigma}}_{xx} = \mathbf{G}_x\hat{\boldsymbol{\Phi}}\mathbf{G}_x' \qquad (14.16)$$

For the example:

$$\hat{\boldsymbol{\Sigma}}_{xx} = 1.00$$
In practice, a "super G" matrix is used so that all the covariances are estimated in one step. The components of Σ̂ are then combined to produce the estimated population covariance matrix after one iteration. For the example, using initial start values supplied by EQS:

$$\hat{\boldsymbol{\Sigma}} = \begin{bmatrix} 1.04 & .72 & .41 & .34 & 0 \\ .72 & 11.48 & .45 & .38 & 0 \\ .41 & .45 & 2.18 & 1.46 & .39 \\ .34 & .38 & 1.46 & 1.95 & .33 \\ 0 & 0 & .39 & .33 & 1.00 \end{bmatrix}$$
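The implied covariance matrix is straightforward to compute from the Bentler-Weeks matrices. A minimal NumPy sketch using the start values above (selection of the measured DVs is done by slicing rather than by an explicit G matrix):

```python
import numpy as np

# Start-value parameter matrices (DV order: V1, V2, V3, V4, F2;
# IV order: V5, F1, E1, E2, E3, E4, D2).
B = np.zeros((5, 5))
B[2, 4] = 1.00            # SNOWSAT <- F2, fixed to 1
B[3, 4] = 0.83            # FOODSAT <- F2, start value
Gamma = np.array([
    [0.00, 0.80, 1, 0, 0, 0, 0],
    [0.00, 0.89, 0, 1, 0, 0, 0],
    [0.00, 0.00, 0, 0, 1, 0, 0],
    [0.00, 0.00, 0, 0, 0, 1, 0],
    [0.39, 0.51, 0, 0, 0, 0, 1],
], dtype=float)
Phi = np.diag([1.00, 1.00, .39, 10.68, .42, .73, 1.35])

# Reduced form (Equation 14.13), then Equation 14.14.
IB_inv = np.linalg.inv(np.eye(5) - B)
Sigma_yy = IB_inv @ Gamma @ Phi @ Gamma.T @ IB_inv.T
print(Sigma_yy[:4, :4].round(2))  # rows/columns for the measured DVs V1-V4
```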
After initial start values a re calcula ted, parameter estimates are changed incrementally (itera tions con tinue) until the prespecified (in this case, maximum likelihood) function (Section 14.5.2) is minimized (converges). After six itera tions, the maximum likelihood function is at a minimu m and the solution con verges. The final estimated parameters are p resented for comparison pu rposes in the B, 'Y, and cl> matrices; these unstandardized parameters a re also presented in Figure 14.5.
$$\hat{\mathbf{B}} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1.00 \\ 0 & 0 & 0 & 0 & .70 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$

$$\hat{\boldsymbol{\gamma}} = \begin{bmatrix} 0 & .81 & 1.00 & 0 & 0 & 0 & 0 \\ 0 & .86 & 0 & 1.00 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1.00 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1.00 & 0 \\ .39 & .62 & 0 & 0 & 0 & 0 & 1.00 \end{bmatrix}$$

$$\hat{\boldsymbol{\Phi}} = \begin{bmatrix} 1.00 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1.00 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & .34 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 10.72 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & .52 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & .51 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & .69 \end{bmatrix}$$
Figure 14.5 Final model for small-sample example with standardized (and unstandardized) coefficients.
The final estimated population covariance matrix is given by Σ̂. For the example:

$$\hat{\boldsymbol{\Sigma}} = \begin{bmatrix} 1.00 & .70 & .51 & .35 & 0 \\ .70 & 11.47 & .54 & .38 & 0 \\ .51 & .54 & 1.76 & .87 & .39 \\ .35 & .38 & .87 & 1.15 & .27 \\ 0 & 0 & .39 & .27 & 1.00 \end{bmatrix}$$
The final residual matrix is:

$$\mathbf{S} - \hat{\boldsymbol{\Sigma}} = \begin{bmatrix} .00 & .00 & .12 & .08 & .30 \\ .00 & .00 & .08 & .06 & .21 \\ .12 & .08 & .12 & .08 & .15 \\ .08 & .06 & .08 & .06 & .11 \\ .30 & .21 & .15 & .11 & .00 \end{bmatrix}$$
14.4.5 Model Evaluation

A χ² statistic is computed based upon the function minimum when the solution has converged. The minimum of the function was .09432 in this example. This value is multiplied by N - 1 (N = number of participants) to yield the χ² value:

(.09432)(99) = 9.337
This χ² is evaluated with degrees of freedom equal to the difference between the total number of degrees of freedom and the number of parameters estimated. The degrees of freedom in SEM are equal to the amount of unique information in the sample variance/covariance matrix (variances and covariances) minus the number of parameters in the model to be estimated (regression coefficients and variances and covariances of independent variables). In a model with a few variables, it is easy to count the number of variances and covariances; however, in larger models, the number of data points is calculated as:

$$\text{number of data points} = \frac{p(p+1)}{2} \qquad (14.17)$$

where p equals the number of measured variables.
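These bookkeeping steps, and the χ² test itself, are easy to script. A minimal sketch in Python (SciPy's chi-squared survival function supplies the p value):

```python
from scipy import stats

p = 5                               # measured variables
data_points = p * (p + 1) // 2      # Equation 14.17: 15 data points
free_parameters = 11                # 5 regression coefficients + 6 variances
df = data_points - free_parameters  # 4 degrees of freedom

chi_square = 0.09432 * (100 - 1)    # function minimum times (N - 1)
p_value = stats.chi2.sf(chi_square, df)
print(round(chi_square, 3), df, round(p_value, 3))  # 9.338 4 0.053
```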
In this example with 5 measured variables, there are (5(6)/2 =) 15 data points (5 variances and 10 covariances). The estimated model includes 11 parameters (5 regression coefficients and 6 variances), so χ² is evaluated with 4 dfs, χ²(4, N = 99) = 9.337, p = .053. Because the goal is to develop a model that fits the data, a nonsignificant chi-square is desired. This χ² is nonsignificant, so we conclude that the model fits the data. However, chi-square values depend on sample sizes; in models with large samples, trivial differences often cause the χ² to be significant solely because of sample size. For this reason, many fit indices have been developed that look at model fit while eliminating or minimizing the effect of sample size. All fit indices for this model indicate an adequate, but not spectacular, fit. Fit indices are discussed fully in Section 14.5.3.

The model fits, but what does it mean? The hypothesis is that the observed covariances among the measured variables arose because of the relationships between variables specified in the model; because the chi-square is not significant, we conclude that we should retain our hypothesized model.

Next, researchers usually examine the statistically significant relationships within the model. If the unstandardized coefficients in the three parameter matrices are divided by their respective standard errors, a z-score is obtained for each parameter that is evaluated in the usual manner⁴:
$$z = \frac{\text{parameter estimate}}{\text{std error for estimate}} \qquad (14.18)$$
For NUMYRS predicted from LOVESKI: 0.809/0.104 = 7.76, p < .05
For DAYSKI predicted from LOVESKI: 0.865/0.105 = 8.25, p < .05
For FOODSAT predicted from SKISAT: 0.701/0.127 = 5.51, p < .05
For SKISAT predicted from SENSEEK: 0.389/0.108 = 3.59, p < .05
For SKISAT predicted from LOVESKI: 0.625/0.128 = 4.89, p < .05
Because of differences in scales, it is sometimes difficult to interpret unstandardized regression coefficients; therefore, researchers often examine standardized coefficients. Both the standardized and unstandardized regression coefficients for the final model are in Figure 14.5. The unstandardized coefficients are in parentheses. The paths from the factors to the variables are just standardized factor loadings. It could be concluded that number of years skied (NUMYRS) is a significant indicator of Love of Skiing (LOVESKI); the greater the Love of Skiing, the higher the number of years skied. Number of total days skied (DAYSKI) is a significant indicator of Love of Skiing (i.e., greater Love of Skiing predicts more total days skied) because there was an a priori hypothesis that stated a positive relationship. Degree of food satisfaction (FOODSAT) is a significant indicator of Ski Trip Satisfaction (SKISAT); higher Ski Trip Satisfaction predicts greater satisfaction with the food. Because the path from SKISAT to SNOWSAT is fixed to 1 for identification, a standard error is not calculated. If this standard error is desired, a second run is performed with the FOODSAT path fixed instead. Higher SENSEEK predicts higher SKISAT. Lastly, greater Love of Skiing (LOVESKI) significantly predicts Ski Trip Satisfaction (SKISAT) because this relationship is also tested as an a priori, unidirectional hypothesis.
⁴ The standard errors are derived from the inverse of the information matrix.
14.4.6 Computer Analysis of Small-Sample Example

Tables 14.2, 14.4, and 14.5 show syntax and minimal selected output for computer analyses of the data in Table 14.1 using EQS, LISREL, and AMOS, respectively. The syntax and output for the programs are all quite different. Each of these programs offers the option of using a Windows "point and click" method in addition to the syntax approach. Additionally, EQS, AMOS, and LISREL allow for analyses based on a diagram. The sample example is shown only using the syntax approach. The "point and click" method and the diagram specification methods are just special cases of the syntax.
Table 14.2 Structural Equation Model of Small-Sample Example Through EQS 6.3 (Syntax and Selected Output)

/TITLE
  EQS model created by EQS 6 for Windows--c:\JODIE\Papers\smallsample example
/SPECIFICATIONS
  DATA='C:\smallsample example 04.ESS';
  VARIABLES=5; CASES=100; GROUPS=1;
  METHODS=ML; MATRIX=covariance; ANALYSIS=COVARIANCE;
/LABELS
  V1=NUMYRS; V2=DAYSKI; V3=SNOWSAT; V4=FOODSAT; V5=SENSEEK;
  F1=LOVESKI; F2=SKISAT;
/EQUATIONS
  !Love of Skiing Construct
  V1 = *F1 + E1;
  V2 = *F1 + E2;
  !Ski Trip Satisfaction Construct
  V3 = 1F2 + E3;
  V4 = *F2 + E4;
  F2 = *F1 + *V5 + D2;
/VARIANCES
  V5 = *;
  F1 = 1.00;
  E1 to E4 = *;
  D2 = *;
/PRINT
  EFFECT = YES;
  FIT=ALL;
  TABLE=EQUATION;
/LMTEST
/WTEST
/END

MAXIMUM LIKELIHOOD SOLUTION (NORMAL DISTRIBUTION THEORY)

GOODNESS OF FIT SUMMARY FOR METHOD = ML

INDEPENDENCE MODEL CHI-SQUARE = 170.851 ON 10 DEGREES OF FREEDOM

INDEPENDENCE AIC = 150.85057    INDEPENDENCE CAIC = 114.79887
MODEL AIC = 1.33724             MODEL CAIC = -13.08344

CHI-SQUARE = 9.337 BASED ON 4 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS .05320

THE NORMAL THEORY RLS CHI-SQUARE FOR THIS ML SOLUTION IS 8.910.

FIT INDICES
BENTLER-BONETT NORMED FIT INDEX     = .945
BENTLER-BONETT NON-NORMED FIT INDEX = .917
COMPARATIVE FIT INDEX (CFI)         = .967
BOLLEN (IFI) FIT INDEX              = .968
(continued)
545
546 Chapter 14
Table 14.2
(Continued)
MCDONALD (MFI)
FIT INDEX
. 974
LISREL GFI
FIT INDEX
. 965
LISREL AGFI
FIT INDEX
. 870
ROOT MEAN-SQUARE RESIDUAL
(RMR)
. 122
STANDARDIZED RMR
. 111
ROOT MEAN-SQUARE ERROR OF APPROXIMATION (RMSEA) = . 116 90 % CONFIDENCE INTERVAL OF RMSEA (
. 000 ,
. 214)
MEASUREMENT EQUATIONS WITH STANDARD ERRORS AND TEST STATISTICS STATISTICS SIGNIFICANT AT THE 5 % LEVEL ARE MARKED WITH @. NUMYRS
=V1
. 809*F1
+ 1 . 000 El
. 104 7 . 755@ DAYSKI
=V2
. 865*F1
+ 1.000 E2
. 105 8 250@ SNOWSAT
=V3
. 000 F2
+ 1 . 000 E3
FOODSAT
=V4
. 701*F2
+ 1.000 E4
. 127 5 . 511@ CONSTRUCT EQUATIONS WITH STANDARD ERRORS AND TEST STATISTICS STATISTICS SIGNIFICANT AT THE 5 % LEVEL ARE MARKED WITH @. SKI SAT
=F2
. 389*V5
+ . 625*F1
. 108
. 128
3 . 591@
4 . 888@
+ 1 . 000 D2
STANDARDIZED SOLUTION :
R- SQUARED
NUMYRS
=V1
. 809*F1
+ . 588 E1
. 655
DAYSKI
=V2
. 865*F1
+ . 502 E2
. 748
SNOWSAT
=V3
. 839 F2
+ . 544 E3
. 704
FOODSAT
=V4
. 738*F2
+ . 674 E4
SKI SAT
=F2
. 350*V5
+ . 562*F1
. 545
+ . 749 D2
. 438
As seen in Table 14.2, and described in Section 14.4.3, the model is specified in EQS using a series of regression equations. In the /EQUATIONS section, as in ordinary regression, the DV appears on the left side of the equation, the IVs on the right side. Measured variables are referred to by the letter V and the number corresponding to the variable given in the /LABELS section. Errors associated with measured variables are indicated by the letter E and the number of the variable. Factors are referred to with the letter F and a number given in the /LABELS section. The errors, or disturbances, associated with factors are referred to by the letter D and the number corresponding to the factor. An asterisk indicates a parameter to be estimated. Variables included in the equation without asterisks are considered parameters fixed to the value 1. In this example, start values are not specified and are estimated automatically by the program. The variances of IVs are parameters of the model and are indicated in the /VAR paragraph. The data appear as a covariance matrix in the paragraph labeled /MATRIX. In the /PRINT paragraph, FIT=ALL requests all goodness-of-fit indices available. The output is heavily edited. After much diagnostic information (not included here), goodness-of-fit indices are given in the section labeled GOODNESS OF FIT SUMMARY. The independence model chi-square is labeled INDEPENDENCE CHI-SQUARE. The independence chi-square tests the hypothesis that there is no relationship among the variables. This chi-square should always be significant, indicating that there is some relationship among the variables.
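This specification style is particular to EQS, but the same small-sample model can also be sketched in open-source software. For example, the Python package semopy accepts lavaan-style equations; the fragment below is an illustrative sketch, not part of the original analysis, and assumes the five measured variables are in a hypothetical file smallsample.csv:

import pandas as pd
import semopy

# =~ defines a factor by its indicators; ~ is a structural regression,
# paralleling the /EQUATIONS section of the EQS setup
desc = """
LOVESKI =~ NUMYRS + DAYSKI
SKISAT  =~ SNOWSAT + FOODSAT
SKISAT  ~ LOVESKI + SENSEEK
"""

data = pd.read_csv("smallsample.csv")  # hypothetical file name
model = semopy.Model(desc)
model.fit(data)                        # maximum likelihood by default
print(model.inspect())                 # estimates, standard errors, z tests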
CHI-SQUARE is the model chi-square that ideally should be nonsignificant. Several different goodness-of-fit indices are given (cf. Section 14.7.1), beginning with BENTLER-BONETT NORMED FIT INDEX. Significance tests for each parameter of the measurement portion of the model are found in the section labeled MEASUREMENT EQUATIONS WITH STANDARD ERRORS AND TEST STATISTICS. The unstandardized coefficient appears on the first line; immediately below it is the standard error for that parameter. The z score associated with the parameter (the unstandardized coefficient divided by the standard error) is given on the third line. The section labeled CONSTRUCT EQUATIONS WITH STANDARD ERRORS AND TEST STATISTICS contains the unstandardized regression coefficients, standard errors, and z-score significance tests for predicting factors from other factors and measured variables. The standardized parameter estimates appear in the section labeled STANDARDIZED SOLUTION.

LISREL offers two very different methods of specifying models. SIMPLIS uses equations and LISREL employs matrices. Neither program allows the exact model specified in Figure 14.4 to be tested. Underlying both these programs is the LISREL model, which, although similar to the Bentler-Weeks model, employs eight matrices instead of three. The matrices of the LISREL model that correspond to the Bentler-Weeks model are given in Table 14.3. Within the LISREL model, there is no matrix of regression coefficients for predicting latent DVs from measured IVs. To estimate these parameters, a little trick, illustrated in Figure 14.6, is employed. A dummy "latent" variable with one indicator⁵ is specified, in this example SENSEEK. The dummy latent variable then predicts SKISAT. The regression coefficient from the dummy "latent" variable to SENSEEK is fixed to one and the error variance of SENSEEK is fixed at zero. With this modification, the solutions are identical, because SENSEEK = (dummy latent variable) + 0. LISREL uses matrices, rather than equations, to specify the model. Syntax and edited output are presented in Table 14.4. Matrices and commands are given with two-letter specifications defined in Table 14.3. CM with an asterisk indicates analysis of a covariance matrix.
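For reference, the LISREL model summarized in Table 14.3 can be written as two measurement equations and one structural equation; this is the standard LISREL notation, shown here as a reminder rather than reproduced from the table:

$$\mathbf{x} = \boldsymbol{\Lambda}_x \boldsymbol{\xi} + \boldsymbol{\delta} \qquad \mathbf{y} = \boldsymbol{\Lambda}_y \boldsymbol{\eta} + \boldsymbol{\varepsilon} \qquad \boldsymbol{\eta} = \mathbf{B}\boldsymbol{\eta} + \boldsymbol{\Gamma}\boldsymbol{\xi} + \boldsymbol{\zeta}$$

where ξ and η are the latent IVs and DVs, respectively, and the remaining four matrices are the covariance matrices Φ = cov(ξ), Ψ = cov(ζ), Θδ = cov(δ), and Θε = cov(ε).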
Table 14.3 Equivalence of Matrices in Bentler-Weeks and LISREL Model Specifications

Bentler-Weeks B (Beta), the matrix of regression coefficients of DVs predicting other DVs, corresponds to:
1. B (Beta; two-letter specification BE): matrix of regression coefficients of latent DVs predicting other latent DVs
2. Λy (Lambda y; LY): matrix of regression coefficients of measured DVs predicted by latent DVs

Bentler-Weeks Γ (Gamma), the matrix of regression coefficients of DVs predicted by IVs, corresponds to:
1. Γ (Gamma; GA): matrix of regression coefficients of latent DVs predicted by latent IVs
2. Λx (Lambda x; LX): matrix of regression coefficients of measured DVs predicted by latent IVs

Bentler-Weeks Φ (Phi), the matrix of covariances among the IVs, corresponds to:
1. Φ (Phi; PH): matrix of covariances among the latent IVs
2. Ψ (Psi; PS): matrix of covariances of errors associated with latent DVs
3. Θδ (Theta-delta; TD): matrix of covariances among errors associated with measured DVs predicted from latent IVs
4. Θε (Theta-epsilon; TE): matrix of covariances among errors associated with measured DVs predicted from latent DVs

⁵ Note this dummy variable is not a true latent variable. A one-indicator latent variable is simply a measured variable.
Figure 14.6 LISREL adaptation for small-sample example. [Diagram: the dummy latent variable predicts SENSEEK (sensation seeking), with the loading fixed to 1 and the error variance of SENSEEK fixed to E = 0.]
Following LA (for label), the measured variable names are given in the same order as the data. LISREL requires that the DVs appear before the IVs, so the specification SE (select) reorders the variables. The model specification begins with MO. The number of measured DVs is indicated after the key letters NY (number of Ys). The number of measured IVs is specified after the key letters NX (number of Xs). The latent DVs are specified after NE and the latent IVs are specified after NK. Labels are optional, but helpful. The labels for the latent DVs follow the key letters LE and labels for the latent IVs follow the key letters LK.
Table 14.4 Structural Equation Model for Small-Sample Example Through LISREL 8.54 (Syntax and Edited Output)

TI Small Sample Example - LISREL
DA NI=5 NO=100 NG=1 MA=CM
CM
1.00
.70 11.47
.623 .623 1.874
.436 .436 .95 1.173
.3 .21 .54 .38 1.00
LA
NUMYRS DAYSKI SNOWSAT FOODSAT SENSEEK
SE
SNOWSAT FOODSAT NUMYRS DAYSKI SENSEEK
MO NY=2 NX=3 NE=1 NK=2
LE
SKISAT
LK
LOVESKI DUMMY
FR LX(1,1) LX(2,1) LY(2,1)
FI PH(2,1) TD(3,3)
VA 1 LX(3,2) LY(1,1) PH(1,1)
OU SC SE TV RS SS MI ND=3

LISREL Estimates (Maximum Likelihood)

LAMBDA-Y
            SKISAT
SNOWSAT      1.000
FOODSAT      0.701
            (0.137)
             5.120

LAMBDA-X
           LOVESKI     DUMMY
NUMYRS       0.809
            (0.291)
             2.782
DAYSKI       0.865
            (0.448)
             1.930
SENSEEK        - -     1.000

GAMMA
           LOVESKI     DUMMY
SKISAT       0.625     0.389
            (0.185)   (0.112)
             2.540     3.480

Covariance Matrix of ETA and KSI
            SKISAT   LOVESKI     DUMMY
SKISAT       1.236
LOVESKI      0.625     1.000
DUMMY        0.389       - -     1.000

PHI
Note: This matrix is diagonal.
           LOVESKI     DUMMY
             1.000     1.000
                      (0.142)
                       7.036

PSI
            SKISAT
             0.694
            (0.346)
             2.007

Squared Multiple Correlations for Structural Equations
            SKISAT
             0.438

THETA-EPS
           SNOWSAT   FOODSAT
             0.520     0.507
            (0.223)   (0.126)
             2.327     4.015

Squared Multiple Correlations for Y-variables
           SNOWSAT   FOODSAT
             0.704     0.546

THETA-DELTA
            NUMYRS    DAYSKI   SENSEEK
             0.345    10.722       - -
            (0.454)   (1.609)
             0.760     6.664

Squared Multiple Correlations for X-variables
            NUMYRS    DAYSKI   SENSEEK
             0.655     0.065     1.000

Goodness of Fit Statistics

Degrees of Freedom = 4
Minimum Fit Function Chi-Square = 9.337 (P = 0.0532)
Normal Theory Weighted Least Squares Chi-Square = 8.910 (P = 0.0634)
Estimated Non-centrality Parameter (NCP) = 4.910
90 Percent Confidence Interval for NCP = (0.0 ; 17.657)

Minimum Fit Function Value = 0.0943
Population Discrepancy Function Value (F0) = 0.0496
90 Percent Confidence Interval for F0 = (0.0 ; 0.178)
Root Mean Square Error of Approximation (RMSEA) = 0.111
90 Percent Confidence Interval for RMSEA = (0.0 ; 0.211)
P-Value for Test of Close Fit (RMSEA < 0.05) = 0.127

Expected Cross-Validation Index (ECVI) = 0.312
90 Percent Confidence Interval for ECVI = (0.263 ; 0.441)
ECVI for Saturated Model = 0.303
ECVI for Independence Model = 1.328

Chi-Square for Independence Model with 10 Degrees of Freedom = 121.492
Independence AIC = 131.492
Model AIC = 30.910
Saturated AIC = 30.000
Independence CAIC = 149.518
Model CAIC = 70.567
Saturated CAIC = 84.078

Normed Fit Index (NFI) = 0.923
Non-Normed Fit Index (NNFI) = 0.880
Parsimony Normed Fit Index (PNFI) = 0.369
Comparative Fit Index (CFI) = 0.952
Incremental Fit Index (IFI) = 0.955
Relative Fit Index (RFI) = 0.808
Critical N (CN) = 141.770

Root Mean Square Residual (RMR) = 0.122
Standardized RMR = 0.0974
Goodness of Fit Index (GFI) = 0.965
Adjusted Goodness of Fit Index (AGFI) = 0.870
Parsimony Goodness of Fit Index (PGFI) = 0.257

Completely Standardized Solution

LAMBDA-Y
            SKISAT
SNOWSAT      0.839
FOODSAT      0.769

LAMBDA-X
           LOVESKI     DUMMY
NUMYRS       0.809
DAYSKI       0.255
SENSEEK        - -     1.000

GAMMA
           LOVESKI     DUMMY
SKISAT       0.562     0.350

Correlation Matrix of ETA and KSI
            SKISAT   LOVESKI     DUMMY
SKISAT       1.000
LOVESKI      0.562     1.000
DUMMY        0.350       - -     1.000

PSI
            SKISAT
             0.562

THETA-EPS
           SNOWSAT   FOODSAT
             0.296     0.454

THETA-DELTA
            NUMYRS    DAYSKI   SENSEEK
             0.345     0.935       - -

Regression Matrix ETA on KSI (Standardized)
           LOVESKI     DUMMY
SKISAT       0.562     0.350
By default, elements of the matrices are either fixed at zero or are free. Additionally, matrices are one of four possible shapes: full nonsymmetrical, symmetrical, diagonal, or zero. Matrices are referred to by their two-letter designation; for example, LX (lambda x) is a full, nonsymmetrical, and fixed matrix of the regression coefficients predicting the measured DVs from latent IVs. The model is specified by a combination of freeing (FR) or fixing (FI) elements of the relevant matrices. Freeing a parameter means estimating the parameter. When an element of a matrix is fixed with the key letters FI, it is fixed at zero. A command line begins with either FI (for fix) or FR (for free). Following this FI or FR specification, the particular matrix and specific element (row, column) that is to be freed or fixed is indicated. For example, from Table 14.4, FR LX(1,1) means free (FR) the element of the lambda x matrix (LX) that is in the first row and the first column (1,1), that is, the factor loading of NUMYRS on LOVESKI. Similarly, FI PH(2,1) indicates that the covariance that is in the second row, first column (2,1) of the phi matrix (PH) is fixed to zero (FI) (i.e., there is no relationship between LOVESKI and DUMMY). In this example, LX (LAMBDA-X) is a 3 × 2 full and fixed matrix of regression coefficients of measured variables predicted by latent IVs. The rows are the three measured variables that are the indicators of latent IVs: NUMYRS, DAYSKI, and SENSEEK; and the columns are the latent IVs: LOVESKI and DUMMY. LY (LAMBDA-Y) is a full and fixed matrix of the regression coefficients predicting measured DVs from the latent DV. In this example, LY is a 2 × 1 vector. The rows are the measured variables SNOWSAT and FOODSAT and the column is SKISAT. The PH (phi matrix) of covariances among latent IVs is by default symmetrical and free. In this example, phi is a 2 × 2 matrix. No covariance is specified between the dummy latent variable and LOVESKI; therefore PH(2,1) is fixed, FI. To estimate this model, the error variance associated with SENSEEK must be fixed to zero. This is done by specifying FI TD(3,3). TD refers to the theta delta matrix (errors associated with measured IVs serving as indicators of latent IVs); by default, this matrix is diagonal and free. A diagonal matrix has zeros everywhere but the main diagonal. In the small-sample example it is a 3 × 3 matrix. Only four of the eight LISREL matrices (LX, LY, PH, and TD) are included on the model (MO) line. LISREL matrices have particular shapes and elements specified by default. If these defaults are appropriate for the model, there is no need to mention the unmodified matrices on the MO line. In this example, the default specifications for TE, GA, PS, and BE are all appropriate. TE (theta epsilon) is diagonal and free by default. TE contains the covariances among errors associated with the measured DVs serving as indicators of the latent DVs. In this example, it is a 2 × 2 matrix. Gamma (GA) contains the regression coefficients of latent IVs predicting latent DVs. By default, this matrix is full and free. In this example, GA is a 1 × 2 vector. PS contains the covariances among errors associated with latent DVs; by default it is diagonal and free. In the small-sample example, there is only one latent DV; therefore, PS is simply a scalar (a number). BE contains the regression coefficients among the latent
DVs, by default a matrix of zeros. The small-sample example contains no relationships among latent DVs (there is only one latent DV), so there is no need to mention BE. Finally, for identification, a path is fixed to 1 on each factor and the variance of LOVESKI is fixed at the value 1. (See Section 14.5.1 for a discussion of identification.) This is accomplished with the key letters VA 1 and the relevant matrices and corresponding elements. The OU line specifies output options (SC completely standardized solution, SE standard errors, TV t values, RS residual information, SS standardized solution, and ND number of decimal places), not all of which are included in the edited output.

The highly edited output provides the unstandardized regression coefficients, standard errors for the regression coefficients, and t tests (unstandardized regression coefficient divided by standard error) by matrix in the section labeled LISREL Estimates (Maximum Likelihood). The statistical significance of parameter estimates is determined with a table of t distributions (Table C.2). A t statistic greater than 1.96 is needed for significance at p < .05 and 2.56 for significance at p < .01. These are two-tailed tests. If the direction of the effect has been hypothesized a priori, a one-tailed test can be employed, t = 1.65, p < .05, one-tailed. The goodness-of-fit summary is labeled Goodness of Fit Statistics. A partially standardized solution appears, by matrix, in the section labeled Completely Standardized Solution. The regression coefficients, and variances and covariances, are completely standardized (latent variable mean = 0, sd = 1; observed variable mean = 0, sd = 1) and are identical to the standardized solution in EQS. The error variances given in Completely Standardized Solution for both measured variables and latent variables are not actually completely standardized and are different from EQS (Chou and Bentler, 1993). An option in LISREL (not shown) is the standardized solution, a second type of partially standardized solution in which the latent variables are standardized to a mean = 0 and sd = 1, but the observed variables remain in their original scale.

AMOS syntax uses equations to specify the model. Syntax and edited output are presented in Table 14.5. After an SEM new model is specified with Dim Sem As New AmosEngine, general commands regarding output options are given. Each command begins with the letters Sem. Sem.TableOutput indicates that output be presented in table form similar to IBM SPSS for Windows style. Other options are available. Sem.Standardized requests a completely standardized solution. Sem.Mods 0 requests all modification indices.
Table 14.5 Structural Equation Model of Small-Sample Example Through AMOS (Syntax and Selected Output)

Sub Main
Dim Sem As New AmosEngine
Sem.TableOutput
Sem.Standardized
Sem.Mods 0
Sem.BeginGroup "UserGuide.xls", "smsample"
Sem.Structure "numyrs
Figure 15.4 Regression lines for three groups varying in (a) intercepts but not slopes, (b) both intercepts and slopes, and (c) slopes but not intercepts. Generated in SYSTAT.

Decisions about fixed versus random slopes apply separately to each predictor in a model. On the other hand, the decision about fixed versus random intercepts is made for the model as a whole, disregarding any predictors. Do the groups differ in average speed? That is, do runs have different speeds averaged over skiers? This is a question best answered through the intraclass correlation (Section 15.6.1). The assumption that the slopes are constant is testable in a multilevel model in which random slope coefficients are specified, as the test of the variance of slope. In the small-sample example with a level-1 predictor, slope variance = 0.2104 with a standard error of 0.1036 in the SAS and IBM SPSS output of Tables 15.7 and 15.8, respectively. This produces a z value of (0.2104/0.1036 =) 2.03, significant at the one-tailed α = .05 level. Thus, random slope coefficients are appropriate in this model. The presence of significant heterogeneity of slopes also shows the difficulty in interpreting the nonsignificant DV-predictor association (here, between speed and skill) as unimportant. Recall from ANOVA that main effects (in this case the DV-IV association) cannot be unambiguously interpreted in the presence of interaction (violation of the assumption of homogeneity of slopes). Failing to account for random slopes can have serious statistical and interpretational consequences. In the small-sample example, including skill as a fixed, but not random, effect (model not shown) produces a significant effect of skill. That is, one would conclude that speed is positively related to skill on the basis of such a model. However, as we have seen in Figure 15.2, that conclusion is incorrect for four of the runs.
Examination of the differences in slopes over the groups may be of substantive interest. For example, the slopes (and intercepts) for all of the runs in the small-sample example are illustrated in Figure 15.2 and suggest an interesting pattern. The slope for the Aspen run shows little relationship between skill and speed for that run. The remaining slopes, for the Mammoth runs, appear to group themselves into two patterns. Five of the runs have relatively low intercepts and positive relationships with skill. Runs are slow on average, but the greater the skill of the skier, the faster the speed. This suggests that the runs are difficult, but the difficulty can be overcome by skill. The remaining four runs have relatively high intercepts and negative relationships with skill. That is, these are fast runs but, for some reason, skilled skiers traverse them more slowly (perhaps to view the scenery). The pattern of negative relationships between intercepts and slopes is a common one, reflecting floor and ceiling effects. Those at the top have less room to grow than those at the bottom. Higher-level variables also can have fixed or random slopes, but slopes are necessarily fixed for predictors in the highest level of analysis. Thus, mountain is considered a fixed level-2 IV in the small-sample example of Section 15.4.3. If it turns out that the intercept and slopes for all predictors at the first level can be treated as fixed, a single-level model may be a good choice.
15.6.5 Statistical Inference

Three issues arise regarding statistical inference in MLM. Is the model any good at all? Does making a model more complex make it any better? What is the contribution of individual predictors?

15.6.5.1 ASSESSING MODELS  Does a model predict the DV beyond what would be expected by chance, that is, does it do any better than an intercepts-only model? Is one model better than another? The same procedures are used to ask if the model is helpful at all and to ask whether adding predictors improves it. The familiar χ² likelihood-ratio test (e.g., Equation 10.7) of a difference between models is used as long as models are nested (all of the effects of the simpler model are in the more complex model) and full ML (not REML) methods are used to assess both models. There are several ways of expressing the test, depending on the information available in the program used. Table 15.18 shows χ² equations using terminology from each of the software packages. Degrees of freedom for the χ² equations of Table 15.18 are the difference in the number of parameters for the models being compared. Recall that IBM SPSS provides the total number of parameters directly in the Model Dimension section of output (cf. Table 15.5). SAS presents the information in the Dimensions section in the form of Covariance Parameters plus Columns in X (cf. Table 15.4). The test to answer the question, "Does the model predict better than chance?," pits the intercept-only model of Table 15.5 (with a −2 Log Likelihood value of 693.468 and three parameters) against the full model of Table 15.11 (with a −2 Log Likelihood value of 570.318 and seven parameters). From Table 15.18:

χ² = 693.468 − 570.318 = 123.15
Table 15.18 Equations for Comparing Models Using Various Software Packages

Program                    Equation
SAS MIXED                  χ² = (−2 Log Likelihood)s − (−2 Log Likelihood)c
IBM SPSS MIXED             χ² = (−2 Log Likelihood)s − (−2 Log Likelihood)c
MLwiN                      χ² = (−2*loglikelihood)s − (−2*loglikelihood)c
HLM                        χ² = (Deviance)s − (Deviance)c
SYSTAT MIXED REGRESSION    χ² = 2(Log Likelihood)c − 2(Log Likelihood)s

Note: Subscript s = simpler model, c = more complex model.
This value is clearly statistically significant with (7 − 3 =) 4 df, so the full model leads to prediction that is significantly better than chance. SAS MIXED provides the Null Model Likelihood Ratio Test routinely for all models. To test for differences among nested models, chi-square tests are used to evaluate the consequences of making effects random, or to test a dummy-coded categorical variable as a single effect, or for model building in general (Section 15.6.8). Also, Wald tests of individual predictors can be verified, especially when samples are small, by testing the difference in models with and without them. Note, however, that if the test is for a random effect (variance component) other than covariance, the obtained p value associated with the chi-square difference test should be divided by two in order to create a one-tailed test of the null hypothesis that variance is no greater than expected by chance (Berkhof and Snijders, 2001). Non-nested models can be compared using the AIC that is produced by SAS and IBM SPSS (Hox, 2002). AIC can be calculated from the deviance (which is −2 times the log-likelihood) as:

AIC = d + 2p    (15.18)

where d is deviance and p is the number of estimated parameters. Although no statistical test is available for differences in AIC between models, the model with the lower value of AIC is preferred.

15.6.5.2 TESTS OF INDIVIDUAL EFFECTS  The programs reviewed provide standard errors for parameter estimates, whether random (variances and covariances¹¹) or fixed. Some also provide z values for those parameters (parameter divided by standard error, i.e., Wald tests), and some add p (probability) values as well. These test the contribution to the equation of the predictors represented by the parameters. However, there are some difficulties with these tests. First, the standard errors are valid only for large samples, with no guidelines available as to how large is large enough. Therefore, it is worthwhile to verify the significance of a borderline predictor using the model-comparison procedure of Section 15.6.5.1. Second, Raudenbush and Bryk (2001) argue that, for fixed effects, the ratio should be interpreted as t (with df based on number of groups) rather than z. They also argue that the Wald test is not appropriate for variances produced by random effects (e.g., variability among groups) because the sampling distributions of variances are skewed. Therefore, their HLM program provides chi-square tests of random effects. That is, the test as to whether intercepts differ among groups is a chi-square test. When tests with and without individual predictors (Section 15.6.5.1) are used for random effects, recall that, except for covariances, one-tailed tests are appropriate: one wants to know whether the predictor differs among higher-level units (e.g., among level-2 groups) more than expected by chance. Therefore, the p value for the chi-square difference test should be divided by two (Hox, 2002). An example of this test applied to a fixed predictor is available by comparing results of Sections 15.4.2 and 15.4.3, in which the level-2 IV, mountain, is added to a model that already has the level-1 predictor, skill:
χ² = 587.865 − 570.318 = 17.55
With 1 df produced by adding the single level-2 predictor to the less complex model, this is clearly a statistically significant result. The model is improved by the addition of mountain as a predictor. An application of the test to the addition of a random predictor is available by comparing the models of Sections 15.4.1 and 15.4.2, in which SKILL is added to the intercepts-only model. Using the most common form of the test:
χ² = 693.468 − 587.865 = 105.6
This is clearly a significant effect with 6 − 3 = 3 df. Remember to divide the probability level by two because the predictor is specified to be random.

¹¹ Recall that variances represent differences among groups in slopes or intercepts. Covariances are relationships between slopes and intercepts, or between slopes for two predictors if there is more than one predictor considered random.
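A minimal sketch of these model comparisons in Python, using the −2 log-likelihoods and parameter counts reported in this section (SciPy is assumed for the chi-square distribution; the variable names are illustrative):

from scipy.stats import chi2

# Intercepts-only model (Table 15.5) vs. full model (Table 15.11)
neg2ll_simple, k_simple = 693.468, 3
neg2ll_complex, k_complex = 570.318, 7

chi_sq = neg2ll_simple - neg2ll_complex   # 123.15
df = k_complex - k_simple                 # 4
p = chi2.sf(chi_sq, df)
# For a random effect (variance component), halve p for the one-tailed test
print(f"chi-square({df}) = {chi_sq:.2f}, p = {p:.3g}")

# AIC = deviance + 2 * number of estimated parameters (Equation 15.18)
print("AIC simpler:", neg2ll_simple + 2 * k_simple)
print("AIC complex:", neg2ll_complex + 2 * k_complex)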
15.6.6 Effect Size

Recall that effect size (strength of association) is the ratio of systematic variance associated with a predictor or set of predictors to total variance. How much of the total variance is attributable to the model? Kreft and DeLeeuw (1998) point out several ambiguities in the ill-defined methods currently available for calculating effect size in MLM. Counterintuitively, the error variances on which these measures are based can increase when predictors are added to a model, so that there can be "negative effect sizes." Further, between-groups and within-groups estimates are confounded unless predictors are centered. Kreft and DeLeeuw (1998) provide some guidelines for the calculation of η² with the caution that these measures should only be applied to models with random intercepts and should not be applied to predictors with random slopes.¹² Further, separate calculations are done for the within-groups (level-1) and between-groups (level-2) portions of the MLM, because residual variances are defined differently for the two levels. In general, for fixed predictors, an estimate of effect size is found by subtracting the residual variance with the predictor (the larger model) from the residual variance of the intercepts-only model (the smaller model), and dividing by the residual variance without the predictor¹³:

$$\eta^2 = \frac{s_I^2 - s_L^2}{s_I^2} \tag{15.19}$$

where s_I² is the residual variance of the intercepts-only model and s_L² is the residual variance of the larger model (note that the larger model generally has the smaller residual variance). There is as yet no convenient method for finding confidence intervals around these effect sizes. Refer to Kreft and DeLeeuw (1998, pp. 115-119) for a full discussion of the issues involved and definitions of these variances at the within-group and between-group levels of analysis for those relatively few instances when the measures can be applied and interpreted.
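A sketch of Equation 15.19 in Python, with hypothetical residual variances (the values below are illustrative only):

s2_intercepts_only = 0.90  # residual variance, intercepts-only model
s2_larger = 0.72           # residual variance, model with the fixed predictor

eta_sq = (s2_intercepts_only - s2_larger) / s2_intercepts_only
print(round(eta_sq, 2))    # 0.20: proportional reduction in residual variance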
15.6.7 Estimation Techniques and Convergence Problems As in all iterative p rocedures, a va riety of estimation algorithms are available, with d ifferent p rograms offering a somewhat d ifferent choice among them. Table 15.19 shows the methods relevan t to MLM in several p rograms. The most common methods are maxim um like lihood and restricted maximum likelihood. Maximum likelihood (ML) is a good choice when nested models are to be compared (e.g., w h en an effect has several pa rameter estimates to eva lua te as when a ca tegorical variable is represented as a series of dichotomous d ummy variables, or w hen a comparison with an in tercepts-only mod el is desired). In the case of ca tegorica l variables, models a re compared with and withou t the set of durnmy- BOldo-gan'S Ctditlon
2)~6.8S6
(CAlC)
setwtarn Ba-tts••
232lf~
Clcttton (SIC)
The normMion c.Wt«i art di$J!la.ftd in smalltt·IS..btltr form. ~ 0eptnde n1 V~nibte con110enee 1n:erva1 Lower Bound Upper BounQ
1.120963
.060000
18.683
.000
1.009324
1.244950
lnlercept (subject= subno)
Variance
.286073
.104448
2.739
.006
.1 39861
.585137
Intercept (subject= houseI
Variance
.411477
.178968
2.299
.021
.175438
.965092
a. Dependent Variable: annoy.
The null model has four parameters, one each for the fixed intercept (grand mean), variability in participant intercepts, variability in household intercepts, and residual variance. Applying Equation 15.15 for the second level:

0.28607 / (0.28607 + 0.41148 + 1.12096) = .16

With about 16% of the variability in annoyance associated with individual differences (differences among participants), an MLM is advisable. Applying Equation 15.16 for the third level:

0.41148 / (0.28607 + 0.41148 + 1.12096) = .23

With about 23% of the variance in annoyance associated with the third level of the hierarchy (differences among households), a three-level MLM is advisable.
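The two intraclass correlations are simple ratios of the variance components in the output above; a minimal sketch of the arithmetic in Python:

residual = 1.12096    # level-1 residual variance (within persons)
subno_var = 0.28607   # level-2 intercept variance (participants)
house_var = 0.41148   # level-3 intercept variance (households)

total = residual + subno_var + house_var
print(subno_var / total)   # Equation 15.15: about .16
print(house_var / total)   # Equation 15.16: about .23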
15.7.2 Multilevel Modeling

A three-level model is hypothesized with predictors at all three levels (noise, time to fall asleep, age, and site). Recall that no interactions are hypothesized, either within or between levels. Only nightleq is predicted to have a random slope; individual differences are expected in the relationship between noise and annoyance. A model in which noise was treated as random failed to converge, even when number of iterations was increased to 500 and probability of convergence was relaxed to .001. Therefore, the decision was made to treat all predictors as fixed effects. Table 15.26 shows syntax and output for the full three-level model.
Table 15.26 Three-Level Model of Annoyance as Predicted by Noise Level, Time to Fall Asleep, Age, and Site (IBM SPSS MIXED Syntax and Selected Output)

MIXED annoy BY site1 site2 WITH age nightleq latency
  /CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1)
    SINGULAR(0.000000000001) HCONVERGE(0, ABSOLUTE)
    LCONVERGE(0, ABSOLUTE) PCONVERGE(0.000001, ABSOLUTE)
  /FIXED=age nightleq latency site1 site2 | SSTYPE(3)
  /METHOD=ML
  /PRINT=SOLUTION TESTCOV
  /RANDOM=INTERCEPT | SUBJECT(subno) COVTYPE(UN)
  /RANDOM=INTERCEPT | SUBJECT(house) COVTYPE(UN).

Information Criteria (a)
-2 Log Likelihood                       2262.857
Akaike's Information Criterion (AIC)    2280.857
Hurvich and Tsai's Criterion (AICC)     2281.101
Bozdogan's Criterion (CAIC)             2331.401
Schwarz's Bayesian Criterion (BIC)      2322.401
The information criteria are displayed in smaller-is-better form.
a. Dependent Variable: annoy.
Fixed Effects

Type III Tests of Fixed Effects (a)

Source    Numerator df
$$\sqrt{\frac{1 + 2(0)}{20}} = 0.22$$
Autocorrelations and standard errors for other lags are calculated by the same procedure. Using Equation 17.8, the standard error of all the partial correlations is

$$SE_{pr} = \frac{1}{\sqrt{N}} = \frac{1}{\sqrt{20}} = 0.22$$

Calculation of the partial autocorrelations after the first few is labor intensive. However, McCleary and Hay (1980) provide equations showing the following relationships between ACF and PACF for the first three lags.

$$\mathrm{PACF}(1) = \mathrm{ACF}(1) \tag{17.9}$$

$$\mathrm{PACF}(2) = \frac{\mathrm{ACF}(2) - [\mathrm{ACF}(1)]^2}{1 - [\mathrm{ACF}(1)]^2} \tag{17.10}$$

$$\mathrm{PACF}(3) = \frac{\mathrm{ACF}(3) + [\mathrm{ACF}(1)]^3 + \mathrm{ACF}(1)[\mathrm{ACF}(2)]^2 - 2\,\mathrm{ACF}(1)\,\mathrm{ACF}(2) - [\mathrm{ACF}(1)]^2\,\mathrm{ACF}(3)}{1 + 2[\mathrm{ACF}(1)]^2\,\mathrm{ACF}(2) - [\mathrm{ACF}(2)]^2 - 2[\mathrm{ACF}(1)]^2} \tag{17.11}$$
If an autocorrelation at some lag is significantly different from zero, the correlation is included in the ARIMA model. Similarly, if a partial autocorrelation at some lag is significantly different from zero, it is included in the ARIMA model as well. The significance of full and partial autocorrelations is assessed using their standard errors. Although you can look at the autocorrelations and partial autocorrelations numerically, it is standard practice to plot them. The center vertical (or horizontal) line for these plots represents full or partial autocorrelations of zero; then symbols such as * or _ or bars are used to represent the size and direction of the autocorrelation and partial autocorrelation at each lag. You compare these obtained plots with standard, and somewhat idealized, patterns that are shown by various ARIMA models, as discussed more completely in Section 17.6.1. The ACF and PACF for the first 10 lags of the differenced scores of Table 17.3 are seen in Figure 17.4, as produced by IBM SPSS ACF. The boundary lines around the functions are the 95% confidence bounds. The pattern here is a large, negative autocorrelation at lag 1 and a decaying PACF, suggestive of an ARIMA (0, 0, 1) model, as illustrated in Section 17.6.1. Recall, however, that the series has been differenced, so the ARIMA model is actually (0, 1, 1). The series apparently has both linear trend and memory for the preceding random shock. That is, the quality of the computers is generally increasing; however, the quality in one week is influenced by random events in the manufacturing process from both the current and preceding week. The q value of 1 indicates that, with a differenced series, only the first of the two correlations in Equation 17.4 needs to be estimated, the correlation θ1.
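Equations 17.9 through 17.11 are the first three steps of the Durbin-Levinson recursion, which extends the same computation to any lag. A minimal sketch in Python (not part of the original example):

def pacf_from_acf(acf):
    # acf[i] is the autocorrelation at lag i, with acf[0] == 1.0;
    # returns partial autocorrelations at lags 1..len(acf)-1
    k = len(acf) - 1
    phi = [[0.0] * (k + 1) for _ in range(k + 1)]
    pacf = [0.0] * (k + 1)
    phi[1][1] = pacf[1] = acf[1]                 # Equation 17.9
    for m in range(2, k + 1):
        num = acf[m] - sum(phi[m - 1][j] * acf[m - j] for j in range(1, m))
        den = 1.0 - sum(phi[m - 1][j] * acf[j] for j in range(1, m))
        phi[m][m] = pacf[m] = num / den          # Equations 17.10 and 17.11 at m = 2, 3
        for j in range(1, m):                    # update intermediate coefficients
            phi[m][j] = phi[m - 1][j] - phi[m][m] * phi[m - 1][m - j]
    return pacf[1:]

# Example with ACF(1) = -.610 and ACF(2) = .114 from Figure 17.4:
print(pacf_from_acf([1.0, -0.610, 0.114]))       # [-0.61, -0.411...]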
ACF VARIABLES=QUALITY
 /LN
 /DIFF=1
 /MXAUTO 16
 /SERROR=IND
 /PACF.

QUALITY

Autocorrelations
Series: QUALITY

Lag  Autocorrelation  Std. Error  Box-Ljung Value  df  Sig.
 1       -.610           .212          8.238        1  .004
 2        .114           .206          8.545        2  .014
 3        .069           .200          8.664        3  .034
 4       -.203           .194          9.765        4  .045
 5        .247           .187         11.510        5  .042
 6       -.168           .181         12.373        6  .054
 7        .158           .173         13.203        7  .067
 8       -.241           .166         15.308        8  .053
 9        .151           .158         16.215        9  .063
10        .035           .150         16.268       10  .092
11        .003           .142         16.269       11  .131
12       -.145           .132         17.470       12  .133
13        .091           .123         18.024       13  .157
14       -.001           .112         18.024       14  .206
15       -.025           .100         18.086       15  .258
16        .101           .087         19.456       16  .246

[Figure 17.4: ACF and PACF plots of the differenced QUALITY series for the first 16 lags, with 95% confidence bounds; produced by IBM SPSS ACF.]
Parameter  Estimate   Standard Error  t Value  Approx Pr > |t|  Lag
MA1,1      -0.72743   0.19450         -3.74    0.0015           1

Variance Estimate     21.78926
Std Error Estimate    4.667896
AIC                   113.4393
SBC                   114.3837
Number of Residuals   19
* AIC and SBC do not include log determinant.

Autocorrelation Check of Residuals

To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
  6        3.04      5    0.6937    -0.261 -0.176  0.088 -0.030  0.143  0.015
 12       10.20     11    0.5123    -0.008 -0.245  0.294  0.148 -0.091 -0.050
 18       11.91     17    0.8053     0.044 -0.067  0.048 -0.005  0.035  0.026
SAS ARIMA also provides an Autocorrelation Check of Residuals, which assesses the departure of model residuals from random error. A significant result suggests discrepancies between the model and the data.
17.5 Types of Time-Series Analyses

There are two major varieties of time-series analysis: time domain (including Box-Jenkins ARIMA analysis) and spectral domain (including Fourier spectral analysis). Time domain analyses deal directly with the DV over time; spectral domain analyses decompose a time series into its sine wave components. Either time or spectral domain analyses can be used for identification, estimation, and diagnosis of a time series. However, current statistical software offers no assistance for intervention analysis using spectral methods. As a result, this chapter is limited to time domain analyses. Numerous complexities are available with these analyses, however: seasonal autocorrelation and one or more interventions (with different effects), to name a few.
17.5.1 Models with Seasonal Components

Seasonal autocorrelation is distinguished from "local" autocorrelation in that it is predictably spaced in time. Observations gathered monthly, for example, are often expected to have a spike at lag 12 because many behaviors vary consistently from month to month over the year. Similarly, observations made daily could easily have a spike at lag 7, and observations gathered hourly often have a spike at lag 24.³ These seasonal cycles can often be postulated a priori, while local cycles are inferred from the data. Like local cycles, seasonal cycles show up in plots of ACFs and PACFs as spikes. However, they show up at the appropriate lag for the cycle. Like local cycles, these autocorrelations can also be auto-regressive or moving average (or both). And, like local autocorrelation, a seasonal auto-regressive component tends to produce a decaying ACF and spikes on the PACF, while a moving average component tends to produce the reverse pattern.

When there are both local and seasonal trends, multiple differencing is used. For weekly measurement, as in the computer quality example, there could be a local linear trend at lag 1 and also a seasonal linear trend at lag 4 (monthly). Seasonal models may be either additive or multiplicative. The additive seasonal model just described would be (0, 2, 0), with differencing at lags 1 and 4. The notation for a multiplicative seasonal model is ARIMA (p, d, q)(P, D, Q)s, where s is the seasonal cycle. Thus, the notation for a seasonal model with a local trend at lag 1 and a seasonal trend component at lag 4 is (0, 1, 0)(0, 1, 0)4. In this model, the interaction between time and seasons also is of interest. For example, the seasonal trend may be stronger at lower (or higher) levels of the series. The additive model is more parsimonious than the multiplicative model, and it is often found that the multiplicative component is very small (McCain and McCleary, 1979). Therefore, an additive model is recommended unless the multiplicative model is required to produce acceptable ACF and PACF, as well as significant parameter estimates.

As an example of a seasonal model, reconsider the computer quality data, but this time with a linear trend added every four weeks. Perhaps monthly pep talks are given by the manager, or morale is improved by monthly picnics, or greater effort is made after receiving a monthly paycheck, or whatever. Data appear in Table 17.10 and are plotted using IBM SPSS TSPLOT. In the plot of quality, the astute observer might notice peaks at the 1st, 5th, 9th, 13th, and 17th weeks, indicative of a lag 3 autocorrelation after differencing. However, the pattern is much more evident from the ACF and PACF plots of differenced scores as seen in Figure 17.6, created by IBM SPSS ACF. The large partial autocorrelation at lag 3 indicates the presence of a monthly trend which has not yet been removed by differencing. IBM SPSS syntax for this differencing is seen in
³ Notice that the expected lag is one less than the cycle when the first observation is "used up" in differencing.
Table 17.10 Data with Both Trend and Seasonality

Week   Quality
  1      26
  2      21
  3      17
  4      19
  5      28
  6      21
  7      27
  8      28
  9      29
 10      24
 11      31
 12      20
 13      39
 14      21
 15      28
 16      28
 17      40
 18      31
 19      23
 20      34

[Plot of QUALITY (vertical axis, 10-50) against WEEK (1-20), generated in IBM SPSS TSPLOT; peaks appear at weeks 1, 5, 9, 13, and 17.]
Figure 17.7, together with the ACF and PACF for the results. The large ACF at lag 1 remains in the data to be modeled with a moving average component, an auto-regressive component, or both. The large PACF at lag 3 is now removed. The remaining autocorrelation can be modeled using ARIMA (1, 0, 0). The final model is ARIMA (1, 1, 0)(0, 1, 0)4. Section 17.7.4 shows a seasonal model with intervention through SAS ARIMA.
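The multiplicative (p, d, q)(P, D, Q)s notation maps directly onto open-source software as well. As an illustrative sketch (not part of the original analysis), the final model for the quality data could be fit with the statsmodels package in Python:

import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Weekly quality scores from Table 17.10
quality = pd.Series([26, 21, 17, 19, 28, 21, 27, 28, 29, 24,
                     31, 20, 39, 21, 28, 28, 40, 31, 23, 34])

# order=(p, d, q); seasonal_order=(P, D, Q, s) with s = 4 for the monthly cycle
model = SARIMAX(quality, order=(1, 1, 0), seasonal_order=(0, 1, 0, 4))
results = model.fit()
print(results.summary())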
17.5.2 Models with Interventions

Intervention, or interrupted, time-series analyses compare observations before and after some identifiable event. In quasi-experiments, the intervention is an attempt at experimental manipulation; however, the techniques are applicable to analysis of any event that occurs during the time series. The goal is to evaluate the impact of the intervention. Interventions differ in both the onset (abrupt or gradual) and duration (permanent or temporary) of the effects they produce. An impulse intervention occurs once, briefly, and often produces an effect with abrupt onset and temporary duration. An earthquake or a riot is likely to have such an effect. Sometimes discrete interventions occur several times (e.g., assassinations). Quasi-experiments, on the other hand, are more likely to have an intervention that continues over a considerable time; the effects produced are often also permanent or of long duration, but the onset can be either gradual or abrupt. For example, the profit-sharing plan applied to the small-sample computer quality example is likely to have a gradual-onset, long-duration effect. However, some impulse interventions also have an abrupt-onset, long-term or permanent effect; an example is bypass surgery. The connection between an intervention and its effects is called a transfer function. In time-series jargon, an impulse intervention is also called an impulse or a pulse indicator, and the effect is called an impulse or a pulse function. An effect with abrupt onset and permanent or long duration is called a step function. Because there are two levels of duration (permanent and temporary) and two levels of onset (abrupt and gradual), there are four possible combinations of onset and duration.
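In a regression-with-ARIMA-errors framework, these intervention types are coded as indicator variables: a step function is 0 before and 1 after the event, while a pulse function is 1 only at the event itself. A minimal sketch with statsmodels in Python (the series y and the intervention point are hypothetical):

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

n, onset = 120, 60
y = pd.Series(np.random.randn(n).cumsum())       # hypothetical series

step = (np.arange(n) >= onset).astype(float)     # abrupt, permanent effect
# pulse = (np.arange(n) == onset).astype(float)  # abrupt, temporary effect

model = SARIMAX(y, exog=step, order=(0, 1, 1))   # ARIMA (0, 1, 1) noise model
results = model.fit()
print(results.summary())                         # exog coefficient = intervention effect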
ACF VARIABLES=QUALITY
 /NOLOG
 /DIFF=1
 /MXAUTO 16
 /SERROR=MA
 /PACF.

[ACF and PACF plots for the differenced QUALITY scores, showing coefficients with upper and lower 95% confidence limits across the first 16 lags; produced by IBM SPSS ACF.]
Maximum Likelihood Estimation

Parameter  Estimate  Standard Error  t Value  Approx Pr > |t|  Lag  Variable  Shift
MA1,1      0.58727   0.07927          7.41    <.0001            1   LOG_INJ   0
MA2,1      0.76890   0.06509         11.81    <.0001           12   LOG_INJ   0

Autocorrelation Check of Residuals

To Lag  Chi-Square  DF  Pr > ChiSq  Autocorrelations
  6        3.50      4    0.4774     0.053 -0.078  0.015 -0.005 -0.137  0.012
 12        6.87     10    0.7377     0.067  0.004 -0.008  0.109  0.092  0.025
 18       10.66     16    0.8302    -0.031 -0.086 -0.125 -0.040  0.030  0.030
 24       14.88     22    0.8671     0.132 -0.032  0.074 -0.015  0.025 -0.061
(Continued)
759
760 Chapter 17
Table 17.24 (Continued)

Model for variable LOG_INJ
Period(s) of Differencing: 1, 12
No mean term in this model.

Moving Average Factors
Factor 1: 1 - 0.58727 B**(1)
Factor 2: 1 - 0.76890 B**(12)

Input Number 1
Input Variable: BELT
Period(s) of Differencing: 1, 12
Overall Regression Factor: -0.06198

Input Number 2
Input Variable: MONTH
Period(s) of Differencing: 12
Overall Regression Factor: 0.000014

Outlier Detection Summary
Maximum number searched   3
Number found              0
Significance used         0.001
Remaining output summarizes model information and shows that there are no level shifts that are not accounted for by the model at α = .001.

17.7.1.4.2 Model Interpretation  The negative sign of the intervention parameter (−0.06198) indicates that incapacitating injuries are reduced after introduction of the seat belt law. The anti-log of the intervention parameter is 10^−0.06198 = 0.87. In percentage terms, this indicates that A-level injuries are reduced to 87% of their pre-intervention mean, or by about 13%, after implementing the seat belt law. Interpretation is aided by finding the median (untransformed) number of injuries before and after intervention.⁴ Table 17.25 shows a SAS MEANS run for medians. (Note that data are already sorted by BELT.) Notice that the number of injuries predicted by the model after intervention, (.87)(3801.5) = 3307.3, is greater than the actual number of injuries observed (2791.5). This is because the model also adjusts the data for trends and persistence of random shocks (moving average components), and thus more accurately reflects the impact of the intervention. Effect size for the ARIMA intervention model is estimated using Equation 17.15, substituting s² from the model without intervention as an estimate of SS_yt. The Variance Estimate for the ARIMA model without
4 Recall from Section 4.1.6 that the median of the raw scores is an appropriate measure of central tendency after data have been transformed.
Table 17.25 Descriptives for Untransformed Time-Series Data (SAS MEANS Syntax and Output)

proc means data=SASUSER.TIMESER N MEDIAN;
  by BELT;
  var INJURIES;
run;

The MEANS Procedure

BELT=0
Analysis Variable: INJURIES
N     Median
66    3801.50

BELT=1
Analysis Variable: INJURIES
N     Median
66    2791.50
intervention parameters (not shown) is 0.000753. Degrees of freedom are 119 − 4 − 1 = 114, so that SS_yt = (0.000753)(114) = 0.0858. The Variance Estimate for the full model is 0.000719, as seen in Table 17.24. Degrees of freedom are 119 − 6 − 1 = 112, so that SS_res = (0.000719)(112) = 0.0805. Thus, effect size using Equation 17.15 is

$$R^2 = 1 - \frac{0.0805}{0.0858} = .06$$

That is, the intervention accounts for 6% of the variance in the ARIMA (0, 1, 1)(0, 1, 1)12 model. Table 17.26 is a checklist for an ARIMA intervention analysis. An example of a Results section, in journal format, follows.
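The two computations just shown, the antilog interpretation of the intervention parameter and the Equation 17.15 effect size, are reproduced in this short Python sketch:

# Intervention parameter from the base-10 log model (Table 17.24)
b_intervention = -0.06198
print(10 ** b_intervention)        # 0.87: injuries fall to 87% of baseline

# Effect size (Equation 17.15): residual SS without vs. with intervention
ss_baseline = 0.000753 * 114       # variance estimate x df, no intervention
ss_full = 0.000719 * 112           # variance estimate x df, full model
print(1 - ss_full / ss_baseline)   # about .06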
Table 17.26 Checklist for ARIMA Intervention Analysis

1. Issues
   a. Normality of sampling distributions
   b. Homogeneity of variance
   c. Outliers
2. Major analyses
   a. Identification, estimation, and diagnosis of baseline model
   b. Diagnosis of intervention model
   c. Interpretation of intervention model parameters, if significant
   d. Effect size
3. Additional analyses
   a. Tests of alternative intervention models, if nonsignificant
   b. Forecast of future observations
   c. Interpretation of ARIMA model parameters, if significant
Results

A time-series model for incapacitating injuries was developed using SAS ARIMA Version 9.4 to examine the effect of a seat belt law, introduced in Illinois in January 1985. Data were collected for 66 months before and 66 months after implementing the law. As seen in Figure 17.10, there are no obvious outliers among the observations, but variance appears to decrease over the time series, particularly after intervention. Therefore, a logarithmic transform was applied to the series, producing Figure 17.11. The 66-month pre-intervention series was used to identify a seasonal ARIMA (0, 1, 1)(0, 1, 1)12 model, with differencing required at lags 1 and 12 to achieve stationarity. The local moving average parameter was 0.58727 and the seasonal parameter 0.76890, both statistically significant, with t = 7.41 and t = 11.81, p < .05, respectively. The intervention parameter (−0.06198) was strong and statistically significant, t = −2.92, p < .05, R² = .06. The antilog of the parameter (10^−0.062) was 0.87, suggesting that the impact of the seat belt law was a 13% reduction in mean number of incapacitating injuries per month. An estimated effect size of R² = .06 suggests that about 6% of the variance in the ARIMA model is associated with implementation of the seat belt law. With 3,802 injuries per month on average before the law, the reduction was approximately 500 per month. Thus, the seat belt law was shown to be effective in reducing incapacitating injuries.
17.7.2 Time-Series Analysis of Introduction of a Dashboard to an Educational Computer Game

This example uses the IBM SPSS Expert Modeler to identify a time-series model to assess the effectiveness of introduction of a dashboard into a complex online educational game (Selene) designed to teach children lunar science while "creating the moon" (Reese, 2015). The DV was persistence, defined as cycles of play, aggregated over all participating children and days for each of 194 weeks. There were 181 weeks of play for the final version of the game before the dashboard was introduced into the game by showing up on the screen when requested by the player, and 13 weeks after the dashboard became part of the game.⁵ All data were collected automatically, with recordings made every 10 seconds of game play for each player. A more complete description of this facet of the research project is in Appendix A.8. Data are in DASHTIME.* Figure 17.12 shows the dashboard for one child at an advanced point in the game.
⁵ Ideally, there would be many more data collection points after intervention. However, funding for the project and its sponsoring facility ended, so there was no support for continuing the technically intensive data analysis effort.
A covariate was added to reflect proportion of plays during the week by Spanish-language players, to adjust for one hugely successful classroom of Spanish players in the Canary Islands who began play not very long before incorporation of the dashboard. The major question for this analysis was whether inclusion of the dashboard increased persistence in the game (after adjusting for the language covariate). That is, did children persist in playing the game for more cycles if they had access to a dashboard illustrating their progress in the game?

17.7.2.1 EVALUATION OF ASSUMPTIONS
17.7.2.1.1 Normality of Sampling Distributions and Homogeneity of Variance  Normality of sampling distributions and homogeneity of variance are typically evaluated by examining residuals for the ARIMA model for the baseline period, as part of the diagnostic process before the intervention series is included. However, inspection of Figure 17.13 suggests extreme
Figure 17.12 Example of the dashboard for one player at an advanced point in the Selene online game. CyGaMEs dashboard courtesy Wheeling Jesuit University, used with permission.
Figure 17.13 Persistence (a) and Proportion of Spanish-language play (b) over 191 weeks, before and after introduction of dashboard. (Plots generated in IBM SPSS Forecasting > Sequence Charts and modified in Adobe Photoshop CS5.)
Figure 17.14 Plot of transformed Persistence over 191 weeks, before and after introduction of dashboard. (Plot generated in IBM SPSS Forecasting > Sequence Charts and modified in Adobe Photoshop CS5.)

departures of normality and homogeneity of variance for both persistence and proportion of Spanish-language plays. The data show no apparent seasonality. After trying out various transformations, a logarithmic transform is applied to the DV. Figure 17.14 plots the DV after transformation. Because persistence includes zero, one was added to the raw score before application of the transform. The transformation has evened out the spikes in persistence. Lack of slope in the pre-intervention data indicates that there is no need for differencing. The only sensible transformation for the covariate was to recode proportion of Spanish-speaking plays into a dichotomous variable with 0 = no Spanish and 1 = some Spanish. This results in about 25% of data points with Spanish-speaking plays.

17.7.2.1.2 Outliers  Table 17.27 shows a FREQUENCIES run to examine the distribution after transformation for persistence. The z for the minimum score on log(Persistence) is marginally acceptable, (0.559 − 1.20484)/0.199422 = −3.24; however, the z for the maximum score is not
Table 17.27 Descriptive Statistics for Transformed DV (IBM SPSS FREQUENCIES Syntax and Output)

FREQUENCIES VARIABLES=LPersistence
 /STATISTICS=STDDEV MINIMUM MAXIMUM MEAN SKEWNESS SESKEW KURTOSIS SEKURT
 /HISTOGRAM NORMAL
 /ORDER=ANALYSIS.

Statistics
LPersistence
N Valid                      194
  Missing                      0
Mean                     1.20484
Std. Deviation           .199422
Skewness                   -.048
Std. Error of Skewness      .175
Kurtosis                   1.104
Std. Error of Kurtosis      .347
Minimum                     .559
Maximum                    1.957

[Histogram of LPersistence with normal curve overlay.]
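The LPersistence variable shown above is the base-10 logarithm of persistence after adding one, because the DV includes zero. A one-line sketch in Python with hypothetical counts:

import numpy as np

persistence = np.array([0, 12, 35, 8])      # hypothetical weekly cycles of play
lpersistence = np.log10(persistence + 1)    # matches LPersistence in Table 17.27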
Table 17.28 Data Set Sorted in Ascending Order by Transformed Persistence (LPersistence)

[Screenshot of DASHTIME.sav in the IBM SPSS Statistics Data Editor.]