240 32 7MB
English Pages [482] Year 2019
MATHEMATICS RESEARCH DEVELOPMENTS
STATISTICS VOLUME 2 MULTIPLE VARIABLE ANALYSIS
No part of this digital document may be reproduced, stored in a retrieval system or transmitted in any form or by any means. The publisher has taken reasonable care in the preparation of this digital document, but makes no expressed or implied warranty of any kind and assumes no responsibility for any errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of information contained herein. This digital document is sold with the clear understanding that the publisher is not engaged in rendering legal, medical or any other professional services.
MATHEMATICS RESEARCH DEVELOPMENTS Additional books and e-books in this series can be found on Nova’s website under the Series tab.
MATHEMATICS RESEARCH DEVELOPMENTS
STATISTICS VOLUME 2 MULTIPLE VARIABLE ANALYSIS
KUNIHIRO SUZUKI
Copyright © 2019 by Nova Science Publishers, Inc. All rights reserved. No part of this book may be reproduced, stored in a retrieval system or transmitted in any form or by any means: electronic, electrostatic, magnetic, tape, mechanical photocopying, recording or otherwise without the written permission of the Publisher. We have partnered with Copyright Clearance Center to make it easy for you to obtain permissions to reuse content from this publication. Simply navigate to this publication’s page on Nova’s website and locate the “Get Permission” button below the title description. This button is linked directly to the title’s permission page on copyright.com. Alternatively, you can visit copyright.com and search by title, ISBN, or ISSN. For further questions about using the service on copyright.com, please contact: Copyright Clearance Center Phone: +1-(978) 750-8400 Fax: +1-(978) 750-4470 E-mail: [email protected]. NOTICE TO THE READER The Publisher has taken reasonable care in the preparation of this book, but makes no expressed or implied warranty of any kind and assumes no responsibility for any errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of information contained in this book. The Publisher shall not be liable for any special, consequential, or exemplary damages resulting, in whole or in part, from the readers’ use of, or reliance upon, this material. Any parts of this book based on government reports are so indicated and copyright is claimed for those parts to the extent applicable to compilations of such works. Independent verification should be sought for any data, advice or recommendations contained in this book. In addition, no responsibility is assumed by the publisher for any injury and/or damage to persons or property arising from any methods, products, instructions, ideas or otherwise contained in this publication. This publication is designed to provide accurate and authoritative information with regard to the subject matter covered herein. It is sold with the clear understanding that the Publisher is not engaged in rendering legal or any other professional services. If legal or any other expert assistance is required, the services of a competent person should be sought. FROM A DECLARATION OF PARTICIPANTS JOINTLY ADOPTED BY A COMMITTEE OF THE AMERICAN BAR ASSOCIATION AND A COMMITTEE OF PUBLISHERS. Additional color graphics may be available in the e-book version of this book.
Library of Congress Cataloging-in-Publication Data ISBN: H%RRN
Published by Nova Science Publishers, Inc. † New York
CONTENTS Preface
vii
Chapter 1
A Correlation Factor
1
Chapter 2
A Probability Distribution for a Correlation Factor
17
Chapter 3
Regression Analysis
69
Chapter 4
Multiple Regression Analysis
117
Chapter 5
Structural Equation Modeling
169
Chapter 6
Fundamentals of a Bayes’ Theorem
179
Chapter 7
A Bayes’ Theorem for Predicting Population Parameters
223
Chapter 8
Analysis of Variances
277
Chapter 9
Analysis of Variances Using an Orthogonal Table
305
Chapter 10
Principal Component Analysis
339
Chapter 11
Factor Analysis
361
Chapter 12
Cluster Analysis
431
Chapter 13
Discriminant Analysis
453
References
461
About the Author
463
Index
465
Nova Related Publications
469
PREFACE We utilize statistics when we evaluate TV program rating, predict a result of voting, prepare stock, predict the amount of sales, and evaluate the effectiveness of medical treatments. We want to predict the results not on the base of personal experience or images, but on the base of the corresponding data. The accuracy of the prediction depends on the data and related theories. It is easy to show input and output data associated with a model without understanding it. However, the models themselves are not perfect, because they contain assumptions and approximations in general. Therefore, the application of the model to the data should be careful. We should know what model we should apply to the data, what are assumed in the model, and what we can state based on the results of the models. Let us consider a coin toss, for example. When we perform a coin toss, we obtain a head or a tail. If we try the coin toss three times, we may obtain the results of two heads and one tail. Therefore, the probability that we obtain for heads is 2/3, and the one that we obtain for tails is 1/3. This is a fact and we need not to discuss this any further. It is important to notice that the probability (2/3) of getting a head is limited to this trial. Therefore, we can never say that the probability that we obtain for heads with this coin is 2/3, in which we state general characteristics of the coin. If we perform the coin toss trial 400 times and obtain heads 300 times, we may be able to state that the probability of obtaining a head is 2/3 as the characteristics of the coin. What we can state based on the obtained data depends on the sample number. Statistics gives us a clear guideline under which we can state something is based on the data with corresponding error ranges. Mathematics used in statistics is not so easy. It may be a tough work to acquire the related techniques. Fortunately, the software development makes it easy to obtain results. Therefore, many members who are not specialists in mathematics can do statistical analysis with these softwares. However, it is important to understand the meaning of the model, that is, why some certain variables are introduced and what they express, and what we can state based on the results. Therefore, understanding mathematics related to the models is invoked to appreciate the results.
Kunihiro Suzuki
viii
In this book, we treat models from fundamental ones to advanced ones without skipping their derivation processes as possible as I can. We can then clearly understand the assumptions and approximations used in the models, and hence understand the limitation of the models. We also cover almost all the subjects in statistics since they are all related to each other, and the mathematical treatments used in a model are frequently used in the other ones. We have many good practical and theoretical books on statistics [1]-[10]. However, these books are oriented to special direction: fundamental, mathematical, or special subjects. I want to add one more, which treats fundamental and advanced models from the beginnings to the advanced ones with a self contained style. I also aim to connect theories to practical subjects. This book consists of three volumes:
The first volume treats the fundamentals of statistics. The second volume treats multiple variable analysis. The third volume treats categorical and time dependent data analysis
This volume 2 treats multiple variable analysis. We frequently treat many variables in the real world. We need to study relationship between these many variables. First of all, we treat correlation and regression for two variables, and it is extended to multiple variables. We also treat various techniques to treat many variables in this volume. The readers can understand various treatments of multiple variables. We treat the following subjects. (Chapter 1 to 4) Up to here, we treat statistics of one variable. We then study correlation and regression for two variables. The probability function for the correlation function is also discussed, and it is extended to multiple regression analysis. (Chapter 5) We treat data which consist of only explanatory variables. We assume an objective variable based on the explanatory variables. The analysist can assume the image variables freely, and he can discuss the introduced variables quantitatively based on the data. This process is possible with the structural equation modeling. We briefly explain the outline of the model.
Preface
ix
(Chapter 6 and 7) We treat a Bayes’ theorem which is the extension of conditional probability, and is vital to analyze problems in real world. The Bayes’ theorem enables us to specify the probability of the cause of the event, and we can increase the accuracy of the probability by updating it using additional data. It also predict the two subject which are parts of vast network. Further, we can predict population prametervalues or the related distribution of a average, a variance, and a ratio using obtained data. We further treat vey complex data using a hierarchical Bayesian theory. (Chapter 8 and 9) We treat a variance analysis and also treat it using an orthogonal table which enables us to evaluate the relationship between one variable and many parameter variables. Performing the analysis, we can evaluate the effectiveness of treatment considering the data scattering. (Chapter 10 to 13) We then treat standard multi variable analysis of principal component analysis, factor analysis, clustering analysis, and discriminant analysis. In principal component analysis, we have no objective variable, and we set one kind of an objective variable that clarify the difference between each members. The other important role of the analysis is that it can be used to lap the variables which have high correlation factors. A high correlation factor induces unstable results in multi variable analysis, which can be overcome by the proposed method associated with the principal component analysis. In factor analysis, we also have no objective variable. We generate variables that explain the obtained results. Cluster analysis gives us to make groups that resemble each other. In discriminal analysis, we have two groups. We decide that a new sample belongs to which group. We do hope that the readers can understand the meaning of the models in statistics and techniques to reach the final results. I think it is not easy to do so. However, I also believe that this book helps one to accomplish it with time and efforts. I tried to derive any model from the beginning for all subjects although many of them are not complete. It would be very appreciated if you point out any comments and suggestions to my analysis.
Kunihiro Suzuki
Chapter 1
A CORRELATION FACTOR ABSTRACT Analysis of a relationship between two variables is an important subject in statistics. The relationship is expressed with a covariance, and it is extended to a non-scale parameter of correlation factor. The correlation factor has a value range between -1 and 1, and it expresses the strength of the relationship between two variables. The ranking data have a special data form, and the corresponding correlation factor has its special form and is called as a Spearman’s correlation factor. We also discuss a pseudo correlation factor where the other variable makes as if two variables have a relationship each other.
Keywords: covariance, correlation factor, ranking data, Spearman’s correlation factor, pseudo correlation
1. INTRODUCTION We guess that smoking influences the lifespans based on our feeling. We want to clarify the relationship quantitatively. We obtain a data set of smoking or no-smoking and their lifespans. Using the data, we should obtain a parameter that expresses the significance of a relationship between smoking and the lifespan, and want to decide whether the two items have a relationship or not. It is important to study the relationship between the two variables quantitatively. We focus on the two quantitative data and study to quantify the relationship in this chapter.
Kunihiro Suzuki
2
2. A COVARIANCE We consider the case where we obtain various kinds of data from one sample. Table 1 shows concrete data of scores of mathematics, physics, and language for 20 members. Table 1. Data for scores of mathematics, physics, and language Mathematics 52 63 34 43 56 67 35 53 46 79 75 65 51 72 49 69 55 61 41 34
Physics 61 52 22 21 60 70 29 50 39 78 70 59 56 79 48 73 57 68 42 36
Language 100 66 64 98 86 35 43 47 65 43 72 55 75 82 81 99 57 81 75 63
100
100
80
80
Score of language
Score of physics
Member ID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
60 40 20 0
0
20 40 60 80 Score of mathematics (a)
100
60 40 20 0
0
20 40 60 80 Score of mathematics (b)
100
Figure 1. Scatter plots. (a) Sore of mathematics vs score of physics. (b) Score of mathematics vs score of language.
A Correlation Factor
3
We can see the relationship between scores of mathematics and physics or between scores of mathematics and language by obtaining scatter plots as shown in Figure 1 (a) and (b). We can qualitatively say that the member who gets a high score of mathematics is apt to get a high score of physics, and the member who gets a low score of mathematics is apt to get a low score of physics. We can also say that the score of mathematics and that of language have no clear relationship. We want to express the above discussions quantitatively. We consider the relationship between scores of mathematics and physics, or between scores of mathematics and language. The average of mathematics is given by N
x E X
x
i
i 1
N
(1)
where we express the score of mathematics for each member as xi , and N is the number of members, and N is 20 in this case. The average of physics or language is given by N
y E Y
y i 1
i
N
(2)
where we express the score of them for each member as yi . The variance of mathematics is given by N
xx 2 E X x 2
x
2
x
i 1
N
(3)
The variance of physics or language is given by
y N
2 yy 2 E Y y
2
y
i 1
N
(4)
Kunihiro Suzuki
4
2 2 2 Note that we express the variances as xx and yy instead of x and y to extend them to a covariance which is defined soon after. In the two variables, we newly introduce a covariance, which is given by 2
xy 2 E X x Y y
x y N
x
i 1
y
N
(5)
In the above discussion, we implicitly assume that the set is a population. We assume that the sample size is n . A sample average of a variable x is denoted as x and is given by n
x
x
i
i 1
n
(6)
A sample average of
y
is given by
n
y
y
i
i 1
n
(7)
An unbiased variance of x is given by n
sxx 2
x x
2
i 1
n 1
(8)
A sample variance of x is given by n
S xx 2
x x
2
i 1
n
An unbiased variance of
(9)
y
is given by
A Correlation Factor n
s yy
2
y y
2
i 1
n 1
(10)
A sample variance of n
S yy 2
5
y y
y
is given by
2
i 1
n
(11)
We can also evaluate the variances given by n
sxy 2
x x y y i 1
n 1
(12)
n
S xy 2
x x y y i 1
n
(13)
We study characteristics of the covariance graphically as shown in Figure 2. The covariance is the product of deviations of two variables with respect to each average. The deviation is positive in the first and third quadrant planes, and negative in the second and fourth quadrant planes. Therefore, if the data are only on the first and third quadrants, the product is always positive, and hence sum becomes large. If the data is scattered on all quadrant planes, the product sign is not constant, and some terms vanish each other, and hence the sum becomes small. The covariance between scores of mathematics and physics is 132.45, and that between scores of mathematics and language is 40.88. Therefore, the magnitude of the covariance expresses the strength of the relationship qualitatively. Let us consider the 100 point examination, which corresponds to Table 1. We can convert these scores to 50 point one simply by divided by 2. The corresponding scatter plots are shown in Figure 3. The relationship must be the same as that of Figure 1(a). However, the corresponding covariance is 40.88, which is much smaller than that of 100 points. Since we only change the scale in both figures, we expect the same significance of the relationship. Therefore, we want to obtain a parameter that expresses the strength of relationship independent of the data scaling, which is not the covariance.
Score of physics - y
40
-
+
20 0 -20 -40
-
+
-40 -20 0 20 40 Score of mathematics - x (a)
Score of language - y
Kunihiro Suzuki
6
40
-
+
+
-
20 0 -20 -40
-40 -20 0 20 40 Score of mathematica - x (b)
Figure 2. Scattered plots of deviations with respect to each variable’s average. (a) Sore of mathematics vs score of physics. (b) Score of mathematics vs score of language.
Score of physics
50 40 30 20 10 0
0
10 20 30 40 Score of mathematica
50
Figure 3. Scatter plots under 50 points.
3. A CORRELATION FACTOR A correlation factor
xy
xy
is introduced as
xy 2 xx 2 yy 2
(14)
A Correlation Factor
7
We can also define a sample correlation factor given by sxy
rxy
sxx
2
2
s yy
2
S xy
S xx
2
2
S yy
2
(15)
0.94
We obtain the same correlation factor of xy expected independent of the scale. We evaluate a following expectation given by
for Figure 1(a) and Figure 3 as is
2 E X x Y y 2 2 E 2 X x 2 X x Y y Y y
2 xx 2 xy yy 2
2
2
2 xy 2 2 xx 2 2 yy xx 2 2 xy 2 xy 2 2 2 xx 2 2 yy xx xx 2
xx
where
2
xy 2 2 xy 2 yy 0 2 xx xx 2
2
(16)
2
is an arbitrary constant. Since the formula has a quadratic form, we insist on
2 xy
xx 2
2
yy 0 2
(17)
Therefore, we obtain
2 xy
2
xx 2 yy 2
2 xy 2 2 xx yy
2
2 1 xy
Finally, the range of the correlation factor is given by
(18)
Kunihiro Suzuki
8
1 xy 1
(19)
If there is no correlation, that is, two variables are independent of each other, xy is 0. If the correlation is significant, the absolute value of xy approaches to 1. The above theory is for a population set. The sample correlation factor rxy is simply given by rxy
s xy
2
s xx s yy 2
2
(20)
This is also expressed with sample variances given by rxy
sxy
2
sxx s yy 2
2
S xy
2
S xx S yy 2
2
(21)
4. A RANK CORRELATION There is a special data form where we rank the team. If we evaluate m teams, the corresponding data is from 1 to m . We want to obtain the relationship between each member’s evaluation. The theory is exactly the same for the correlation factor. However, the data form is special, and we can derive a special form of a correlation factor, which is called as a Spearman’s correlation factor. Let us derive the correlation factor. Table 2 shows the ranking of each member to the football teams. We want to know the resemblance of each member, which is evaluated by the correlation factor. The correlation factor is simply evaluated by the procedure shown in this chapter, which is shown in Table 3. It is clear that member 1 resembles with member 3. Table 2. Ranking of each member to teams Member ID m1 m2 m3
Team A
B 2 4 1
C 3 5 3
D 5 1 4
E 1 6 2
F 6 3 6
4 2 5
A Correlation Factor
9
Table 3. Correlation factor of each member m1
m2 m3 -0.771429 0.8857143 -0.6
m1 m2 m3
We denote the data of the member 1 as a1 , a2 ,
b1 , b2 ,
, an , and the data of the member 2 as
, bn . The correlation factor is evaluated as n
a
r
i
i 1
n
a i 1
i
a bi b
a
2
b b n
i 1
2
i
(22)
The averages for a and b are the same and is decided by the team number, and is given by
a b
n 1 2
(23)
A variance is also decided only by the team number, and is given by n
a i 1
i
n
a bi b 2
i 1
2
n 1 n 1 n 1 1 2 3 2 2 2 2
n n 1 i 2 i 1
2
2n 1 n n 1 6
2
n 1
2
n n 1
2 n n 1 4n 2 6n 6 3n 3 n n 1 n 1 12
n 1 n 2
2
n n n 1 i 2 n 1 i n 2 i 1 i 1
1
n
n 1
2
4
12
(24)
Kunihiro Suzuki
10 We then evaluate a covariance. Further, we obtain
ai bi
a a b b
ai a bi b
2
2
i
2
2
i
2 ai a bi b
(25)
1 2 ai bi 2
(26)
2 ai a 2 ai a bi b 2
Therefore, we obtain
ai a bi b ai a
2
Finally, we obtain the Spearman’s correlation factor form as n
r
a
a 2
i
i 1
n
a i 1
i
1 n 2 ai bi 2 i 1 a
n
6 ai bi
1
2
2
i 1
n n 1 n 1
(27)
This gives the same results in Table 3. The maximum value for correlation factor is 1, and hence we set n
1 1
6 ai bi
2
i 1
n n 1 n 1
(28)
This leads to
ai bi The minimum value of the correlation factor is -1, and hence we set
(29)
A Correlation Factor n
1 1
6 ai bi
11
2
i 1
n n 1 n 1
(30)
This leads to n
a b i 1
i
2
i
1 n n 1 n 1 6
(31)
We can easily guess this is the case when one data is 1,2, ,n , and corresponding the other data is n, n 1, ,1 , respectively. The corresponding term is n
i n i 1 i 1
n
n
n
i 1
i 1
i 1
n i i 2 i 1 1 1 n n n 1 n n 1 2n 1 n n 1 2 6 2 1 1 n n 1 n 2n 1 1 2 3 1 n n 1 n 2 6
(32)
This agrees with Eq. (31)。 Random can be expressed by r 0 , this expressed by n
0 1
6 ai bi
2
i 1
n n 1 n 1
(33)
This leads to n
a b i 1
i
i
2
1 n n 1 n 1 6
(34)
Kunihiro Suzuki
12
5. A PSEUDO CORRELATION AND A PARTIAL CORRELATION FACTOR r
Even when the correlation factor xy for two variables of x and y is large, we cannot definitely say that there is a causal relationship. Let us consider the case of the number of sea accident and amount of ice cream sales. The back ground of the relationship is the temperature. If the temperature increases, the number of people who go to the sea increases, and the number of the accident increases, and the amount of ice cream sales may increase. The relationship between the accident number and the ice cream sales is related to the same one parameter of the temperature. The second example is the relationship between annual income and running time for 50 m. The income increases with age, and the time increase with the age. We then obtain the relationship between the income and the time thorough the parameter of the age. The above feature can be evaluated as the followings. When there is a third variable
u , and x and y both have correlation relationships,
we then obtain a strong correlation factor between x and y even if there is no correlation between them. The correlation under this condition is called as pseudo correlation. We want to obtain the correlation eliminating the influence of the variable The regression relationship between x and
S xx 2
Xi x
2 Suu
u is given by
rxu ui u (35)
Similarly, the regression relationship between y and
Yi y
2 S yy 2 Suu
u.
u is given by
ryu ui u (36)
The residuals for x and y associated with the
u are given by
exi xi X i xi x
S xx 2 Suu 2
rxu ui u (37)
A Correlation Factor
13
eyi yi Yi yi y
S yy 2 Suu 2
ryu ui u (38)
u . The correlation factor between these variables is the one eliminating the influence of u and is called as a causal These are the variables eliminating the influence of
correlation factor. The causal correlation factor is denoted as
rxy ,u
rxy ,u
, and is given by
Sex2ey Sex2ex Sey2ey
(39)
where
Sexi2e xi
S xx 2 S xx 2 1 n x x r u u x x r u u i i 2 xu i 2 xu i n i 1 Suu Suu 2 S xx 2 1 n 1 S xx 2 n 1 n 2 2 x x r u u 2 r i i n S 2 xu x x ui u 2 xu n i 1 i n i 1 i 1 Suu uu
S xx 2
S xx 2 2 2 r S 2S xx 2 rxu2 2 xz uu Suu
S xx 2 1 rxz2
(40) Similarly, we obtain
Seyi2e yi S yy 2 1 ryu2 On the other hand, we obtain
(41)
Kunihiro Suzuki
14 Sexi e yi 2
2 2 S yy S xx 1 n x x r u u y y ryu ui u i xu i i 2 2 n i 1 Suu Suu
S yy 2 1 n 1 n xi x yi y xi x ryu ui u 2 n i 1 n i 1 Suu
2 S 2 S xx 2 1 n 1 n S xx yy y y r x x r u u r u u 2 yu i i 2 xu i n 2 xu i n i 1 Suu i 1 S Suu uu S yy 2
S xy 2 ryz
2
S xx 2
S xz 2 rxz
2
S zz
S yz 2 rxz ryz
S zz
S xx 2 S yy 2 2
S zz
S zz 2
S xy S yy S xx rxu ryu S yy S xx ryu rxu S xx S yy rxu ryu 2
2
2
2
2
2
2
S xy S yy S xx rxu ryu 2
2
2
(42)
Finally, we obtain
rxy ,u
2 2 2 S xy S yy S xx rxu ryu
1 r S 1 r S 2 xu
2 xx
2 yu
2 yy
rxy rxu ryu
1 r 1 r 2 xu
2 yu
(43)
The other try is to take data considering the stratification. For example, if we take data considering the age, we take correlation selecting the limited age range with no relation between the income and running time. The above means the followings. If we have some doubt about the correlation and have some image of the third parameter that should be the cause of the correlation, we should take data for the limited range of the third parameter. This means that we should take data of two variables of x and y with the limited range of the third parameter of u where we can approximate that the value of u is constant. We can then evaluate the real relationship between x and y .
6. CORRELATION AND CAUSATION We should be careful about that a high correlation factor does not directly mean causation. On the other hand, if two variables have strong causation relationship, we can
A Correlation Factor
15
always expect a high correlation factor. Therefore, a high correlation is a requirement, but is not a sufficient condition. For example, there is implicitly the other variable, and two variables relate strongly to this implicit variable. We then observe a strong correlation. However, this strong correlation is related to the implicit variable. The relationship between two variables is summarized in Figure 4. The key process is whether there is an implicit variable.
Figure 4. Summary of the relationship between correlation and causation.
SUMMARY To summarize the results in this chapter, the covariance between variable given by
xy 2
x
and y is
1 N x x y y N i 1
This expresses the relationship between two variables. This can be extended to the correlation factor given by
xy
xy 2 xx 2 yy 2
Kunihiro Suzuki
16
2 yy2 xx where and are variances of variables x and y given by
xx 2
1 N 2 x x N i 1
yy 2
2 1 N y y N i 1
The correlation factor is given by
1 xy 1 When the data are order numbers, the related correlation factor is called as a Spearman’s factor and is given by n
r
a i 1
a 2
i
n
a
i
i 1
n
1
1 n 2 ai bi 2 i 1 a
6 ai bi
2
2
i 1
n n 1 n 1
If the correlation factor is influenced by the third variable correlation factor eliminating the influence of factor, and is given by.
rxy ,u
rxy rxu ryu
1 r 1 r 2 xu
2 yu
u , we want to know the
u and is called as a causal correlation
Chapter 2
A PROBABILITY DISTRIBUTION FOR A CORRELATION FACTOR ABSTRACT We derive a distribution function for the correlation factor, and the distribution function with a zero correlation factor is obtained as the special case of it. Using the distribution function, we evaluate the prediction range of the correlation factor, and the validity of the relationship.
Keywords: covariance, correlation factor, prediction, testing
1. INTRODUCTION We can easily obtain a correlation factor with one set of data. However, it should have an error range, which we have no idea up to here. The accuracy of the correlation factor should depend on the sample number. We want to evaluate the error range for the correlation factor. We therefore study a probability distribution function for a correlation factor. We can then predict the error range of the correlation factor. We perform matrix operation in the analysis here. The basics of matrix operation are described in Chapter 15 of volume 3.
2. PROBABILITY DISTRIBUTION FOR TWO VARIABLES We consider a probability distribution for two variables which have relation to each other.
Kunihiro Suzuki
18
We first consider two independent normalized variables z1 and z2 , which follow standard normal distributions. The corresponding joint distribution is given by f z1 , z2 dz1dz2 f z1 f z2 dz1dz2
1
z12 2
1
z2 2 2 dz dz 1 2
e e 2 2 1 1 exp z12 z2 2 dz1dz2 2 2
(1)
We then consider two variables with a linear connection of z1 and z2 , which is given by x1 az1 bz2 x2 cz1 dz2
(2)
where
ad bc 0
(3)
Eq. (2) can be expressed with a matrix form as x1 a x2 c
b z1 d z2
(4)
We then obtain z1 a b z2 c d d D c
1
x1 x2 b x1 a x2
(5)
where D is a determinant of the matrix, given by D
1 ad bc
(6)
A Probability Distribution for a Correlation Factor
19
We then have z1 dDx1 bDx2 z2 cDx1 aDx2
(7)
The corresponding Jacobian (see Appendix 1-5 of volume 3) is given by
J
z1 x1
z1 x2
z2 x1
z2 x2
dD bD cD aD
adD 2 bcD 2 D
(8)
Therefore, we obtain f z1 , z2 dz1dz2 f x1 , x2 dx1dx2 dx bx 2 cx ax 2 1 2 1 2 exp D 2 dx1dx2 2 2 D
(9)
We can evaluate averages and variances of x1 and x2 as
E x1 E az1 bz2 aE z1 bE z2 0
(10)
E x2 E cz1 dz2 cE z1 dE z2 0
(11)
2 E x12 E az1 bz2
2
(12)
a b 1 2
2
2 E x22 E cz1 dz2
2
c d 2 2
2
(13)
Kunihiro Suzuki
20
Cov x1 , x2 E x1 x2 E x1 E x1 E az1 bz2 cz1 dz2 acE z12 bdE z2 2 ad bc E z1 z2
(14)
ac bd 12 2
The correlation factor is given by
12 2
2 2 1 2
(15)
ac bd a 2 b2 c2 d 2
We then obtain
1 2
2 2 2 2 1 2 12 21 2 2 1 2
2 2 2 2 1 2 12
2 2
a
1 2 2
b 2 c 2 d 2 ac bd
(16)
2
12 2 2
2 ad bc 2 2 1 2
The determinant is then expressed by
D
1 ad bc
1
1 2
2
1
2 2
We preform calculation in the exponential term of
(17)
f x1 , x2
as
A Probability Distribution for a Correlation Factor
21
dx1 bx2 2 cx1 ax2 2 D 2 2
1 d 2 c 2 x12 2 bd ca x1 x2 b 2 a 2 x22 D 2 2 1 2 1 2 2 2 2 x12 2 1 2 x1 x2 1 x2 2 2 2 2 1 2 1 2
1
2 1 2
2 x1 x2 x2 2 2 x1 2 2 2 2 2 2 1 2 1 2
f x1 , x2
We then have
f x1 , x2
(18)
1 2
1 2
as 1 exp 2 2 2 1 2 1 2
2 x1 x2 x22 x1 2 2 2 2 2 1 2 2 1
(19)
This is the probability distribution for two variables that follow a normal distribution
0,
0,
2
2 2
with and , and the correlation factor between the two variables is . We need to generalize this for a certain averages instead of 0. Therefore, we introduce two variables as 1
u1 x1 1 u2 x2 2
(20)
, , These two variables follow standard distributions with and . The 2
2
1
1
2
2
Jacobian is not changed, and hence
f x1 , x2 dx1dx2 f u1 , u2 du1du2
(21)
We then obtain a probability distribution as
f u1 , u2
1 exp 2 1 2 2 2 2 1 1 2 1
2
2 2 u1 1 2 u1 1 u2 2 u1 2 2 2 2 2 2 1 2 1
(22)
Kunihiro Suzuki
22
u1 and u2 are the dummy variables, and we can change them to any notation, and
hence we use common variables x and y , and obtain
f x, y dxdy
1 exp 2 2 2 2 1 2 1 xx yy 1
2
2 x x 2 x x y y y y 2 2 2 yy2 xx xx yy
2
dxdy
(23) This is the probability distribution for two variables that follow normal distributions
, , with and . The correlation factor for the two variables is . 2
1
2
2
1
2
The probability associated with x y plane region is given by
P
A
f x, y dxdy
(24)
where A is the region in the x y plane.
3. THE PROBABILITY DISTRIBUTION FOR MULTI VARIABLES We can extend the probability distribution for multi variables. We consider m variables. The variables can be expressed as a form of vector as x1 x x 2 xm
(25)
The corresponding probability distribution can be expressed by
f x1 , x2 , , xm
where
1 m
2 2
D2 exp 2
(26)
A Probability Distribution for a Correlation Factor 11 2 12 2 2 22 2 21 2 2 m2 m1
1m2 2 2m 2 mm
23
(27)
and D is given by
D 2 x1 1 , x2 2 ,
11 2 12 2 21 2 22 2 , xm m m1 2 m 2 2
xi i x j j m
m
1m 2 x1 1 2 m 2 x2 2
mm 2 xm m
(28)
ij 2
i 1 j 1
where
k
1k k 2 k m
11 2 21 2 m1 2
(29)
12 2
1m 2 111 12 2
1m2
22 2
2 22 2 2 m 2 21
2 2m
m 2 2
mm 2 2 m1
m 22
1
(30)
2 mm
We do not treat this distribution here, but use it in Chapter 4 where multi-regression is discussed.
4. PROBABILITY DISTRIBUTION FUNCTION FOR SAMPLE CORRELATION FACTOR We pick up n pair data from a population set. The obtained data are expressed by
x1, y1 , x2 , y2 , , xn , yn , and the sample correlation factor r
is given by
Kunihiro Suzuki
24 r
1
1 2 2 n S xx S yy
n
x x y i
1 2 2 n S xx S yy 1
i
y
i 1
n
i 1
(31)
xi yi x y
where S xx 2
S yy 2
1 n
1 n
n
x
i
x
2
i 1
(32)
n
y y
2
i
i 1
(33)
We start with the probability distribution for correlation factor. The simultaneous probability distribution for
f x, y
1 exp 2 2 2 2 1 2 2 xx yy 1 1
x, y , and modify it for the sample
x, y is given by
2 2 x x y y x y x x 2 2 2 yy2 xx yy xx
2
(34)
2 where is the correlation factor of the population, xx and yy are the variance of 2
x
and y , respectively. We perform variable conversions as t
u
x x 2 xx
(35)
y y
yy2
The corresponding Jacobian is evaluated as follows.
(36)
A Probability Distribution for a Correlation Factor
25
x 2 xx t
(37)
x 0 u
(38)
y 0 t
(39)
y 2 yy u
(40)
We then have x, y t, u
2 xx
0
0
yy2
(41)
xx yy 2
2
The probability distribution for simultaneous variable pair t , u is then given by f x, y dxdy f t , u dtdu
1 exp 2 2 2 2 1 2 2 xx yy 1
1 exp 2 1 2 2 1 2
1
1
t
2
t
2
2 2 2 tu u 2 xx yy dtdu
2 tu u 2 dtdu
(42)
We further introduce a variable given by
v
u t 1 2
and convert the variable pair from t , u to t , v . The corresponding Jacobian is given by
(43)
Kunihiro Suzuki
26 t 1 t
(44)
t 1 v v t 1
(45)
1 2 1 2
u 0 t
(46)
u 1 2 v
(47)
and we obtain t, u t, v
1
1 2
(48)
1 2
0
1 2
The term in the exponential in Eq. (42) is evaluated as t 2 2 tu u 2 u t 2t 2 t 2 2
1 2 v2 1 2 t 2
(49)
We then obtain a probability distribution for variable pair of t , v as f t , u dtdu g t , v dtdv
v2 t 2 exp 1 2 dtdv 2 1 2 2 2
v2 1 t2 exp exp dtdv 2 2 2 2
1
1
(50)
A Probability Distribution for a Correlation Factor
27
Consequently, v and t follow a standard normal distributions independently. We convert each sample data as xi x
ti
2 xx
ui
(51)
yi y
yy2
(52)
We further convert ui to vi as
vi
ui ti 1 2
(53)
We know that
ti , vi follows a standard normal distributions independently.
We need to express r with ti , vi .
r is a sample correlation factor for the pair xi , yi . It is also equal to the sample correlation factor for the pair ti , ui . n
t
i
t
ui u
i 1
r
n
t
i
i 1
t
(54)
2
n
u
i
u
2
i 1
We only know that t and v follow standard distributions independently, but do not know for u . Therefore, we need to convert u to v in Eq. (53). We do not start with Eq. (54) directly, but start with a variable of q given by
q
n
v
i
2
(55)
i 1
2 Since vi follows a standard normal distribution, q follows a distribution with a freedom of n . Substituting Eq. (53) into Eq. (55), we obtain
Kunihiro Suzuki
28
2
2
2 ui ti 2ti 2
u t i i q 2 i 1 1 n
n
u
1 1 2
i
(56)
i 1
We want to relate q to r . Therefore, we investigate last terms in Eq. (56) with respect to the corresponding deviations as n
u
2
i
n
u
u u
i
i 1
2
i 1 n
u
u 2 ui u u u 2 2
i
i 1 n
n
n
i 1
i 1
i 1
2 ui u 2u ui u u 2
n
u
i
u nu 2 2
i 1
n
ti ui
i 1
n
t
i
(57)
ui u u
t t
i 1
n
t t u u t u u u t i
i
i
i 1
i
t t u
n
t t u u nt u i
i
i 1
n
t
i
2
i 1
n
t
t t
i
(58)
2
i 1
n
t
i
t
2 2 ti t t
i 1
n
t
i
t
i 1
n
t
i
2
2t
n
t
i
i 1
t
t 2
t
(59)
n
t
2
i 1
2 nu 2
i 1
Substituting Eqs. (57), (58), and (59) into Eq. (56), we obtain
A Probability Distribution for a Correlation Factor
q
u t i i 2 1 i 1 n
2
2
2 ui ti 2ti 2
n
1 1 2
u
1 1 2
n n 2 2 ui u nu 2 ti t i 1 i 1
1 1 2
n n 2 ui u 2 ti t i 1 i 1
i
i 1
1 nu 2 2 nt u 2 nt 2 1 2 q0 q1
q
1 1 2
n
u
i
2
2 ui ti 2ti 2
i 1
n
n
i 1
1 ui u 2 2 ti t 2 1 i 1 i 1 1 nu 2 2 nt u 2 nt 2 1 2 q0 q1
ui u 2 ti t 2
n n 1 ui u 2 nu 2 2 ti t 2 1 i 1 i 1
(60)
ui u nt u 2 ti t 2 nt 2
i 1
n
29
n
i 1
n
n
i 1
(61)
ui u 2 ti t 2
ui u nt u 2 ti t 2 nt 2
where n n 2 u u 2 ti t ui u 2 i i 1 i 1 n ti t ui u 1 2 2 2 i 1 T U 2 TU TU 1 2 1 U 2 2 rTU 2T 2 1 2
q0
1 1 2
n
t
i
i 1
t
2
n u 2 2 t u 2 t 2 2 1 n u t 2 1 2
q1
(62)
(63)
Kunihiro Suzuki
30 We define U and T as n
u u
U
2
(64)
i
i 1
n
t t
T
2
(65)
i
i 1
q1 is also expressed as q1
n u t 1 2
n 1 1 2 n
1 n 1 n
ui ti
u t i i 1 2
v
2
2
(66)
2
2
i
1 Since vi follows a standard normal distribution,
n
n
v
i
i 1
follows a standard normal
distribution. distribution. Therefore, q1 follows a 2
We analyze q0 next. We introduce a variable n
t
i
t vi
i 1
T
(67)
and analyze it before doing analysis of q0 . Substituting Eq. (53) into Eq. (67), we obtain
A Probability Distribution for a Correlation Factor n
t t
ui ti
i
1 2
i 1
T n
31
t t u u t i
1
i
i
i 1
1 2
t
T
n n ti t ui u ti t 1 i 1 i 1 T T 1 2 n ti t ui u 2 1 T i 1 U UT T 1 2 1 rU T 1 2
2
(68)
We define q2 as q2 2
1 r 2U 2 2r UT 2T 2 1 2
(69)
Finally, q0 is related to q2 as q0
1 1 U 2 2 rTU 2T 2 r 2U 2 2r UT 2T 2 q2 2 2 1 1
1 r2 2 U q2 1 2 q2 q3
(70)
where q3 is defined as q3
1 r2 2 U 1 2
(71)
Kunihiro Suzuki
32
Therefore, q can be expressed by the sum of them as q q0 q1
(72)
q1 q2 q3
Since q1 follows a 2 distribution with a freedom of 1, q2 q3 follows a 2 distribution with a freedom of n 1 . We then analyze q2 and q3 separately. Let us consider for fixed t of t0 , which is given by n
t t v i
i
i 1
T n
t t v i
i
i 1
n
t t
2
i
i 1
t0 t
n
vi i 1
n t0 t 1 n
2
n
v
i
i 1
(73)
This follows a standard normal distribution. Therefore,
q2 v, t 0 2
follows a
2
q v, t 0 distribution with a freedom of 1. Since Eq. (73) does not include t0 , 2 follows a
2 distribution with a freedom of 1 for any t . We can conclude that q2 follows a 2 2 distribution with a distribution with a freedom of 1. Consequently, q3 follows a freedom of n 2 . We further consider q4 as
q4 t T 2 Inspecting Eq. (65), this follows a
(74)
2 distribution with a freedom of n 1 .
A Probability Distribution for a Correlation Factor
33
Finally, we obtain three independent variables q2 , q3 , and q4 whose corresponding probability distributions we know. We introduce three variables below instead of using q2 , q3 , and q4 . We consider a variable
as
1
1
(75)
rU T
2
Since q2 follows a distribution with a freedom of 1, follows a standard normal distribution, and the corresponding probability distribution is given by 2
2
2 exp 2 2 1
f
(76)
Next, we introduce a variable
1 q3 2
1
2 1 2
1 r U 2
(77)
2
Since q3 follows a 2 distribution with a freedom of n 2 , we obtain f q3
1 n2 2 2
n2 2
q3
n4 2
q exp 3 2
(78)
Therefore, we can evaluate a probability distribution for as follows. d
1 dq3 2
We then have
(79)
Kunihiro Suzuki
34 f q3 dq3 f d
1 n2 2 2
n2 2
1 n2 2
2
n4 2
n4 2
exp 2d
(80)
exp d
Finally, we introduce a variable of 1 q4 2 1 T2 2
(81)
Since q4 follows a 2 distribution with a freedom of n 1 , we obtain a probability distribution for as f d
1 n 1 2 2
n 1 2
1 n 1 2
n 3 2
2
n 3 2
exp 2d
(82) exp d
Therefore, the simultaneous three variable probability distribution is given by
f , ,
1 2
e
2 2
n4 2 e
n 3 2 e
n 2 n 1 2 2
1 n 2 n 1 2 2 2
n 4 n 3 2 2
(83) exp 2 2
We convert this probability distribution to the one for r , T ,U . We evaluated the partial differentials below.
A Probability Distribution for a Correlation Factor rU T 1 2 r r 1 1
(84)
U
1 2
rU T 1 T T 1 2 1
1 2
1 1 2
(86)
r
1 1 r2 U 2 2 r 2 1 r
(85)
rU T 1 2 U U 1
35
1
1 2
rU
1 1 r2 U 2 2 T 2 1 T
(87)
2
(88)
0 1 1 r2 U 2 2 U 2 1 U
(89)
1 1 r2 U 1 2 1 2 T r 2 r 0
(90)
1 2 T T 2 T T
(91)
1 T2 U 2 U 0
(92)
Kunihiro Suzuki
36
We then obtain corresponding Jacobian as 1 , ,
r , T ,U
1
2
U
1 rU 2 1 2
1
2
3 2 2
1
1 2
r
T 1
1
1 1 r2 U 1 2 0
0
0
1
(93)
TU 2
Therefore, we obtain a probability distribution with respect to a variable set of r , T ,U as f r , T ,U
1
1
1 2 1 2
T2 2
1 n 2 n 1 2 2 2
TU 2
3 2 2
1 r2 U 2
n 3
2 1 exp 2 1 2 1 TU 2 3
1
2 2
1 n4 2 2
1 n 3 2 2
T 2 2r TU 1 n 2 n 1 2 2 2
U
2
1 r 2
n4 2 2
1
n4 2
U n4
T n 3
1 exp 2 1 2
n4 2
U
2
T 2 2r TU
1 1 1 n 2 n 1 n 7 2 2 2 1 2 2 2
1 exp 2 1 2
U
2
1 r 2
T 2 2r TU
(94) n 1 2
n4 2
T n 2U n 2
A Probability Distribution for a Correlation Factor
37
T and U have value ranges between 0 and ∞. We further convert variables of T and U
to and given by T e 2 2 U e
(95)
Modifying these equations, we obtain TU U e T
(96)
The corresponding partial differentials are evaluated as
T 1 e2 2
(97)
T 1 e2 2
(98)
U 1 e 2 2
(99)
U 1 e 2 2
(100)
The corresponding Jacobian is given by
T ,U ,
1
2 1 2
1 e2 2
e2 e
2
1 e 2 2
1 1 2 1 e2 e 2 e e2 2 2 2 2 1 1 4 4 1 2
1
(101)
Kunihiro Suzuki
38
Therefore, the term Q U , T including T ,U is given by 1 Q T ,U T n 2U n 2 exp 2 1 2
n2
U
2
T 2 2r TU
n2
1 exp 2 1 2 n 2 exp cosh r 2 1
e2
e 2
e e 2r
(102)
changes from 0 to ∞, and changes from -∞ to ∞. Therefore, the corresponding integration is given by
Q T ,U dTdU
1 n2 exp 1 2 cosh r d d 2 0
n 1 1 2
n 1
(103)
d n 1 0 cosh r 1
We set a variable t
cosh r 1 2
(104)
and obtain
n2 exp 1 2 cosh r d 0
1 2 cosh r
n2
1 2 cosh r
n 1
0
0
t n 2 exp t
1 2 dt cosh r
t n 2 exp t dt
1 2 n 1 cosh r
n 1
(105) Since cosh is an even function, and hence we change the integration region from 0 to ∞, and multiply it by 2 and obtain
A Probability Distribution for a Correlation Factor 1 1 1 n 2 n 1 n 7 2 2 2 1 2 2 2
f r
n 1 1 2
n 1
n4 2
1 d n 1 0 cosh r
n 1
1 r 2
n 1 2
1
n 2 n 1 2n 3 2 2
(106)
n 1 2
1 1 r 2
39
2
n4 2
1 d n 1 0 cosh r
Since the term in Eq. (106) is performed further as n 1 n 2 n 1 n 3 2 n 2 , n2 n 2 2 2
(107)
Eq. (106) is reduced to n 1
f r
n 2 n 1 2 n 3 2 2
n 1
1
n2
n 1 2
1 1 r n 2 2
n 1 2
1 1 r 2
2
n4 2
n 1 2
1 1 r 2
2
n4 2
2
n4 2
1 d n 1 0 cosh r
1 d n 1 0 cosh r
(108)
1 d n 1 0 cosh r
We further perform integration in Eq. (108) as follows. We introduce a variable N as
n 1 N
(109)
The term is then expanded as 1
cosh r
n 1
1
cosh r N
1 cosh N
1 cosh N
r 1 cosh
m 0
N
N N 1
(110)
N m 1 r m m!
cosh m
Kunihiro Suzuki
40
where we utilize a Taylor series below. f x 1 x
N
f 1
f x
x x 0
1 2 f 2! x 2
x2 x 0
1 3 f 3! x3
x3 x 0
1 1 1 Nx N N 1 x 2 N N 1 x 3 2! 3! N N 1 N m 1 m x m! m 0
(111)
where x
r cosh
(112)
in this derivation process. Therefore, we can focus on the integration of the form given by
1 du 0 cosh k u
(113)
Performing a variable conversion as x
1 cosh u
(114)
we then obtain dx
sinh u du cosh 2 u
cosh 2 u 1 du cosh 2 u 1 x 2 2 1du x
x 1 x 2 du
(115)
A Probability Distribution for a Correlation Factor
41
The integration of Eq. (113) is then reduced to 1 1 k du dx x k 0 cosh u 1 x 1 x 2 0
1
x k 1 dx 0 1 x 2
(116)
Further, we convert the variable as
x y
(117)
and obtain
dx 1 dy 2 y
(118)
Therefore, the integration is reduced to k 1
1
1 y 2 1 x k 1 dx dy 1 y 2 y 0 1 x 2 0
1 2
1
0
k
y2
1
1 y
dy
1 k 1 B , 2 2 2
k 1 2 2 k 1 2
(119)
where is a Gamma function and B is a Beta function (see Appendix 1-2 of volume 3). The integration is then given by
Kunihiro Suzuki
42
1 d n 1 0 cosh r
1 N 0 cosh
m 0
N N 1
N N 1
cosh m
m!
N m 1
r m
1
0 cosh N m
d d
r m
1 2
N m 2 N m 1 2
r
1 2
N N 1 N m 2 2 2 N m N m 1 2 2 2
N m 1 m!
m 0
N m 1 r m
m!
m 0
N N 1
1 N N 1 2 2
m 0
N N 1
N m 1 m!
m
(120)
Since we have relationships given by N N 1 2 N N 2 ! 2 2 2
(121)
N m N m 1 1 N m N m 2 2 2
(122)
the term in Eq. (120) is reduced to
1 2
N N 1 N m 2 2 2 N m N m 1 2 2 2
2 N N 2 ! 2 N m 1 2 1 N m 22 N m 2
2m
2
m
N m 2 2 N 2 ! N m N m 2 2 N 2 ! N m 1!
(123)
A Probability Distribution for a Correlation Factor
43
Therefore, we substitute this into Eq. (120), we obtain
1 d n 1 0 cosh r
1 N N 1 2 2
N N 1
m!
m 0
N m 1
r
m
N m 2 2 N 2 ! N m 1!
2m
1 N N 1 2 2
1 r m 2m N 2 ! m! m 0
N m 2 2 N 1!
r 2m 1 m! N N 1 m 0 2 2 m
N m 2 2 N 1
r 2m 1 m! n 1 n 2 m 0 2 2 m
(124)
n 1 m 2 2 n 2
Therefore, the probability distribution for sample correlation factor r is given by f r
n2
n2
n 1 2
1 1 r 2
2
n 1 2 2
n4 2
n4 r2 2
1 1 n 1 2
1 1 r 2
2
n4 2
n 1 n 2 2 2
1 d n 1 0 cosh r
m 0
r 2m 1 m! n 1 n 2 m 0 2 2
m
n 1 m 2 2 n 2
(125)
2r m 2 n 1 m m!
2
This is a rigorous model for a probability distribution for correlation factor. However, it is rather complicated and is difficult to use. Therefore, we consider a variable conversion
Kunihiro Suzuki
44
and derive an approximated normal distribution, and do range analysis with the approximated normal distribution.
5. APPROXIMATED NORMAL DISTRIBUTION FOR CONVERTED VARIABLE We convert the variable r to as 1
1 r
ln 2 1 r
(126)
This can be expanded as 1 r 1 r 1 1 r r3 r5 3 5 1 2
ln
2k 1r 1
(127)
2 k 1
k 1
This changes a variable range from to when r changes from -1 to +1 as shown in Figure 1. Let us consider the region for r 0 . When r is less than0.5, is almost equal to r , and then it deviate from r and become much larger than r when r approaches to 1, which is the required feature. 3 2
1 r
0 -1 -2 -3 -1
Figure 1. The relationship between
-0.5
0 r
0.5
and sample correlation factor r .
1
A Probability Distribution for a Correlation Factor
45
We then derive a probability distribution for . We solve Eq. (126) with respect to r and obtain r tanh
(128)
Each increment is related to each other as dr
1 d cosh 2
(129)
Therefore, the corresponding probability distribution is given by
f
1 2
n 1 2
n 1 n 2 2 2
1 tanh 2
n4 2
cosh 2
2 tanh m 2 n 1 m
m!
m 0
2
(130)
Probability density
12 10 8 6 4
n = 100
f(r) f() f(r) f() f(r) f() =0
= 0.4
= 0.8
2 0 -0.5
0
Figure 2. Probability distribution function for
0.5 r,
r
and
1
1.5
.
Figure 2 shows a probability distribution for r and with a sample number of 100. f r
becomes narrow and asymmetrical with increasing , which is due to the fact the sample correlation factor has a upper limit of 1. On the other hand, f has the almost identical shape and move parallel. f and f r
becomes same as the absolute value of decreases.
In the small , we can expect a small r . We can then approximate as
Kunihiro Suzuki
46 1 3
1 5
r r3 r5 r
(131)
Therefore, f r and f are the same.
5 4 f()
n = 100
= 0.0 = 0.4 = 0.8
3 2 1 0
-0.4
-0.2
0 -
0.2
Figure 3. The probability distribution with respect to the deviation of
0.4
.
We study the average of the . We guess that the average of can be obtained by substituting into Eq. (126) as
1 1 ln 2 1
(132)
Therefore, we replot by changing the lateral axis in Figure 2 from to , which is shown in Figure 3. All data for 0,0.4,0.8 become almost the same. Therefore, the average is well expressed by Eq. (132), and all the profiles are the same if we plot them with a lateral axis of . Next, we consider the variance of the profile. When we use a lateral axis as , the profiles do not have dependence. Therefore, we set 0 . In this case, f f r , and the probability distribution f t 2 converted from f r is a t distribution with a freedom of n 2 . The variance t n 2 of
the t distribution with a freedom of n 2 is given by
A Probability Distribution for a Correlation Factor t n 2 2
n2 n4
47 (133)
We re-express the relationship between t and r as r
t n2
(134)
1 r2
This is not so simple relationship, but we approximate that r is small, and obtain t n 2r
(135)
We then have r
t
(136)
n2
In this case, the variance for r expressed by r 2
1
n2
2
r2n 2
is given by
t2n 2
1 n2 n2 n4 1 n4
(137)
This is the approximated one. When n is large, we can expect that this treatment is accurate. However, we may not expect accurate one for small n . Therefore, we propose that we modify the model of Eq. (137) and change the variable as 2
1 n
(138)
This model includes a parameter . We decide the value by comparing the calculated results. In the standard text book, is set at 3. The model and numerical data agree well with
2.5 as shown in Figure 4.
Kunihiro Suzuki
48 0.8
n=5 =2
0.7
f()
= 2.5
0.6
=3
0.5 =4
0.4 0.3
-0.4
-0.2
0 (a)
0.2
0.4
1.2 n = 10 =2 = 2.5
1.1
f()
=3
1.0
=4
0.9 -0.3 -0.2 -0.1
0 (b)
Figure 4. Comparison of analytical models with various
0.1
0.2
0.3
and numerical data.
4.0 = 2.5
n=5 n = 10 n = 50 n = 100 Gauss
f()
3.0 2.0 1.0 0.0 -2
-1
0
1
2
Figure 5. Comparison between analytical model and numerical model for various n.
A Probability Distribution for a Correlation Factor
49
Figure 5 compares an analytical model with 2.5 and a numerical model for various n . An analytical model well reproduces the numerical data for any n . Therefore, the probability distribution for can be approximated with a normal distribution as
f
1 2
n 1 2
1 tanh 2
n4 2
n 1 n 2 cosh 2 m 0 2 2 2 1 ln 1 2 1 1 exp 1 2 2 n 2.5 n 2.5
2 tanh m 2 n 1 m m!
2
(139) When we perform the conversion given by
1 1 ln 2 1 z 1 n 2.5
(140)
z follows a standard normal distribution as
z2 f z exp 2 2 1
(141)
We finally evaluate the normal distribution approximation for 0 . We use a sample number n 5 where the accuracy is sensitive. The analytical model reproduces the numerical data with
2.5 as shown in Figure 6.
Kunihiro Suzuki
50
= 1e-8 = 0.4 = 0.8 Gauss( = 3) Gauss( = 2.5)
0.7 n=5
0.6
f()
0.5 0.4 0.3 0.2 0.1 0.0 -2
-1
0
1
2
3
Figure 6. Comparison of numerical and analytical model for various at the sample number of 5. The parameters for the analytical model are set at 3 and 2.5.
6. PREDICTION OF CORRELATION FACTOR We showed that the converted variable follows a normal distribution and we further obtained approximated averages and variances. Therefore, related prediction range is given by 1 1 rxy ln 2 1 rxy
zP
1 1 1 1 rxy ln ln n 2.5 2 1 2 1 rxy 1
zP
1 n 2.5
(142)
We then obtain the following form min max
(143)
where max is evaluated from the equation of 1 1 max ln 2 1 max
1 1 rxy ln 2 1 rxy
1 zP n 2.5
(144)
A Probability Distribution for a Correlation Factor
51
We set GH as
1 1 rxy GH ln 2 1 rxy
zP
1 n 2.5
(145)
and obtain max as a function of GH as below.
1 max ln 2GH 1 max 1 max e 2GH 1 max 1 max 1 max e 2GH
1 e 2GH
max
max
e2GH 1
e 2GH 1 e 2GH 1
(146)
Similarly, we obtain min as
min
e2GL 1 2G e L 1
(147)
where
1 1 rxy GL ln 2 1 rxy
zP
1 n 2.5
(148)
Figure 7 shows the prediction range for a sample correlation factor. The range is very wide, and hence we need a sample number at least 50 to ensure the range is less than 0.5.
Kunihiro Suzuki
52
1.0
xy
0.5
n=5 10 20 40 60 100
0.0 -0.5 -1.0 -1.0
n = 100
-0.5
0.0 rxy (a)
60
0.5
5 10 20 40
1.0
1.0 n = 100 400
0.5
xy
1000
0.0 100 400
-0.5 -1.0 -1.0
n = 1000
-0.5
0.0 rxy (b)
0.5
1.0
Figure 7. Prediction range for sample correlation factor. (a) n 100 , (b) n 100 .
Let us study Eq. (143) in more detail. When n is sufficiently large, we can neglect the second term in
GH
, and it is reduced
to
1 1 rxy GH ln 2 1 rxy
Then the max is also reduced to
(149)
A Probability Distribution for a Correlation Factor
max
1 rxy ln 1 rxy
1 rxy ln 1 r e xy
e
1 rxy 1 rxy 1 rxy 1 rxy
53
1 1
1 1
2rxy 2
rxy
(150)
Then, the sample correlation factor becomes that of population. When n is not so sufficiently large, we can approximate as follows. 1 rxy exp 2GH exp ln 1 rxy 1 rxy exp zP 1 rxy 1 rxy 1 z P 1 rxy
2 zP n 2.5 n 2.5 2
(151)
n 2.5 2
Therefore, we obtain
max
1 rxy 2 1 zP 1 1 rxy n 2.5 1 rxy 2 1 zP 1 1 rxy n 2.5 1 rxy rxy zP n 2.5 1 rxy 1 zP n 2.5
Similarly, we obtain an approximated min as
(152)
Kunihiro Suzuki
54
min
rxy z P 1 zP
1 rxy n 2.5 1 rxy n 2.5
(153)
The approximated model is compared with the rigorous one as shown in Figure 8. The agreement is not good for n less than 50, but the simple model does work for n larger than 100. The accuracy of the model is related to the Taylor expansion, and it is expressed by
n
3 4 z 2p
(154)
We decide the prediction probability as 0.95, and corresponding Eq. (154) is reduced to n
zp
3 4 1.962 18.4
(155)
This well explains the results in Figure 8.
1.0
xy
0.5
n = 20 (Rigorous) n = 40 (Rigorous) n = 100 (Rigorous) n = 200 (Rigorous) n = 20 (Simple) n = 40 (Simple) n = 100 (Simple) n = 200 (Simple)
0.0 -0.5 -1.0 -1.0
is 1.96. Therefore,
-0.5
0.0 rxy
0.5
Figure 8. Comparison between simple and rigorous correlation factor models.
1.0
A Probability Distribution for a Correlation Factor
55
7. PREDICTION WITH MORE RIGOROUS AVERAGE MODEL Let us inspect the Figure 6 in more detail. The analytical normal distribution deviates from the numerical data for large . The analytical models shift left slightly with increasing . This is due to the inaccurate approximation of the model for an average. This is can be improved by using a model given by
1 1 ln 2 1
2 n 1
(156)
Figure 9 compares the modified analytical model and numerical data, where both agree well for various . Finally, we express a probability distribution function as
f
1 2
n 1 2
1 tanh 2
n4 2
2 tanh m 2 n 1 m
m! n 1 n 2 cosh 2 m 0 2 2 2 1 ln 1 2 1 2 n 1 1 exp 1 2 2 n 2.5 n 2.5
2
(157)
= 1e-8 = 0.4 = 0.8 Gauss( = 0.25)
0.7 n=5
0.6
f()
0.5 0.4 0.3 0.2 0.1 0.0 -2
-1
0
1
2
3
Figure 9. Comparison with analytical data with modified average model with numerical data.
Kunihiro Suzuki
56
Therefore, we use a normal distribution with an average of
1 1 ln 2 1
2 n 1 ,
and a
1
standard deviation of n2.5 . Consequently, we set the normalized variable as
z
1 1 rxy ln 2 1 rxy
1 1 1 ln 2 1 2 n 1 1 n 2.5
(158)
Then, it follows a standard normal distribution given by f z
z2 exp 2 2 1
(159)
Based on the above analysis, we predict the range of correlation factor. First of all, we want to obtain the central value of denoted as cent , which is
r
commonly regarded as xy . We can improve the accuracy as follows. cent should satisfy the relationship given by
cent 1 1 cent 1 1 rxy ln ln 2 1 cent 2 n 1 2 1 rxy
(160)
If there is no the second term in the left side of Eq. (160), we obtain
cent rxy
(161)
We construct a function given by
1 1 1 1 rxy fcent ln ln 2 1 2 n 1 2 1 rxy
(162)
A Probability Distribution for a Correlation Factor
57
cent is a root of the equation given by
fcent cent 0
(163)
r We can expect this cent is not far from the xy from Figure 6 and Figure 9. Therefore, we can expect
fcent f rxy fcent ' rxy rxy
(164)
Therefore, we obtain
0 f rxy f ' rxy cent rxy
(165)
Therefore, we obtain cent as
cent rxy
fcent rxy
fcent ' rxy
(166)
where
f cent rxy
f cent ' rxy
1 rxy 1 ln ln 1 rxy 2
rxy 1 1 rxy ln 2 n 1 2 1 rxy
rxy
2 n 1
(167)
1 1 1 1 2 1 rxy 1 rxy 2 n 1 1 1 2 1 rxy 2 n 1
Therefore, we obtain
(168)
Kunihiro Suzuki
58
cent
1 2 n 1 rxy rxy 1 1 1 rxy2 2 n 1
1 1 rxy2 1 1 2 1 rxy 2 n 1
rxy
(169)
Next, we evaluate the range of the correlation factor range. The related estimated range is expressed by 1 1 rxy ln 2 1 rxy
1 1 1 1 1 rxy ln ln zP n 2.5 2 1 2 n 1 2 1 rxy
1 zP n 2.5
(170)
Form this equation, we want to obtain the final form below. min max
(171)
We cannot obtain analytical forms for min and max from the Eq. (170). Therefore, we show approximated analytical forms and a numerical procedure to obtain rigorous one. 2 n 1 We first neglect a correlation term of , and set max , min under this treatment as max 0 , min 0 .
max 0 can be evaluated from the equation given by 1 1 max 0 1 1 rxy ln ln 2 1 max 0 2 1 rxy
zP
1 n 2.5
(172)
We introduce a variable as 1 1 rxy GH ln 2 1 rxy
zP
1 n 2.5
(173)
A Probability Distribution for a Correlation Factor
59
We then have an analytical form for max 0 as below.
1 max 0 ln 2GH 1 max 0 1 max 0 e 2 GH 1 max 0 1 max 0 1 max 0 e 2GH
1 e 2 GH
max 0
max 0
e
2 GH
(174)
1
e 2 GH 1 e 2 GH 1
Similarly, we obtain
min 0
e2GL 1 2G e L 1
(175)
where
1 1 rxy GL ln 2 1 rxy
zP
1 n 2.5
(176)
Next, we include the correlation factor. max can be evaluated from the equation given by 1 1 max ln 2 1 max
max 1 1 rxy ln 2 n 1 2 1 rxy
zP
1 n 2.5
(177)
We construct a function given by 1 1 1 1 rxy f ln ln 2 1 2 n 1 2 1 rxy
zP
1 n 2.5
(178)
Kunihiro Suzuki
60
max is a root of the equation given by
f max 0
(179)
We can expect that this max is not far from the max 0 from Figure 6 and Figure 9. Therefore, we can expect
f f max 0 f ' max 0 max 0
(180)
We then obtain
0 f max 0 f ' max 0 max max 0
(181)
Therefore, we obtain max as
max max 0
f max 0
(182)
f ' max 0
where f max 0
max 0 1 1 max 0 1 1 rxy ln ln 2 1 max 0 2 n 1 2 1 rxy
1 n 2.5
max 0
2 n 1
f ' max 0
zP
(183)
1 1 1 1 2 1 max 0 1 max 0 2 n 1 1 1 max 0 2
Therefore, we obtain
1 2 n 1
(184)
A Probability Distribution for a Correlation Factor 1 2 n 1
max max 0
1 2 1 max 0
1 2 n 1
61
max 0
1
2 1 max 0
1
2 1 max 0
1 2 n 1
max 0 (185)
Similarly, we can obtain min . min can be evaluated from the equation given by
min 1 1 min 1 1 rxy ln ln 2 1 min 2 n 1 2 1 rxy
1 zP n 2.5
(186)
We introduce a function for min given by 1 1 1 1 rxy g ln ln 2 1 2 n 1 2 1 rxy
zP
1 n 2.5
(187)
min is given by
min min 0
g min 0
(188)
g ' min 0
where g ' min 0
1 1 min 0 2
1 2 n 1
(189)
We then obtain 1
min
2 1 min 0
1 2 1 min 0
1 2 n 1
min 0
(190)
Kunihiro Suzuki
62
8. TESTING OF CORRELATION FACTOR The sample correlation factor is not 0 even when correlation factor of the population is 0. We can test it setting 0 in Eq. (125), and obtain n 1 2 1 r2 n2 2
f r
n4 2
(191)
We introduce a variable t n2
r
(192)
1 r2
The corresponding increment is 1 r2 r
2r
2 1 r 2 dr 1 r2
dt n 2
1 r2 r2 n2 n2
1 r2 1 r2 1 3 2 2
1 r
(193) dr
dr
The relationship between t and r is modified from Eq. (192) as
t 2 1 r 2 n 2 r 2
(194)
2 We solve Eq. (194) with respect to r , and obtain
r2
t2 n 2 t 2
Therefore, the probability distribution for r is given by
(195)
A Probability Distribution for a Correlation Factor
63
n 1 n4 2 1 r 2 2 dr f r dr n2 2 n 1 n4 1 2 1 r2 2 dt n2 n 2 3 2 1 r2 2
n 1 n 1 2 1 r 2 2 dt n 2 n 2 2 n 1 n 1 2 1 t2 2 1 dt n 2 n 2 n 2 t 2 2 n 1 n 1 1 n2 2 2 dt n 2 n 2 n 2 t 2 2
1
n 1
2 n 1 1 1 2 dt t2 n 2 n 2 1 2 n2 n 1 n 1 2 1 2 1 t 2 dt n2 n 2 n 2 2 n 2 1 n 2 1 2 2 2 1 1 t dt n2 n 2 n 2 2
(196)
Therefore, the probability distribution function f r is converted to
f t
n 2 1 n 2 1 2 2 2 1 t 1 n2 n 2 n 2 2
(197)
This is a t distribution with a freedom of n 2 . When we obtain sample correlation factor r from n samples, we evaluate t given by t n2
r 1 r2
(198)
Kunihiro Suzuki
64
We decide a prediction probability P and obtain the corresponding P point of tP with a t distribution with a freedom of n 2 , and compare this with Eq.(198), and perform a test as t t p t t p
Two variables do not dependent each other Two variables depend on each other
(199)
9. TESTING OF TWO CORRELATION FACTOR DIFFERENCE We want to test that the two groups correlation factors are the same or not. We treat two groups A and B and corresponding correlation factors and sample numbers are rA and rB , and nA and nB . We introduce variables given by
1 1 rA z A ln 2 1 rA
(200)
1 1 rB zB ln 2 1 rB
(201)
The average of z A and z B are 0. The corresponding variances are given by
A
1 nA 2.5
(202)
B
1 nB 2.5
(203)
2
2
We introduce a variable as Z z A zB 1 1 rA 1 1 rB ln ln 2 1 rA 2 1 rB 1 1 rA 1 rB ln 2 1 rB 1 rA
(204)
A Probability Distribution for a Correlation Factor
65
The rA rB is identical to Z 0 . The average of
Z is zero, and variance 2 is
2 A B 2
2
(205)
Therefore, we can evaluate corresponding z P for a normal distribution and we can judge whether the correlation factor is the same if the following is valid.
Z zp
(206)
SUMMARY The probability distribution function of two variables is given by
f x1 , x2
1 exp 2 1 2 2 2 1 2 1 2 1
2
2 2 x1 2 x1 x2 x2 2 2 2 2 2 1 2 1
This can be extended to multi variables given by f x1 , x2 ,
, xm
1
2
m 2
D2 exp 2
where 11 2 2 21 2 m1
and
12 2
1m2
2 22
2 2m
m 22
D is given by
2 mm
Kunihiro Suzuki
66
11 2 12 2 21 2 22 2 , xm m m1 2 m 2 2
D 2 x1 1 , x2 2 ,
1m 2 x1 1 2 m 2 x2 2
mm 2 xm m
The probability distribution for a sample correlation factor is given by n 1 2
1 1 r f r 2
2
n4 2
n 1 n 2 2 2
m 0
2r m 2 n 1 m m!
2
Introducing a variable , the distribution function is converted to 1 r
1
ln 2 1 r Performing numerical study, we show that the average and variance of this function are given by
1 1 ln 2 1
2
2 n 1
1 n 2.5
Therefore, the distribution is approximately expressed by
f
exp 2 2 2 2 1
2
The central correlation factor in the population cent is given by
cent
1 1 rxy2 1 1 2 1 rxy 2 n 1
rxy
A Probability Distribution for a Correlation Factor We then obtain an error range as
min max where
1
max
2 1 max 0
1 2 1 max 0
1 2 n 1
max 0
1
min
2 1 min 0
1 2 1 min 0
1 2 n 1
max 0
e2GH 1 2G e H 1
min 0
e2GL 1 2G e L 1
1 1 rxy GH ln 2 1 rxy 1 1 rxy GL ln 2 1 rxy
zP
min 0
1 n 2.5
1 zP n 2.5
The correlation factor distribution for 0 is given by n 1 2 1 r2 f r n2 2
n4 2
67
Kunihiro Suzuki
68 Introducing a variable t n2
r 1 r2
We modify the distribution as
n 2 1 n 2 1 2 2 1 1 t 2 f t n 2 n 2 n 2 2 This is a t distribution with a freedom of n 2 . We can hence compare this with the P point t P with a freedom of n 2 .
Chapter 3
REGRESSION ANALYSIS ABSTRACT We studied the significance of relationship between two variables by evaluating a correlation factor in the previous two chapters. We then predict the objective variable value for given explanatory variable values, which is called as the regression analysis. We also evaluate the accuracy of the regression line, and predictive a range of regression and objective value. We further evaluate the predictive range of factors of regression line, the predictive range of regression, and the range of object values. We first focus on the linear regression to express the relationship, and then extend it to various functions.
Keywords: regression line, linear function, logarithmic regression, power law function, multi-power law function, leverage ratio, logistic curve
1. INTRODUCTION We can evaluate two variables such as smoking and life expectancy. We can obtain a corresponding correlation factor. The correlation factor value expresses the strength of the relationship between the two variables. We further want to predict the life expectancy of a person who takes one box tobacco per a day. We should express the relationship between two variables quantitatively, which we treat in this chapter. The relationship is expressed by a line in general, and it is called as regression line. Further, the relationship is not expressed with only a line, but may be with an arbitrary function. We perform matrix operation for the analysis in this chapter. The matrix operation is summarized in Chapter 15 of volume 3.
Kunihiro Suzuki
70
2. PATH DIAGRAM It is convenient to express the relationship between variables qualitatively. A path diagram is sometimes used to do it. The evaluated variables are expressed by a square, and the correlation is expressed by an arrowed arch line, and the objective and explanatory variables are connected by an arrowed line where the arrow is terminated to the objected variable. The error is expressed by a circle. Therefore, the objective variable y and the explanatory variable x are expressed as shown in Figure 1.
Figure 1. Path diagram for regression.
3. REGRESSION EQUATION FOR A POPULATION First, we consider a regression for a population, where we obtain N sets of data x1 , y1 , x2 , y2 , , xN , yN . We obtain a scattered plot as shown in the previous chapter. We want to obtain a corresponding theory that expresses relationship between the two variables. Table 1. Relationship between secondary industry number and detached house number Prefecture ID 1 2 3 4 5 6 7 8 9 10 11 12
Scondary industry number 630 1050 350 820 1400 1300 145 740 1200 200 450 1500
Detached house number 680 1300 400 1240 1300 1500 195 1050 1250 320 630 1750
Regression Analysis
71
We obtain the relationship between a secondary industry number and a detached house number as shown in Table 1. Figure 2 shows the scattered plot of the Table 1. We can estimate the correlation factor as shown in the previous chapter, and it is given by
xy 0.959
(1)
That is, we have a strong relationship between the secondary industry number and the detached house number.
Detached house number
2000
1800 y = 1.0245x + 132.55
1600 1400 1200 1000 800 600 400 200 0 0
500
1000
1500
2000
Secondary industry number Figure 2. Dependence of detached house number on secondary industry number.
We assume that the detached house number data yi is related to the secondary industry data xi as yi a0 a1 xi ei Yi ei
(2)
where a0 and a1 are a constant number independent on data i . ei is the error associated with the data number i . Yi is the regression value and is given by Yi a0 a1 xi
(3)
2 N 0, e We assume that the ei follows the normal distribution with , where
Kunihiro Suzuki
72 1 N 2 yi Yi N i 1
e 2
(4)
and it is independent on variable xi , that is Cov xi , ei 0
(5)
We show how we can derive the regression line. The deviation from the regression line and data ei is expressed by
ei yi Yi
yi a0 a1 xi
(6)
2 A variance of ei is denoted as e and is given by 1 2 e1 e22 eN 2 N 2 2 1 y1 a0 a1 x1 y2 a0 a1 x2 N
e 2
y N a0 a1 xN
2
(7)
2 We decide variables a0 and a1 so that e has the minimum value. We first differentiate e 1 a0 a0 N 2
1 N
e 2 with respect to a0 and set it to be zero as follows
N
y a i
i 1
0
N
2 y a i
i 1
1 2 N
N
y i 1
i
0
a1
a1 xi
2
a1 xi
1 N
2 y a1 x a0
N
x i 1
i
a0
(8)
0
where
y
1 N yi N i 1
(9)
Regression Analysis
x
1 N xi N i 1
73 (10)
Therefore, we obtain
y a0 a1 x
Differentiating
(11)
e 2 with respect to a1 , we obtain
2 e 1 N yi a0 a1 xi a1 a1 N i 1 2
1 N 2 yi a0 a1 xi xi N i 1
(12)
0
Substituting Eq. (11) of a0 into Eq. (12), we obtain 1 N xi yi a0 a1 xi N i 1 1 N xi yi a1 xi y a1 x N i 1 1 N xi yi y a1 xi x 0 N i 1
(13)
It should also be noted that 1 N x yi y a1 xi x 0 N i 1
(14)
Subtracting Eq. (14) from Eq. (13), we obtain 1 N xi x yi y a1 xi x 0 N i 1
We then obtain
(15)
Kunihiro Suzuki
74 a1
xy 2
(16)
xx 2
where
xx 2
xy 2
1 N 2 xi x N i 1
1 N yi y xi x N i 1
Eq. (16) can be expressed as a function of a correlation factor
a1
(17)
(18)
xy
as
xy 2 xx 2
xy xy
xx 2 yy 2 xy 2 xy 2
xx 2
(19)
yy 2 xx 2
where
xy
xy 2
(20)
xx 2 yy 2
a0 can be determined from Eq. (11) as
a0 y a1 x y xy
yy 2 2
xx
x
Finally, we obtain a regression line as
(21)
Regression Analysis
Y y xy xy
yy 2 xx 2
yy 2 xx 2
x xy
yy 2 xx 2
75
x
(22)
x x y
This can be modified as a normalized form as Y y
yy 2
x x xy 2 xx
(23)
Eq. (22) can then be expressed with a normalized variable as
zY xy z x
(24)
where zY
zx
Y y
yy 2
(25)
x x
xx 2
(26)
Now, we can evaluate the assumption for e . The average of e is denoted as e and can be evaluated as 1 N yi Yi N i 1 1 N yi a0 a1 xi N i 1 1 N yi y a1 xi x N i 1 0
e
This is zero as was assumed. The covariance between x and e is given by
(27)
Kunihiro Suzuki
76
1 N xi x ei N i 1 1 N xi x yi a0 a1 xi N i 1 1 N xi x yi y a1 xi x N i 1
Cov x, e
xy a1 xx 2
2
2
xy
xy 2 2
xx
xx 2
0
(28)
Therefore, both variables are independent as is assumed. The variance of e can be also evaluated as e 2 e12 e2 2 eN 2 N 1 2 2 2 y1 a0 a1 x1 y2 a0 a1 x2 yn a0 a1 xN N 2 2 2 1 y1 y a1 x1 x y2 y a1 x2 x yn y a1 xn x N 2 2 2 1 y1 y y2 y y N y N 2a1 y1 y x1 x y2 y x2 x y N y xN x N 2 a 2 2 2 1 x1 x x2 x xN x N
yy 2a1 xy a12 xx 2
yy
2
2
2
yy 2 xx 2
yy 2 yy xy 2
yy
2
2
2 2
xy xy
2
2 yy xy xx 2 2 xx
xy 2 xx 2 yy 2
yy xy 2 2
1 2
xy
(29)
4. DIFFERENT EXPRESSION FOR CORRELATION FACTOR Let us consider the correlation between yi and the regression line data denoted by Yi , where
Regression Analysis
Yi
yy 2 xx 2
77
xy xi x y
(30)
The average and variance of Y are given by
Y y
YY 2
(31)
yy 2 xx
2
xy2 xx 2 (32)
The correlation between yi and Yi is then given by
1 N yi y Yi y N i 1 2 yy 2 YY
2 1 N yy y x i y 2 xy i x y y N i 1 xx
yy 2 yy
2
xx 2
xy
xx
2
xy2 xx 2
1 N yi y xi x N i 1
yy 2 xx
yy 2
2
xy yy 2 xx 2
1 yi y xi x N i 1 N
yy 2 xx 2
xy
(33)
It should be noted that the correlation between y and regression line data Y is equal to the correlation factor between y and x .
5. FORCED REGRESSION In some cases, we want to set some restrictions, that is, we sometimes want to set a constant boundary value at 0 for x 0 . In this special case, we can obtain a1 by setting a0 0 in Eq. (12) and obtain
Kunihiro Suzuki
78 e 1 N 2 yi a1 xi a1 a1 N i 1 2
1 N 2 yi a1 xi xi N i 1
(34)
0
This can be modified as 1 N yi a1 xi xi N i 1 1 N yi y y a1 xi x x xi x x N i 1 1 N yi y y a1 xi x a1 x xi x N i 1 1 N yi y y a1 xi x a1 x x N i 1
(35)
xy a1 xx y x a1 x 2 2
2
0
Therefore, we obtain
a1
xy 2 y x xx 2 x 2 xy
xx 2 yy 2 2
xy
xy 2 y x
(36)
xx 2 x 2 xy xx 2 yy 2 y x xx 2 x 2
We then obtain a corresponding regression equation as
Y
xy xx 2 yy 2 x y xx 2 x 2
x (37)
In general, we need not to set a0 as 0, but set is as a given value, and the corresponding a1 is then given by
Regression Analysis
79
xy xx 2 yy 2 x y a0 x
a1
2 xx
x2
(38)
and the regression equation is given by
Y
xy xx 2 yy 2 x y a0 x
2 xx
x2
x a0
(39)
6. ACCURACY OF THE REGRESSION The determination coefficient below is defined and frequently used to evaluate the accuracy of the regression, which is given by R2
r 2 yy 2
(40)
2 where r is the variance of Y given by
r 2
2 1 N Yi y N i 1
(41)
This is the ratio of the variance of the regression line to the variance of y . If whole data are on the regression line, that is, the regression is perfect,
yy2
2 is equal to r , and
R2 1 . On the other hand, when the deviation of data from the regression line is significant, the determinant ratio is expected to become small. We further modify the variance of y related to the regression value as 1 N 1 N 1 N 1 N
2 yy
y
y
y
Yi Yi y
N
i
i 1 N
i
i 1 N
y
i
i 1 N
e i 1
2 i
e r 2
Yi
2
1 N
2
2
1 N
2
Y i
i 1
Y N
i 1
i
(42)
N
y
2
y
2
1 N
2
1 2 N
N
y i 1
N
e a i 1
i
0
i
Yi Yi y
a1 xi
Kunihiro Suzuki
80
Therefore, the determination coefficient is reduced to
r 2 R 2 e r 2 2
1
e 2 yy 2
(43)
Substituting Eq. (29)into Eq. (43), we obtain
R2 1 xy
yy 2 yy
2
1 2
xy
2
(44)
Consequently, the determination coefficient is the square of the correlation factor, and has a value range between 0 and 1. From Eqs. (43) and (44), we can easily evaluate
e 2 as
e 2 1 xy2 yy 2
(45)
7. REGRESSION FOR SAMPLE We can perform similar analysis for the sample data. We assume n samples. The regression line can be expressed by
Yˆ aˆ0 aˆ1 x
(46)
The form of the regression line is the same as the one for population. However, the coefficients aˆ0 and aˆ1 depend on extracted samples, and are given by
aˆ1
S yy
2
S xx
2
rxy
(47)
Regression Analysis
81
a0 y a1 x y
S yy
2
S xx
2
rxy x
(48)
Finally, we obtain a regression line for sample data as
Yy
S yy
2
S xx
2
rxy x x
(49)
where sample averages and sample variances are given by x
1 xi n
(50)
y
1 yi n
(51)
S xx 2
S yy 2
S xy 2
x
x
i
2
(52)
n
y
i
y
2
(53)
n
x x y i
i
y
n
S xy
(54)
2
rxy
S xx S yy 2
2
(55)
As in the population, the sample correlation factor is expressed by a different way as 1 yi y Yi y rxy n 2 2 S yy SYY
(56)
Kunihiro Suzuki
82
The determinant factor for the sample data is given by R2
Sr
2
S yy
1
2
Se
2
S yy
2
(57)
where Sr 2
y
i
y Yi y n
(58)
and it can be proved as the similar way as population as
S yy Sr Se 2
2
2
(59)
The variances are ones for sample data, and hence we replace them by unbiased ones as n R
*2
1
e n
T
Se
2
2
(60)
S yy
a0 and a1 are used for Se 2 and hence freedom is decreased by two, and the corresponding freedom e is given e n 2
y
(61) 2
is used in S yy , and hence the freedom is decreased by one, and the corresponding freedom T is given T n 1
R*2 is called as a degree of freedom adjusted coefficient of determination.
(62)
Regression Analysis
83
8. RESIDUAL ERROR AND LEVERAGE RATIO We study a residual error and a leverage ratio to evaluate the accuracy of the regression. The residual error is given by
ek yk yˆk
(63)
The normalized one is then given by
ek
zek
se
(64)
2
This follows a standard normal distribution with
N 0,12
. We can set a certain critical
e'
value for k , and evaluate whether it is outlier or not. We can evaluate the error of the regression on the different stand point of view, called
h
it as a leverage ratio, and denote it as kk . This leverage ratio expresses how each data contribute to the regression line, and is evaluated as follows. The predictive value for
k -th sample is given by
Yk aˆ0 aˆ1 xk
y aˆ1 xk x
We substitute the expression of
(65)
ˆ1 given by, a
S xy 2
aˆ1
S xx 2
into Eq. (66), and obtain
(66)
Kunihiro Suzuki
84 S xy 2
Yk y
S xx 2
xk x
1 n xi x yi y 1 n n yi i 1 xk x 2 n i 1 S xx n
1 n yi n i 1
x
i
i 1
x yi
nS xx 2
(67)
xk x
This can be expressed with a general form as Yk hk1 y1 hk 2 y2
hkk yk
hkn yn
(68)
Comparing Eq. (67) with (68), we obtain hkk as
hkk
1 xk x 2 n nS xx
2
(69)
hkk is the variation of the predicted value when yk changed by 1. The predicted value should be determined by the whole data, and the variation should be small to obtain an accurate regression. The leverage ratio increases with increasing the deviation of x from the sample
x , and it is determined only with xi independent on yi . Therefore, the data far from the average of x influences much to the regression. This means that we should average of
not use the data much far from the x The leverage ratio range can be evaluated below. It is obvious that the minimum value is given by
hkk
1 n
(70)
We then consider the maximum value. We set xi x X i . hkk must have the maximum value for h11 of hnn . We assume hnn as the maximum value, which does not eliminate generality. We have the relationship given by
Regression Analysis n
X i 1
i
0
85 (71)
Extracting the term X n form Eq.(71), we obtain n 1
Xn Xi
(72)
i 1
We then have 1 hnn 1
1 xn x 2 n nS xx
2
1 2 n 1 S xx 2 xn x 2 nS xx
1 n 1 S xx 2 X n 2 2 nS xx
n X i2 1 2 i 1 2 n 1 Xn n nS xx n 1 X i2 X n2 1 2 i 1 2 n 1 Xn n nS xx
1 n 1 n 1 2 X n 2 Xi n 2 nS xx n i 1 1 nS xx 2
(73)
2 n 1 n 1 Xi n 1 X i 2 i n1 1 0 n i 1
Therefore, we have hnn 1
(74)
Consequently, we obtain
1 hkk 1 n Sum of the leverage ratio is given by
(75)
Kunihiro Suzuki
86 n
h k 1
n x x 1 k 2 nS xx k 1 n k 1 n
kk
2
S xx 2
1
(76)
S xx 2
2
Therefore, the average value for hkk is denoted as hkk
hkk
2 n
(77)
Therefore, we use a critical value for hkk as
hkk hkk
(78)
2 is frequently used, and evaluate whether it is outlier or not. From the above analysis, we can evaluate the error of the regression by zek or hkk . Merging these parameters, we also evaluate t value of the residual error given by zek
t
(79)
1 hkk
This includes the both criticisms of parameters of
zek
or hkk , and we can set a certain
value for t , and evaluate whether it is outlier or not. When hkk approaches to its maximum value of 1, the denominator approaches to 0, and t becomes large. Therefore, the corresponding data is apt to be regarded as an outlier.
9. PREDICTION OF COEFFICIENTS OF A REGRESSION LINE When we extract samples, the corresponding regression line varies, that is, the factors
aˆ 0
and
aˆ1
vary depending on the extracted samples. We study averages and variances of
aˆ 0
and
aˆ1
in this section.
Regression Analysis
87
We assume that the population set has a regression line given by
yi a0 a1 xi ei
where
a0
and
a1
for i 1,2, , n
(80)
are established values.
0, . xi
ei
is independent on each other and follow a
2
normal distribution with (80) is then only
ei
is a given variable. The probability variable in Eq.
.
We consider the case where we obtain n set of data and
aˆ1
xi , yi . The corresponding aˆ 0
are given by S xy 2
aˆ1
S xx 2
n
x i 1
i
x yi y nS xx 2
n
xi x
i 1
nS xx
yi
n
xi x
a
i 1
2
nS xx 2
0
a1 xi ei
(81)
aˆ0 y aˆ1 x n xi x 1 n y y x i 2 i n i 1 i 1 nS xx n 1 x x i 2 x yi nS xx i 1 n n 1 x x i 2 x a0 a1 xi ei nS xx i 1 n
We notice that only
ei
(82)
is a probability variable, and evaluate the expected value as
Kunihiro Suzuki
88 n
E aˆ1 i 1
xi x 2
nS xx n
a1
n
a0 a1 xi i 1
xi x
i 1
nS xx
xi
n
xi x
xi x
a1 i 1
2
2
nS xx
xi x nS xx 2
n
for i 1
E ei
xi x nS xx
a1
2
x 0
(83)
E aˆ0 E y aˆ1 x y E aˆ1 x y a1 x a0
(84)
n x x V aˆ1 V i 2 yi i 1 nS xx 2
x x i 2 V yi i 1 nS xx n
2
x x i 2 V ei i 1 nS xx n
2
x x 2 i 2 e i 1 nS xx n
e 2 2 nS xx
(85)
n 1 x x V aˆ0 V i 2 i 1 n nS xx n 1 x x i 2 nS xx i 1 n
x yi 2
x V yi 2
n 1 x x i 2 x V ei nS xx i 1 n 2 n x x 2 2 x x 1 2 2 i2 2 x i x 2 e 2 n S xx i 1 n n 2 S xx 2 1 x 2 2 e n nS xx
(86)
Regression Analysis n n x x 1 x x Cov aˆ1 , aˆ0 Cov i 2 yi , i 2 nS xx i 1 n i 1 nS xx n x x 1 x x E i 2 i 2 x yi2 i 1 nS xx n nS xx
89
x yi
xi x 1 xi x x E yi2 2 2 nS xx i 1 nS xx n n x x 1 x x i 2 i 2 x E ei2 nS xx i 1 nS xx n n x x 1 x x 2 i 2 i 2 x e nS xx i 1 nS xx n n x x x x 2 i 2 i 2 x e nS xx i 1 nS xx n
x 2
nS xx
e 2
(87)
aˆ
We assume that 1 and normalized variables given by
z
aˆi ai
aˆ 0
follow normal distributions, and hence assume the
for i 0,1
V ai
(88)
This follows a standard normal distribution. Therefore, the predictive range is given by
aˆ1 z p
e 2 nS xx 2
a1 aˆ1 z p
1 x2 aˆ0 z p n nS 2 xx
e 2 nS xx 2
2 e a0 aˆ0 z p
(89) 1 x2 2 n nS xx
2
2 e
(90)
s 2
In the practical cases, we do not know e , and hence use e instead. Then, the normalized variables are assumed to follow a t distribution function with a freedom of
n 2 . Therefore, the predictive range is given by
Kunihiro Suzuki
90 aˆ1 t p n 2
se
2
nS xx 2
1 x2 aˆ0 t p n nS 2 xx
a1 aˆ1 t p n 2
2 se a0 aˆ0 t p
se
2
nS xx 2
1 x2 2 n nS xx
(91) 2 se
(92)
10. ESTIMATION OF ERROR RANGE OF REGRESSION LINE We can estimate the error range of value of a regression line. The expected regression value for value of
x0
is denoted as Y0 .
We assume that we obtain n set of data given by
xi , yi . The corresponding regression value is
Yˆ0 aˆ0 aˆ1 x0
(93)
The expected value is given by E Yˆ0 E aˆ0 E aˆ1 x0 a0 a1 x0 Y0
The variance of
(94)
Yˆ0 is given by
V Yˆ0 V aˆ0 aˆ1 x0
V aˆ0 2Cov aˆ0 , aˆ1 x0 V aˆ1 x02 1 x2 x 2 2 2x x 2 2 2 e 0 2 e 0 2 e n nS nS xx nS xx xx 2 1 x x 2 0 2 e nS xx n
We assume that it follows a normal distribution and hence the predicted value of is given by
(95)
Y0
Regression Analysis 1 x0 x 2 2 ˆ e Yo Yˆ0 z p Y0 z p 2 nS xx n
If we use
se 2
instead of
91
1 x0 x 2 2 e 2 nS xx n
(96)
e 2 for the variance, we need to use a t distribution with a
freedom of n 2 , and the predictive range is given by 1 x x 2 2 Yˆ0 t p n 2 0 2 se Y0 Yˆ0 t p n 2 nS xx n
1 x0 x 2 2 se 2 nS xx n
(97)
When an explanatory variable x changes from x0 to x1 , the corresponding variation ˆ ˆ of objective variable Yˆ from Y0 to Y1 . The variation of the objective variable is then given by
Yˆ Yˆ1 Yˆ0
(98)
Considering the error range, the total variation
D is given by
1 x x 2 2 D Yˆ1 t p n 2 1 2 se Yˆ0 t p n 2 nS xx n ˆ ˆ Y1 Y0 t p n 2 Yˆ t p n 2
1 x0 x 2 2 se 2 nS xx n
1 x0 x 2 2 1 x1 x 2 2 se se 2 2 nS xx nS xx n n
1 x0 x 2 2 1 x1 x 2 2 se s e 2 2 nS xx nS xx n n
(99)
If this is the positive, we can judge that the objective variable certainly changes. Therefore, we can evaluate this as below. Yˆ t p n 2
1 x0 x 2 2 1 x1 x 2 2 se se 2 2 nS xx nS xx n n
(100)
Kunihiro Suzuki
92
Next, we consider that the predictive value of
y
instead of the regression value of
Y for a given x of x0 . We predict yi ao a1 xi ei instead of yi ao a1 xi . Therefore, the variance is given by
1 x0 x 2 1 2 2 nS xx n
The predictive range for
(101)
y0
is given by
1 x0 x 2 2 ˆ e yo Yˆ0 z p Y0 z p 1 2 n nS xx
se 2
If we use a freedom of
instead of
e 2
1 x0 x 2 2 1 e 2 nS xx n
(102)
for the variance, we need to use a t distribution with
n 2 , and the predictive range is given by
1 x0 x 2 2 ˆ se yo Yˆ0 t p n 2 Y0 t p n 2 1 2 nS xx n
1 x0 x 2 2 1 se 2 nS xx n
(103) When we obtain the data set x, y , we can evaluate corresponding expected objective value as well as its error range for the x . We can compare these values with given y , and can judge whether it is reasonable or not. Let us consider an example. Table 2 shows scores of mathematics and physics of 40 members, where the score of mathematics is denoted as x and that of physics as y . The corresponding average and sample variances are given by
x 52.38 y 72.78
(104)
(105)
Regression Analysis
93
S xx 213.38
(106)
S yy 176.52
(107)
S xy 170.26
(108)
2
2
2
The sample correlation factor is given by
rxy 0.88
(109)
The center value for the correlation factor is given by
cent 0.87
which is a little smaller than
(110)
rxy
. The correlation factor range is predicted as
0.78 0.93
(111)
The regression line is given by
Yˆ 30.98 0.80x
(112)
The determinant factor and one with considering the freedom are given by R 2 0.77
(113)
R *2 0.76
(114)
The predicted regression and objective values are shown in Figure 3. The error range increases with the distance from the average location to the x increases. The original data are almost within the range of predicted objected values.
Kunihiro Suzuki
94
Table 2. Score of mathematics and physics of 40 members ID Mathematics 1 34 2 56 3 64 4 34 5 62 6 52 7 79 8 50 9 46 10 49 11 42 12 60 13 39 14 45 15 40 16 42 17 65 18 41 19 50 20 54
Physics 57 78 76 64 78 82 91 67 64 58 62 72 63 65 68 73 82 67 66 80
ID Mathematics 21 49 22 35 23 59 24 27 25 77 26 24 27 58 28 47 29 56 30 75 31 68 32 70 33 51 34 69 35 30 36 38 37 85 38 65 39 46 40 62
Physics 68 49 87 56 96 49 76 63 78 96 100 76 72 98 57 68 95 74 56 84
150 Data Center value Regression y value
y
100
50
0
0
20
40
x
60
80
100
Figure 3. Predicted regression and objective values. The original data is also shown.
Regression Analysis
95
11. LOGARITHMIC REGRESSION We assume the linear regression up to here. However, there are various cases where an appropriate fitted function is not a linear line. When the target function is an exponential one, given by
y beax
(115)
Taking logarithms for both sides of Eq. (115), we obtain ln y ax ln b
(116)
Converting variables as
u x v ln y
(117)
We can apply the linear regression. When
y
is proportional to power law of x , that is, it is given by
y bx a
(118)
Taking logarithms for both sides of Eq.(118), we obtain ln y a ln x ln b
(119)
Converting variables as
u ln x v ln y We can apply the linear regression.
(120)
Kunihiro Suzuki
96
12. POWER LAW REGRESSION We assume the data are expressed with a power law equation given by
y a0 a1 x a2 x 2
am x m
(121)
The error is given by 1 2 e1 e22 eN 2 N 1 N yi a0 a1 xi a2 xi 2 N i 1
e 2
am xi
m
(122) 2
We then evaluate
e 0, e 0, e 0, , e 0 a0 a1 a2 am 2
2
2
2
(123)
This gives us m 1 equations, and we can determine a0 , a1 , a2 , , am as follows.
e 2 N yi a0 a1 xi a2 xi 2 a0 N i 1 2
am xi m 0
(124)
This gives
a0 y a1 x a1 x 2
am x k
(125)
We also have e 2 N yi a0 a1 xi a2 xi 2 ak N i 1 2
am xi m xik
2 N yi y a1 xi x a2 xi 2 x 2 N i 1 2 N yi y a1 xi x a2 xi 2 x 2 N i 1 0
am xi m x m xik
am xi m x m xik x k
(126)
Regression Analysis
97
We then obtain K x11 K x 21 K xm1
K x1m a1 K y1 K x 22 a2 K y 2 K xmm am K ym
K x12 K x 22 K xm 2
(127)
where
x x
(128)
(129)
K xkl
1 N k xi x k N i 1
K yk
1 N yi y xik xk N i 1
l i
l
13. REGRESSION FOR AN ARBITRARILY FUNCTION We further treat an arbitrarily function with m parameters denoted as
f a1 , a2 ,
, am , x
e 2 is expressed by
The error
e 2
.
1 N yi f a1 , a2 , N i 1
, am , xi
2
(130)
Partial differentiating this with respect to a1 , we obtain e 1 a1 N 2
N
2 y i 1
i
f a1 , a2 ,
, am , xi
f a1 , a2 ,
, am , xi
a1
0
(131)
Arranging this equation, we obtain 1 N
N
y i 1
i
f a1 , a2 ,
, am , xi
f a1 , a2 ,
We set the left side of Eq. (132) as
, am , xi
a1 g1 a1 , a2 ,
0
, am
, that is,
(132)
Kunihiro Suzuki
98 g1 a1 , a2 ,
, am
1 N
N
y
i
i 1
f a1 , a2 ,
, am , xi
f a1 , a2 ,
, am , xi
a1
(133)
Performing similar analysis for the other parameters, we obtain
g1 a1 , a2 , , am 0 g 2 a1 , a2 , , am 0 g a , a , , a 0 m m 1 2
(134)
where g k a1 , a2 ,
, am
1 N
N
y
i
i 1
f a1 , a2 ,
, am , xi
f a1 , a2 ,
, am , xi
ak
(135)
The parameters a1 , a2 , , am can be solved numerically as follows. a ,a , We set initial value set as 0
0
1
2
, am
0
.
a , a , 1
The deviation from the updating value
a1 , a2 ,
1
, am
1
1
2
is given by
, am a1 a1 , a2 a2 , , am am 1
0
1
0
1
0
(136)
Let us consider the equation of gk a1 , a2 , , am 0 , which we want to obtain the corresponding set of a1 , a2 , , am . However, we do not know the set. We assume that the
a , a , 1
tentative set
1
1
2
0 g k a1 , a2 ,
1
is close to the one, then
, am
g k a1 , a2 , 0
, am
0
, am
0
ga
k
1
where
a1
g k a2 a2
g k am am
(137)
Regression Analysis g k a1 , a2 ,
99
, am
al
1 N
f a1 , a2 , , am , xi f a1 , a2 , , am , xi yi f a1 , a2 , al ak i 1 N
, am , xi
2 f a1 , a2 , al ak
, am , xi
(138) Therefore, we obtain gk g a1 k a2 a1 a2
gk 0 0 0 am gk a1 , a2 , , am am
(139)
We then obtain g1 g1 g1 0 0 0 a a1 a a2 a am g1 a1 , a2 , , am 2 m 1 g 2 g 2 g 2 0 0 0 a1 a2 am g 2 a1 , a2 , , am a a a 1 2 m g m g m g m 0 0 0 a a1 a a2 a am g m a1 , a2 , , am 2 m 1
(140)
We then obtain the below matrix. g1 a 1 g 2 a1 g m a 1
g1 a2 g 2 a2 g m a2
g1 g a 0 , a2 0 , , am 0 am a1 1 1 g 2 0 0 0 g a , a2 , , am a am 2 2 1 am g m a1 0 , a2 0 , , am 0 g m am
(141)
We then have g1 a 1 a1 g 2 a2 a 1 am g m a 1
g1 a2 g 2 a2 g m a2
1
g1 am g1 a1 0 , a2 0 , , am 0 g 2 0 0 0 g a , a2 , , am am 2 1 0 0 0 g m g m a1 , a2 , , am am
(142)
Kunihiro Suzuki
100
The updating value is then given by a11 a1 0 a1 1 0 a2 a2 a2 1 0 am am am
(143)
If the following hold abs a1 c abs a2 c abs a m c
(144)
the updating value is the ones we need. If they do not hold, we set a1 0 a11 0 1 a2 a2 0 1 am am
(145)
We perform this cycle while Eq. (144) becomes valid.
14. LOGISTIC REGRESSION We apply the above procedure to a logistic curve, which is frequently used. The linear regression line is expressed as y ax b
(146)
The objective value y increases monotonically with increasing an explanatory value of x . We want to consider a two levels categorical objective variable such as yes or no, good or bad, or so on. Then we regard y as the probability for the categorical data and modify the left hand of Eq. (146), and express it as y ln ax b 1 y
(147)
Regression Analysis
101
When we consider the case of yes or no, y corresponds to the probability of yes, and 1-y corresponds to the one of no. Eq. (147) can be solved with respect to y as y
1 1 exp ax b
(148)
1 1 k exp ax
where k eb
(149)
Y has a value range between 0 and 1, and saturates when it approaches to 1. We want to extend this form to have any positive value, and add one more parameter as f a1 , a2 , a3 ; x
a1 1 a2 exp a3 x
(150)
This is the logistic curve. In many cases, the dependence of the objective variable on explanatory variable is not simple, that is, even when the dependence is clear, it saturate for certain large explanatory variable values. A logistic curve is frequently used to express the saturation of phenomenon as mentioned above. We can easily extend this to include various explanatory variables given by
f a1 , a2 , a3i ; xi
a1 1 a2 exp a31 x1 a32 x2
a3m xm
(151)
In this analysis, we focus on one variable, and use Eq. (150). The error
e 2
e 2 for the logistic curve is expressed by
2 1 N yi f a1 , a2 , a3 ; x N i 1
Partial differentiating this with respect to a , we obtain
(152)
Kunihiro Suzuki
102
2 f a1 , a2 , a3 ; xi e 1 N 2 yi f a1 , a2 , a3 ; xi a1 N i 1 a1
1 N 1 2 yi f a1 , a2 , a3 ; xi 0 N i 1 1 a2 exp a3 xi
(153)
Arranging this equation, we obtain
1 N 1 yi f a1 , a2 , a3 ; xi 0 N i 1 1 a2 exp a3 xi We set the left side of Eq. (132) as ga1 a1 , a2 , a3
g a1 a1 , a2 , a3
(154)
, that is,
a1 1 N 1 0 yi N i 1 1 a2 exp a3 xi 1 a2 exp a3 xi
(155)
Performing a similar analysis for the other parameters, we obtain a1 exp a3 xi a1 1 N yi N i 1 1 a2 exp a3 xi 1 a2 exp a3 xi 2
g a2 a1 , a2 , a3
g a3 a1 , a2 , a3
a1a2 xi exp a3 xi a1 1 N yi N i 1 1 a2 exp a3 xi 1 a2 exp a3 xi 2
(156)
(157)
We then have g a1 a1 , a2 , a3 a1 a1 N 1 N 1 1 yi 2 a1 N i 1 1 a2 exp a3 xi N i 1 1 a2 exp a3 xi N 1 1 N i 1 1 a2 exp a3 xi 2
(158)
Regression Analysis
103
g a1 a1 , a2 , a3 a2 a1 N 1 N 1 1 y i 2 a2 N i 1 1 a2 exp a3 xi N i 1 1 a2 exp a3 xi exp a3 xi a1 N 2 1 a2 exp a3 xi exp a3 xi 1 N yi 4 N i 1 1 a2 exp a3 xi 2 N i 1 1 a exp a x 2 3 i 2a N exp a3 xi exp a3 xi 1 N 1 yi N i 1 1 a2 exp a3 xi 2 N i 1 1 a2 exp a3 xi 3
(159)
g a1 a1 , a2 , a3 a3 a1 N 1 N 1 1 y i a3 N i 1 1 a2 exp a3 xi N i 1 1 a2 exp a3 xi 2 a2 xi exp a3 xi a1 N 2a2 xi 1 a2 exp a3 xi exp a3 xi 1 N yi 4 N i 1 1 a2 exp a3 xi 2 N i 1 1 a exp a x 3 i 2 xi exp a3 xi 2a1a2 N xi exp a3 xi a N 2 yi N i 1 1 a2 exp a3 xi 2 N i 1 1 a2 exp a3 xi 3
(160)
g a2 a1 , a2 , a3 a1 a1 exp a3 xi 1 N a12 exp a3 xi 1 N y i a1 N i 1 1 a2 exp a3 xi 2 N i 1 1 a2 exp a3 xi 3 1 N 2a exp a x exp a3 xi 1 N 1 3 i yi 2 N i 1 1 a2 exp a3 xi N i 1 1 a2 exp a3 xi 3
2a N exp a3 xi exp a3 xi 1 N 1 yi N i 1 1 a2 exp a3 xi 2 N i 1 1 a2 exp a3 xi 3
(161)
Kunihiro Suzuki
104 g a2 a1 , a2 , a3 a2
N a1 exp a3 xi 1 N a12 exp a3 xi 1 yi 2 3 N i 1 1 a2 exp a3 xi N i 1 1 a2 exp a3 xi 1 N 2a1 exp 2a3 xi 1 a2 exp a3 xi yi 4 N i 1 1 a2 exp a3 xi 2 2 1 N 3a1 exp 2a3 xi 1 a2 exp a3 xi 6 N i 1 1 a2 exp a3 xi 3a 2 N exp 2a3 xi exp 2a3 xi 2a N 1 1 yi 3 4 N i 1 1 a2 exp a3 xi N i 1 1 a2 exp a3 xi
a2
(162)
g a2 a1 , a2 , a3 a3 a1 exp a3 xi 1 N a12 exp a3 xi 1 N yi a3 N i 1 1 a2 exp a3 xi 2 N i 1 1 a2 exp a3 xi 3 2 N a x exp a x 1 a exp a x 2 a a x exp 2 a x 3 i 2 3 i 3 i 1 a2 exp a3 xi 1 1 i 1 2 i yi 4 N i 1 1 a2 exp a3 xi 3 2 2 2 1 N a1 xi exp a3 xi 1 a2 exp a3 xi 3a1 a2 xi exp 2a3 xi 1 a2 exp a3 xi 6 N i 1 1 a2 exp a3 xi
(163)
xi exp a3 xi 2a1a2 N xi exp 2a3 xi a1 N yi yi 2 N i 1 1 a2 exp a3 xi N i 1 1 a2 exp a3 xi 3 2 N 2 N xi exp a3 xi xi exp 2a3 xi a 3a1 a2 1 N i 1 1 a2 exp a3 xi 3 N i 1 1 a2 exp a3 xi 4
g a3 a1 , a2 , a3 a1 a1a2 xi exp a3 xi 1 N a12 a2 xi exp a3 xi 1 N yi a1 N i 1 1 a2 exp a3 xi 2 N i 1 1 a2 exp a3 xi 3 a2 xi exp a3 xi 1 N 2a1a2 xi exp a3 xi 1 N yi N i 1 1 a2 exp a3 xi 2 N i 1 1 a2 exp a3 xi 3
a2 N
N
y i 1
i
2a a 1 2 2 N 1 a2 exp a3 xi xi exp a3 xi
3 i 1 1 a exp a x 2 3 i N
xi exp a3 xi
(164)
Regression Analysis
105
g a3 a1 , a2 , a3 a2 a1a2 xi exp a3 xi 1 N a12 a2 xi exp a3 xi 1 N yi a2 N i 1 1 a2 exp a3 xi 2 N i 1 1 a2 exp a3 xi 3 2 1 N a1 xi exp a3 xi 1 a2 exp a3 xi 2a1a2 xi exp 2a3 xi 1 a2 exp a3 xi yi 4 N i 1 1 a2 exp a3 xi
(165)
3 2 2 2 1 N a1 xi exp a3 xi 1 a2 exp a3 xi 3a1 a2 xi exp 2a3 xi 1 a2 exp a3 xi 6 N i 1 1 a2 exp a3 xi N xi exp a3 xi xi exp 2a3 xi a N 2a1a2 yi 1 yi N i 1 1 a2 exp a3 xi 2 N i 1 1 a2 exp a3 xi 3 2 N 2 N x exp a x x exp 2 a x a i 3 i i 3 i 3a1 a2 1 N i 1 1 a2 exp a3 xi 3 N i 1 1 a2 exp a3 xi 4
g a3 a1 , a2 , a3 a3 a1a2 xi exp a3 xi 1 N a12 a2 xi exp a3 xi 1 N yi a3 N i 1 1 a2 exp a3 xi 2 N i 1 1 a2 exp a3 xi 3 2 2 2 2 1 N a1a2 xi exp a3 xi 1 a2 exp a3 xi 2a1a2 xi xp 2a3 xi 1 a2 exp a3 xi yi 4 N i 1 1 a2 exp a3 xi 3 2 N a 2 a x 2 exp a x 1 a exp a x 3a 2 a 2 x 2 exp 2a x 1 a exp a x 1 1 2 i 3 i 2 3 i 1 2 i 3 i 2 3 i 6 N i 1 1 a exp a x 2 3 i 2 2 2 N N x exp a x x xp 2 a x 3 i 2a1a2 3 i aa i 1 2 yi yi i N i 1 1 a2 exp a3 xi 2 N i 1 1 a2 exp a3 xi 3
a12 a2 N
3a 2 a 2 1 2 N i 1 1 a exp a x 2 3 i N
xi2 exp a3 xi
3
(166)
i 1 1 a exp a x 2 3 i N
xi2 exp 2a3 xi
4
The incremental variation is given by g1 a1 a1 g 2 a2 a a 1 3 g 3 a1
g1 a2 g 2 a2 g3 a2
1
g1 a3 g1 a1 0 , a2 0 , a3 0 g 2 0 0 0 g 2 a1 , a2 , a3 a3 0 0 0 g3 g3 a1 , a2 , a3 a3
The updating value is then given by
(167)
Kunihiro Suzuki
106 a11 a1 0 a1 1 0 a2 a2 a2 1 0 am am am
(168)
If the following hold abs a1 c abs a2 c abs a m c
(169)
the updating value is the ones we need. If they do not hold, we set a1 0 a11 0 1 a2 a2 0 1 am am
(170)
We perform this cycle while Eq. (144) becomes valid. We can set initial conditions roughly as follows. We can see the saturated value inspecting the data and set it as a10 . Selecting two points of x1 , y1 , and x2 , y2 , we obtain y1
a10 1 a20 exp a30 x1
(171)
y2
a10 1 a20 exp a30 x2
(172)
We then modify them as follows. a10 1 a20 exp a30 x1 y1
(173)
Regression Analysis a10 1 a20 exp a30 x2 y2
107 (174)
Therefore, we obtain
a ln 10 1 ln a20 a30 x1 y1
(175)
a ln 10 1 ln a20 a30 x2 y2
(176)
We then obtain a10 1 y1 1 a30 ln x2 x1 a10 1 y2 a ln a20 ln 10 1 a30 x1 y 1 a a x2 x1 ln 10 1 ln 10 1 x2 x1 y1 x2 x1 y2
(177)
(178)
x2
a10 x2 x1 1 y ln 1 x1
a10 x2 x1 1 y2
Therefore, we obtain x2
a10 x2 x1 1 y a20 1 x1 a10 x2 x1 1 y2
(179)
Kunihiro Suzuki
108
Let us consider the data given in Table 3. The value of y saturate with increasing x as shown in Figure 4. Table 3. Data for logistic curve
x
y 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
3 5 9 14 23 35 55 65 80 90 91 94 97 99 100
100 80 ypredict
y
60 40
a1 = 99.4 a2 = 76.2 a3 = 0.63
20 0 0
5
10
x Figure 4. Logistic cured with regression one.
15
Regression Analysis
109
We set a10 as 100, and use the data set of 5, 23 and 10,90 . We then have a10 1 y 1 a30 ln 1 x2 x1 a10 1 y2 100 1 1 23 ln 10 5 100 1 90 0.68
(180)
x2
a10 x2 x1 1 y a20 1 x1 a10 x2 x1 1 y2 10
100 10 5 1 23 5
(181)
100 10 5 1 90 100.9
Using the initial conditions above, we perform a regression procedure shown in this section, and obtain the regression curve with factors of
a1 99.4 a2 76.2 a 0.63 3
(182)
The data and a regression cure agree well.
15. MATRIX EXPRESSION FOR REGRESSION The regression can be expressed with a matrix form. This gives no new information. However, the expression is simple and one can use the method alternatively. The objective and an explanatory variables are related as follows.
Kunihiro Suzuki
110
yi a0 a1 xi x ei
(183)
where we have n samples. Therefore, all data are related as y1 a0 a1 x1 x e1 y2 a0 a1 x2 x e2 y a a x x e 0 1 n 1 n
(184)
Therefore, the regression analysis is to obtain the factors a0 , a1 . We introduce matrixes and vectors as y1 y Y 2 yn
(185)
1 x1 x 1 x2 x X 1 xn x
(186)
a β 0 a1
(187)
e1 e e 2 en
(188)
Using the matrixes and vectors, the corresponding equation is expressed by
Y Xβ e The total sum of the square of error Q is then expressed by
(189)
Regression Analysis
111
Q eT e Y Xβ Y Xβ T
Y T βT X T Y Xβ
(190)
Y Y Y Xβ β X Y β X Xβ T
T
T
T
T
T
Y T Y 2βT X T Y βT X T Xβ
We obtain the optimum value by minimizing Q . This can be done by vector differentiating (see Appendix 1-10 of volume 3) with respect to β it given by Q T T 2 βT X T Y β X Xβ β β β 2 X T Y 2 X T Xβ 0
(191)
Therefore, we obtain β X T X X TY 1
(192)
SUMMARY To summarize the results in this chapter— The regression line is given by y y
yy 2
x x xy 2 xx
If we want to force the regression to start from 0, a0 , the regression line is given by
y
xy xx 2 yy 2 x y a0 x
2 xx
x2
x a0
The accuracy of regression is evaluated by
Kunihiro Suzuki
112 R2
r 2 xy2 yy 2
2 where r is the variance of Y given by
r 2
2 1 N Yi y N i 1
The error range for each factors are given by
aˆ1 t p n 2
se
2
nS xx 2
a1 aˆ1 t p n 2
1 x2 2 aˆ0 t p 2 se a0 aˆ0 t p n nS xx
se
2
nS xx 2
1 x 2 2 s 2 e n nS xx
The predictive value for the regression Y0 is given by 1 x0 x 2 2 ˆ se Yo Yˆ0 t p n 2 Y0 t p n 2 2 nS xx n
1 x0 x 2 2 se 2 nS xx n
The predictive y value y0 is given by 1 x0 x 2 2 ˆ se yo Yˆ0 t p n 2 Y0 t p n 2 1 2 nS xx n
where
Yˆ0 aˆ0 aˆ1 x0 When the target function is an exponential one, given by
y beax
1 x0 x 2 2 1 se 2 nS xx n
Regression Analysis
113
We converting variables as u x v ln y
We can apply the linear regression to the variables. When y is proportional to power law of x , that is, it is given by
y bx a We converting variables as u ln x v ln y
We can apply the linear regression to the variables. We assume the data are expressed with a power law equation given by
y a0 a1 x a2 x 2
am x m
The factors are given by a1 K x11 a2 K x 21 am K xm1
K x1m K x 22 K xmm
K x12 K x 22 K xm 2
1
K y1 K y2 K ym
where
x x
K xkl
1 N k xi xk N i 1
K yk
1 N yi y xik xk N i 1
l i
l
We assume an arbitrarily function having many factors is given by f a1 , a2 , , am , x We define the following function
Kunihiro Suzuki
114 g1 a1 , a2 ,
, am
g1 a 1 a1 g 2 a2 a 1 am g m a 1
g1 a2 g 2 a2 g m a2
1 N
N
y i 1
i
f a1 , a2 ,
, am , xi
a11 a1 0 a1 1 0 a2 a2 a2 1 0 am am am
We repeat this process until the followings hold. abs a1 c abs a2 c abs a m c
We apply this procedure to the logistic curve. The residual error is evaluated by zek 1 hkk
where hkk is the leverage ratio given by 1 x x hkk k 2 n nS xx
2
a1
1
g1 am g1 a1 0 , a2 0 , , am 0 g 2 0 0 0 g a , a2 , , am am 2 1 0 0 0 g m g m a1 , a2 , , am am
The updating value is then given by
t
f a1 , a2 ,
, am , xi
Regression Analysis The regression can be expressed using a matrix form as β X T X X TY 1
where yi a0 a1 xi x ei
The matrixes and vectors are defined as y1 y Y 2 yn 1 x1 x 1 x2 x X 1 xn x a β 0 a1
e1 e e 2 en
115
Chapter 4
MULTIPLE REGRESSION ANALYSIS ABSTRACT The objective variable is influenced by many kinds of variables in general. We therefore extend the analysis of a regression with one explanatory variable to multiple ones. We predict a theoretical prediction for given multi parameters with their error range. We also discuss the reality of the correlation relationship where the correlation factor is influenced by the other variables. The pure correlation without the influenced factor is also discussed. We also evaluate the effectiveness of the explanatory variables.
Keywords: regression, multiple regression, objective variable, explanatory variable, multicollinearity, partial correlation factor
1. INTRODUCTION We treated one explanatory variable in the previous chapter, and studied the relationship between an explanatory variable and one objective variable. However, the objective variable is influenced by various parameters. For example, the price of a house is influenced by location, house age, site area, etc. Therefore, we need to treat multi explanatory variables to predict the objective variable robustly. We first treat two explanatory variables, and then treat ones of more than two. We treat a sample regression in this chapter. The regression for population can be easily derived by changing sample averages and sample variances to the population’s ones as shown in the previous chapter. We perform matrix operation in this chapter. The basics of matrix operation are described in Chapter 15 of volume 3.
Kunihiro Suzuki
118
2. REGRESSION WITH TWO EXPLANATORY VARIABLES We treat one objective variable y , and two explanatory variables x1 and x2 in this section for simplicity. We derive concrete forms for the two variables and extend it to the one for multiple ones later.
2.1. Regression Line The objective variable yi and two explanatory variables xi1 and xi 2 in population can be related to yi a0 a1 xi1 a2 xi 2 ei
(1)
where a0 , a1 , and a2 are constant factors. We pick up n set data from the population, and want to predict the characteristics of the population. We can decide the sample factors a0 , a1 , and a2 from the sample data, which is not the same as the ones for population, and hence we denote them as aˆ0 , aˆ1 , and aˆ2 . The sample data can be expressed by y1 aˆ0 aˆ1 x11 aˆ2 x12 e1 y2 aˆ0 aˆ1 x21 aˆ2 x22 e2 yn aˆ0 aˆ1 xn1 aˆ2 xn 2 en
(2)
ei is called as residual error. A sample regression value Yi is expressed by Yi aˆ0 aˆ1 xi1 aˆ2 xi 2
(3)
ei can then be expressed by ei yi Yi yi aˆ0 aˆ1 xi1 aˆ2 xi 2
(4) 2
S The variance of ei is denoted as e , and is given by
Multiple Regression Analysis Se 2
119
2 1 n yi aˆ0 aˆ1 xi1 aˆ2 xi 2 n i 1
(5)
2 We decide aˆ0 , aˆ1 , and aˆ2 so that Se has the minimum value. 2 Partial differentiating Se with respect to aˆ0 and set 0 for them, we obtain
Se 1 n 2 yi aˆ0 aˆ1 xi1 aˆ2 xi 2 0 aˆ0 n i 1 2
(6)
We then obtain 1 n 1 n 1 n 1 n ˆ ˆ ˆ ˆ ˆ ˆ y a a x a x y a a x a xi 2 i 0 1 i1 2 i 2 n i1 2 n i 0 1 n i 1 n i 1 i 1 i 1 y aˆ0 aˆ1 x1 aˆ2 x2 0
(7)
Se with respect to aˆ1 and aˆ2 , and set 0 for them, we obtain 2
Partial differentiating
Se 1 n 2 xi1 yi aˆ0 aˆ1 xi1 aˆ2 xi 2 0 aˆ1 n i 1
(8)
Se 1 n 2 xi 2 yi aˆ0 aˆ1 xi1 aˆ2 xi 2 0 aˆ2 n i 1
(9)
2
2
Substituting Eq. (7) into Eqs. (8) and (9), we obtain 1 n xi1 yi y aˆ1 xi1 x1 aˆ2 xi 2 x2 0 n i 1
(10)
1 n xi 2 yi y aˆ1 xi1 x1 aˆ2 xi 2 x2 0 n i 1
(11)
Note that the followings are valid.
Kunihiro Suzuki
120
1 n x1 yi y aˆ1 xi1 x1 aˆ2 xi 2 x2 0 n i 1
(12)
1 n x2 yi y aˆ1 xi1 x1 aˆ2 xi 2 x2 0 n i 1
(13)
Therefore, we obtain 1 n xi1 x1 yi y aˆ1 xi1 x1 aˆ2 xi 2 x2 S1y2 aˆ1S11 2 aˆ2 S12 2 n i 1 0
(14)
1 n xi 2 x2 yi y aˆ1 xi1 x1 aˆ2 xi 2 x2 S2 2y aˆ1S21 2 aˆ2 S22 2 n i 1 0
(15)
where S11
1 n 2 xi1 x1 n i 1
(16)
S22
1 n 2 xi 2 x2 n i 1
(17)
2
2
S12 S21 2
2
1 n xi1 x1 xi 2 x2 n i 1
(18)
S yy
1 n 2 yi y n i 1
(19)
S1y
1 n xi1 x1 yi y n i 1
(20)
S2 y
1 n xi 2 x2 yi y n i 1
(21)
2
2
2
Multiple Regression Analysis
121
Eqs. (14) and (15) are expressed with a matrix form as 2 S12 aˆ1 S1 y 2 2 S22 aˆ2 S2 y
S11 2 2 S 21
2
(22)
Therefore, we can obtain aˆ1 and aˆ2 as aˆ1 S11 ˆ 2 a2 S12 2
S 11 2 21 2 S
1
S1y2 2 S 2y 2 12 2 S S1 y 2 22 2 S S2 y
S12 2 S22 2
(23)
where S11 2 2 22 2 S S12
S 11 2 21 2 S
S
12 2
S12 2 S22 2
1
(24)
Once we obtain the coefficients aˆ1 and aˆ2 from Eq. (23), we can obtain aˆ0 from Eq. (7) as aˆ0 y aˆ1 x1 aˆ2 x2
(25)
2.2. Accuracy of the Regression The accuracy of the regression can be evaluated similarly to the case of one explanatory variable by a determinant factor given by R2
where
Sr
2
S yy 2
r 2 is the variance of
(26)
Y
given by
Kunihiro Suzuki
122 Sr 2
2 1 n Yi y n i 1
(27)
This is the ratio of the variance of the regression line to the variance of
y
. If whole
2
data are on the regression line, that is, the regression is perfect,
S yy
is equal to
2 Sr ,
and
R 1. On the other hand, when the deviation of data from the regression line is significant, the determinant ratio is expected to become small. We further modify the variance of y can also be modified in the previous chapter as 2
1 n 2 yi y n i 1 1 n 2 yi Yi Yi y n i 1 1 n 1 n 1 n 2 2 yi Yi Yi y 2 yi Yi Yi y n i 1 n i 1 n i 1 n n n 1 1 1 2 ei2 Yi y 2 ei a0 a1 xi1 a2 xi 2 n i 1 n i 1 n i 1
S yy 2
Se Sr 2
(28)
2
where we utilize Cov e, x1 Cov e, x2 0
(29)
Therefore, the determination coefficient is reduced to
R2
Sr
2
Se Sr 2
1
Se
2
2
S yy
2
(30)
Consequently, the determination coefficient is the square of the correlation factor, and has a value range between 0 and 1. The multiple correlation factor rmult is defined as the correlation between yi and Yi , and is given by
Multiple Regression Analysis
y n
rmult
i
i 1
y n
i
i 1
y Yi y
y
Y y n
2
123
2
i
i 1
(31)
We can modify the denominator of (31) as Se 2
1 n 2 yi Yi n i 1
2 1 n yi y Yi y n i 1 1 n 1 n 1 n 2 2 yi y Yi y 2 yi y Yi y n i 1 n i 1 n i 1 n 1 2 2 S yy S R 2 yi y Yi y n i 1
(32)
The cross term in Eq. (32) is performed as
1 n 1 yi y Yi y S yy 2 S R 2 Se 2 n i 1 2 1 2 2 2 2 S yy S R S yy S R 2
S R 2
(33)
Then, the multiple correlation factor is given by S R 2
rmult
S yy S R 2
2
S R 2
S yy 2
(34)
2 The square of rmult is the determinant of the multi regression, and is denoted as R and is given by
SR
R
S yy 2
S yy Se 2
2
2
S yy 2
2
1
Se
2
S yy 2
(35)
Kunihiro Suzuki
124
Substituting Eq. (34) into Eq. (35), we can express the determinant factor with a multiple correlation factor as 2 R2 rmult
(36)
The variances used in Eq. (35) are related to the samples, and hence the determinant with unbiased variances are also defined as n R
*2
1
e n
T
Se
2
(37)
2
S yy
2 Three variables a0 , a1 , and a2 are used in Se , and hence the freedom of it e is given
by e n 3
(38)
2 y is used in S yy , and hence the freedom of it T is given by
T n 1
(39)
R*2 is called as a degree of freedom adjusted coefficient of determination for a multiple regression. 2
R*2 can also be related to R as R*2 1
T 1 R2 e
(40)
2.3. Residual Error and Leverage Ratio We evaluate the accuracy of the regression. The residual error can be evaluated as ek yk Yk
(41)
Multiple Regression Analysis
125
This is then normalized by the biased variance and is given by
zek
ek se
(42)
2
se is given by 2
where
se 2
n
e
Se
2
(43)
This follows a standard normal distribution with Further, it is modified using leverage ratio
tk
zek
N 0,12
.
hkk as (44)
1 hkk
This can be used for judging the outlier or not. We should consider the leverage ratio for a multiple regression, which is shown below. The regression data can be expressed by a leverage ratio as Yk hk1 y1 hk 2 y2
hkk yk
hkn yn
(45)
On the other hand, Yk is given by Yk aˆ0 aˆ1 xk 1 aˆ2 xk 2
y aˆ1 x1 aˆ2 x2 aˆ1 xk1 aˆ2 xk 2 y aˆ1 xk1 x1 aˆ2 xk 2 x2
(46)
aˆ y xk1 x1 , xk 2 x2 1 aˆ2
Each term is evaluated below. y
1 n yk n i 1
(47)
Kunihiro Suzuki
126
S 11 2
aˆ a2
xk1 x1 , xk 2 x2 ˆ1 xk1 x1 , xk 2 x2
S
21 2
S 11 2 xk1 x1 , xk 2 x2 21 2 S
S 11 2 xk1 x1 , xk 2 x2 21 2 S
S1y2 2 22 2 S S2 y 1 n xi1 x1 yi y 12 2 S n i 1 n 22 2 S 1 xi 2 x2 yi y n i 1 S
12 2
(48)
1 n n xi1 x1 yi S i 1 n 22 2 S 1 x x y i2 2 i n i 1 12 2
Focusing on the k -th term, we obtain hkk
1 Dk2 n n
(49)
where S 11 2 S 12 2 xk1 x1 xk 2 x2 12 2 S S 22 2 xk 2 x2 11 2 12 2 xk1 x1 S xk 2 x2 S xk1 x x x S 12 2 x x S 22 2 k2 2 k1 1
Dk2 xk1 x1 xk1 x1
xk1 x1 S 2
11 2
2 xk1 x1 xk 2 x2 S
12 2
(50)
xk 2 x2 S 2
22 2
This Dk is called as Mahalanobis’ distance. Sum of the leverage ratio is given by n
1 n Dk2 k 1 n k 1 n n 2 2 1 ij 2 1 xki xi xkj x j S n k 1 i 1 j 1 n
hkk k 1
n xki xi xkj x j S ij 2 1 k 1 n i 1 i 1 2
2
2
2
1 Sij S i 1 i 1
1 2
2
ij 2
(51)
Multiple Regression Analysis
127
The average leverage ratio is then given by
hkk
3 n
(52)
Therefore, we use a critical value for hkk as
hkk hkk
(53)
2 is frequently used, and we evaluate whether it is outlier or not using the value.
When x1 and x2 are independent on each other, the inverse matrix is reduced to
S 11 2 12 2 S
1
S11 2 S12 2 2 22 2 2 S S12 S22 1 0 2 S11 1 0 2 S22 S
12 2
(54)
and we obtain
Dk2
xk1 x1 S11 2
2
xk 2 x2
2
(55)
S22 2
The corresponding leverage ratio is given by 1 xk1 x1 xk 2 x2 2 2 n nS11 nS22 2
hkk
2
(56)
Therefore, we can simply add the terms in the single regression. There are interactions between the explanatory variables, and the expression becomes more complicated in general, and we need to use the corresponding models.
Kunihiro Suzuki
128
2.4. Prediction of Coefficients of a Regression Line When we extract samples, the corresponding regression line varies, that is, the factors
aˆ 0 aˆ1 aˆ ˆ aˆ aˆ , , and 2 vary. We study averages and variances of 0 , a1 , and 2 in this section. We assume the pollution data are given by
yi a0 a1 xi1 a2 xi 2 ei
where
xi1 , xi 2
population.
ei
for i 1,2, , n
is the given variable,
a0
,
(57)
a1
, and a2 are established values for
is independent on each other and follows a normal distribution with
0, . The probability variable in Eq. (57) is then only e . 2
e
i
When we get sample data evaluated as
xi1 , xi 2 , yi
for i 1,2, , n . The coefficients are
S1y2 22 2 2 S S 2y S 11 2 S1y2 S 12 2 S2 2y 21 2 2 S S S 22 2 S 2 1y 2y
aˆ1 S ˆ 21 2 a2 S 11 2
S
12 2
(58)
Each coefficient is expanded as
aˆ1 S 11 2 S1y2 S 12 2 S2 2y 2
S p 1
1 p 2
S py 2
1 n 2 1 p 2 xip x p yi S n i 1 p 1
1 n 2 1 p 2 xip x p a0 a1 xi1 a2 xi 2 ei S n i 1 p 1
(59)
Multiple Regression Analysis
129
aˆ2 S 21 2 S1y2 S 22 2 S2 2y 2
S
2 p 2
p 1
S py 2
1 n 2 2 p 2 xip x p yi S n i 1 p 1
1 n 2 2 p 2 xip x p a0 a1 xi1 a2 xi 2 ei S n i 1 p 1
(60)
aˆ0 y aˆ1 x1 aˆ2 x2
1 n 1 n 2 1 p 2 1 n 2 2 p 2 y x S x x y xip x p yi 1 x2 S i ip p i n i 1 n i 1 p 1 n i 1 p 1
2 2 1 n 1 p 2 2 p 2 1 x S x x x xip x p yi 1 ip p 2S n i 1 p 1 p 1
2 2 1 n jp 2 xip x p yi 1 x j S n i 1 j 1 p 1
2 2 1 n jp 2 xip x p a0 a1 xi1 a2 xi 2 ei 1 x j S n i 1 j 1 p 1
We notice that only
E aˆ1
ei
is a probability variable, and evaluate the expected value as
1 n 2 1 p 2 xip x p a0 a1 xi1 a2 xi 2 ei S n i 1 p 1
a0
1 n 2 1 p 2 xip x p S n i 1 p 1
a1
1 n 2 1 p 2 xip x p xi1 S n i 1 p 1
a2
1 n 2 1 p 2 xip x p xi 2 S n i 1 p 1
(61)
1 n 2 1 p 2 xip x p ei S n i 1 p 1 2
a1 S p 1
a1
1 p 2
2
S p1 a2 S 2
p 1
1 p 2
S p 2 2
(62)
Kunihiro Suzuki
130 where 2
S
ip 2
p 1
S pj ij
E aˆ2
2
(63)
1 n 2 2 p 2 xip x p a0 a1 xi1 a2 xi 2 ei S n i 1 p 1
a0
1 n 2 2 p 2 S xip x p n i 1 p 1
a1
1 n 2 2 p 2 S xip x p xi1 n i 1 p 1
a2
1 n 2 2 p 2 xip x p xi 2 S n i 1 p 1
(64)
1 n 2 2 p 2 xip x p ei S n i 1 p 1 2
a1 S p 1
2 p 2
2
S p1 a2 S 2
p 1
2 p 2
S p 2 2
a2
E aˆ0
2 2 1 n jp 2 xip x p a0 a1 xi1 a2 xi 2 ei 1 x j S n i 1 j 1 p 1
a0 a1
2 2 1 n jp 2 xip x p xi1 1 x j S n i 1 j 1 p 1
a2
2 2 1 n jp 2 1 x xip x p xi 2 jS n i 1 j 1 p 1
a0
2 2 1 n jp 2 xip x p 1 x j S n i 1 j 1 p 1
(65)
2 2 1 n jp 2 xip x p ei 1 x j S n i 1 j 1 p 1
2 2 1 n jp 2 1 x xip x p jS n i 1 j 1 p 1
n 2 n 2 1p 2 2p 2 a0 a0 x1 S xip x p a0 x2 S xip x p i 1 p 1 i 1 p 1 2 n 2 n 1p 2 2p 2 a0 a0 x1 S xip x p a0 x2 S xip x p i 1 i 1 p 1 p 1 a0
(66)
Multiple Regression Analysis
a1
131
2 2 1 n jp 2 xip x p xi1 1 x j S n i 1 j 1 p 1
a1 x1 a1
1 n 2 1 p 2 1 n 2 2p 2 xip x p xi1 a1 x2 S xip x p xi1 x1 S n i 1 p 1 n i 1 p 1 2
a1 x1 a1 x1 S
1 p 2
p 1
a1 x1 a1 x1 S
11 2
2
S p1 a1 x2 S 2
2 p 2
p 1
(67)
S p1 2
S11 2
0
a2
2 2 1 n jp 2 xip x p xi 2 1 x j S n i 1 j 1 p 1 2
a2 x2 a2 x1 S
1 p 2
p 1 2
a2 x2 a2 x1 S
n
i 1
1 p 2
p 1
a2 x2 a2 x2 S
x
22 2
ip
x p xi 2 a2 x2 S 2
p 1
2
S p 2 a2 x2 S 2
2 p 2
2 p 2
p 1
x n
i 1
ip
x p xi 2
(68)
S p 2 2
2
S22
a2 x2 a2 x2 0 2 2 1 n jp 2 xip x p ei 0 1 x j S n i 1 j 1 p 1
(69)
Therefore, we obtain E aˆ0 a0
(70)
We then evaluate variances of factors. 2
n 1 2 1p 2 V aˆ1 S xip x p V ei n i 1 p 1 2
1 1 n 11 2 S xi1 x1 S 12 2 xi 2 x2 V ei n n i 1 1 1 n 11 2 2 2 S xi1 x1 2S 12 2 S 11 2 xi1 x1 xi 2 x2 S 12 2 n n i 1 2 1 11 2 2 2 12 2 11 2 2 12 2 2 2 S S11 2S S S12 S S22 s e n 1 11 2 11 2 2 12 2 12 2 2 2 2 S 11 2 S 12 2 S 12 2 S 22 S S S11 S S21 S se n
S
11 2
n
se
2
x 2
i2
2 2 x2 se
(71)
Kunihiro Suzuki
132 2
n 1 2 2p 2 V aˆ2 S xip x p V e n i 1 p 1 2
1 1 n 21 2 S xi1 x1 S 22 2 xi 2 x2 se 2 n n i 1
1 1 n 21 2 2 2 S xi 2 x2 2S 21 2 S 22 2 xi1 x1 xi 2 x2 S 22 2 n n i 1 2 1 21 2 2 2 21 2 22 2 2 22 2 2 2 S S22 2S S S12 S S22 s e n 1 21 2 21 2 2 22 2 2 22 2 21 2 2 22 2 2 2 S S S2 2 S S12 S S S12 S S22 se n
S
22 2
n
se
x 2
i2
2
2 2 x2 se
(72)
2
2 2 1 n jp 2 V aˆ0 V 1 x j S xip x p a0 a1 xi1 a2 xi 2 ei n i 1 j 1 p 1 2 n 1 2 2 jp 2 1 x j S xip x p V e i 1 n j 1 p 1 2 2 1 1 n 1 x j S j1 2 xi1 x1 S j 2 2 xi 2 x2 V e n n i 1 j 1 n 2 11 11 2 12 2 1 x1 S xi1 x1 S xi 2 x2 x2 S 21 2 xi1 x1 S 22 2 xi 2 x2 se 2 n n i 1 2 11 2 xi1 x1 S 12 2 xi 2 x2 x2 S 21 2 xi1 x1 S 22 2 xi 2 x2 1 1 n 1 x1 S 2 se n n i 1 2 x S 11 2 x x S 12 2 x x x S 21 2 x x S 22 2 x x i1 1 i 2 2 2 i1 1 i 2 2 1 2 2 2 11 2 xi1 x1 S 12 2 xi 2 x2 x22 S 21 2 xi1 x1 S 22 2 xi 2 x2 2 1 1 n x1 S 1 se n n i 1 2 x x S 11 2 x x S 12 2 x x S 21 2 x x S 22 2 x x 1 2 i1 1 i2 2 i1 1 i2 2 2 2 11 2 2 2 2 12 2 11 2 12 2 x S x x S x x 2 S S x x x x 1 i1 1 i2 2 i1 1 i2 2 2 1 1 1 n 2 21 2 2 2 2 22 2 21 2 22 2 xi1 x1 S xi 2 x2 2S S xi1 x1 xi 2 x2 se 2 x2 S n n n i 1 S 11 2 S 21 2 xi1 x1 2 S 11 2 S 22 2 xi1 x1 xi 2 x2 2 x1 x2 S 12 2 S 21 2 xi 2 x2 xi1 x1 S 12 2 S 22 2 xi 2 x2 xi 2 x2
2 2 x 2 S 11 2 S 2 S 12 2 S 2 2S 11 2 S 12 2 S 2 1 11 22 12 2 2 1 1 22 x2 2 S 21 2 S11 2 S 22 2 S22 2S 21 2 S 22 2 S12 2 se 2 n n S 11 2 S 21 2 S11 2 S 22 2 S12 2 2 x x 1 2 S 12 2 S 21 2 S 2 S 22 2 S 2 21 22
1 1 11 2 22 2 12 2 2 x12 S x2 2 S 2 x1 x2 S se n n 1 1 2 11 2 2 22 2 12 2 21 2 2 x1 x2 S x1 S x2 S S se n n
2 2 1 jl 2 2 1 x j xl S se n j 1 l 1
1 1 x1 n
S 11 2 x2 21 2 S
S S
12 2 22 2
x1 2 se x2
(73)
Multiple Regression Analysis
133
where we utilize the following relationship in the last derivation step.
S
12 2
S
21 2
(74)
The covariance between aˆ1 and aˆ2 is given by
Cov aˆ1 , aˆ2
2 2 p ' 2 1 1 n 2 1 p 2 xip ' x p ' V e S xip x p S n n i 1 p 1 p '1 1 1 n 11 2 2 12 2 21 2 22 2 S xi1 x1 S xi 2 x2 S xi1 x1 S xi 2 x2 se n n i 1
2 11 2 21 2 11 2 22 2 1 1 n S S xi1 x1 S S xi1 x1 xi 2 x2 2 se n n i 1 S 12 2 S 21 2 x x x x S 12 2 S 22 2 x x 2 i2 2 i1 1 i2 2 11 2 21 2 2 22 2 2 1 S S S11 S S12 2 se n S 12 2 S 21 2 S 2 S 22 2 S 2 21 22
1 12 2 2 S se n (75) Similarly, we obtain
Cov aˆ2 , aˆ1
1 21 2 2 S se n
(76)
V aˆ j , aˆl is generally expressed by
Consequently, the
V aˆ j
Cov aˆ j , aˆl
1 jl 2 2 S se n
The covariance between
and
aˆ0 and aˆ1 is given by
(77)
Kunihiro Suzuki
134 Cov aˆ0 , aˆ1
2 2 2 1p 2 11 n jp 2 xip x p S xip x p V e 1 x j S n n i 1 j 1 p 1 p 1 2 11 n j1 2 xi1 x1 S j 2 2 xi 2 x2 S 11 2 xi1 x1 S 12 2 xi 2 x2 se 2 1 x j S n n i 1 j 1
11 2 xi1 x1 S 12 2 xi 2 x2 1 1 n 1 x1 S 11 2 xi1 x1 S 12 2 xi 2 x2 se 2 S n n i 1 x S 21 2 x x S 22 2 x x i1 1 i2 2 2 11 2 12 2 xi1 x1 S xi 2 x2 11 2 1 1 n x1 S xi1 x1 S 12 2 xi 2 x2 se 2 S n n i 1 x S 21 2 x x S 22 2 x x i1 1 i2 2 2 2 11 2 11 2 12 2 11 2 x S S x x S S x i1 1 i 2 x2 xi1 x1 1 21 2 11 2 xi1 x1 2 S 22 2 S 11 2 xi 2 x2 xi1 x1 1 1 n x2 S S 2 se 2 11 2 12 2 12 2 12 2 n n i 1 x S S xi1 x1 xi 2 x2 S S xi 2 x2 1 x S 21 2 S 12 2 x x x x S 22 2 S 12 2 x x 2 i1 1 i2 2 i2 2 2 11 2 11 2 2 12 2 11 2 2 x1 S S S11 S S S 21 x S 21 2 S 11 2 S 2 S 22 2 S 11 2 S 2 11 21 1 2 2 se 11 2 12 2 2 12 2 12 2 2 n x S S S S S S 1 12 22 21 2 12 2 2 22 2 12 2 2 x2 S S S12 S S S 22 1 11 2 12 2 2 x1 S x2 S se n 1 11 2 12 2 x S S 1 n x2
(78) Similarly, we obtain
1 21 2 22 2 2 x1 S x2 S se n 1 21 2 22 2 x 2 S S 1 se n x2
Cov aˆ0 , aˆ2
(79)
Consequently, the Cov aˆ0 , aˆ j is generally expressed by
1 Cov aˆ0 , aˆ j n
2
x S l
l 1
jl 2
2 se
(80)
Multiple Regression Analysis
aˆ
We assume that 1 and normalized variables. The parameter given by
t
aˆi ai
V ai
aˆ 0
135
follow normal distributions, and hence assume the
for i 0,1, 2 (81)
follows a t distribution function with a freedom of e , where e n 3
(82)
Therefore, the predictive range is given by
aˆi t p e ; P V aˆi ai aˆi t p e ; P V aˆi
(83)
2.5. Estimation of a Regression Line
x , x , y We assume that we obtain n set of data i1 i 2 i . The corresponding regression value is given by Y aˆ0 aˆ1 x1 aˆ2 x2
(84)
The expected value is given by E Y E aˆ0 E aˆ1 x1 E aˆ2 x2 a0 a1 x1 a2 x2
The variance of Y is given by
(85)
Kunihiro Suzuki
136 V Y V aˆ0 aˆ1 x1 aˆ2 x2 2
2
2
V aˆ0 2 x j Cov aˆ0 , aˆ j x j xl Cov aˆ j , aˆl j 1
j 1 l 1
2 2 2 1 2 2 1 2 2 jl 2 2 jl 2 2 jl 2 2 1 x j xl S se x j xl S se x j xl S se n n j 1 l 1 n j 1 l 1 j 1 l 1
2 2 1 jl 2 2 1 x j xl 2 x j xl x j xl S se n j 1 l 1
(86)
Let us consider the term below. 2
2
2 x x S j 1 l 1 2
jl 2
j l
2
x j xl S
jl 2
j 1 l 1
2
2
x j xl S
jl 2
j 1 l 1
(87)
The second term is modified as 2
2
x j xl S
jl 2
j 1 l 1
2
2
xl x j S
lj 2
j 1 l 1 2
2
xl x j S
jl 2
j 1 l 1
(88)
where we utilize the relationship of S
lj 2
S
jl 2
(89)
Therefore, we obtain 2
2
2 x x S j 1 l 1
j l
jl 2
x j xl xl x j S 2
2
jl 2
j 1 l 1
Substituting Eq. (90) into Eq. (86), we obtain
(90)
Multiple Regression Analysis V Y
137
2 2 1 jl 2 2 1 x j xl 2 x j xl x j xl S se n j 1 l 1
2 2 1 jl 2 2 1 x j xl x j xl xl x j x j xl S se n j 1 l 1
2 2 1 jl 2 2 1 x j x j xl xl S se n j 1 l 1
1 1 x1 x1 n 1 2 1 D 2 se n
S 11 2 x2 x2 21 2 S
x1 x1 2 se 22 2 S x2 x2 S
12 2
(91)
where D2 x1 x1
S 11 2 x2 x2 21 2 S
S S
12 2 22 2
x1 x1 x2 x2
(92)
Therefore, the variable given by t
Y E Y V Y
aˆ0 aˆ1 x1 aˆ2 x2 a0 a1 x1 a2 x2 1 D 2 2 se n n
(93)
follows a t distribution. We can then estimate the value range of the regression given by 1 D2 2 1 D2 2 aˆ0 aˆ1 x1 aˆ2 x2 t p e ; P se Y aˆ0 aˆ1 x1 aˆ2 x2 t p e ; P se n n n n
(94) The objective variable range is given by y aˆ0 aˆ1 x1 aˆ2 x2 e
and the corresponding range is given by
(95)
Kunihiro Suzuki
138
1 D 2 2 1 D 2 2 ˆ ˆ ˆ aˆ0 aˆ1 x1 aˆ2 x2 t e , P 1 s y a a x a x t , P e 1 se 0 1 1 2 2 e n n n n
(96)
2.6. Selection of Explanatory Variables We used two explanatory variables. However, we want to know how each variable contributes to express the subject variable. We want to know the validity of each explanatory variable. One difficulty exists that with increasing number of explanatory variable, the determinant factor increases. Therefore, we cannot use the determinant factor to evaluate the validity. We start with regression without explanatory variable, and denote it as model 0. The regression is given by Model 0 : Yi y
(97) Se M 0 2
The corresponding variance Se M 0 2
is given by
1 n 1 n 2 2 yi Yi yi y S yy 2 n i 1 n i 1
(98)
In the next step, we evaluate the validity of x1 and x2 , and denote the model as model 1. The regression using explanatory variable is given by xl
Yi a0 a1 xil1
(99)
The corresponding variance Se2M 1 is given by Se M 1 2
2 1 n 1 n 2 yi Yi yi a0 a1 xil1 n i 1 n i 1
Then the variable

$$F_1 = \frac{\left(nS_{e\left(M0\right)}^2 - nS_{e\left(M1\right)}^2\right)/\left(\phi_{e\left(M0\right)} - \phi_{e\left(M1\right)}\right)}{nS_{e\left(M1\right)}^2/\phi_{e\left(M1\right)}} \qquad (101)$$

follows an $F$ distribution $F\left(\phi_{e\left(M0\right)} - \phi_{e\left(M1\right)}, \phi_{e\left(M1\right)}\right)$, where the degrees of freedom $\phi_{e\left(M0\right)}$ and $\phi_{e\left(M1\right)}$ are given by

$$\phi_{e\left(M0\right)} = n - 1 \qquad (102)$$

$$\phi_{e\left(M1\right)} = n - 2 \qquad (103)$$

We can judge the validity of the explanatory variable as

$$\begin{cases} F_1 \ge F\left(\phi_{e\left(M0\right)} - \phi_{e\left(M1\right)}, \phi_{e\left(M1\right)}\right) & \text{valid}\\ F_1 < F\left(\phi_{e\left(M0\right)} - \phi_{e\left(M1\right)}, \phi_{e\left(M1\right)}\right) & \text{invalid} \end{cases} \qquad (104)$$

We evaluate $F_1$ for $x_1$ and $x_2$, that is, for $l_1 = 1, 2$. If both values of $F_1$ are invalid, we use model 0 and the process ends. If at least one $F_1$ is valid, we select the larger one, and the model 1 regression is determined as

$$Y_i = a_0 + a_1 x_{il_1} \qquad (105)$$

where

$$l_1 = 1 \text{ or } 2 \qquad (106)$$
depending on the value of the corresponding $F_1$. We then evaluate the second variable and denote the model as model 2. The corresponding regression is given by

$$Y_i = a_0 + a_1 x_{il_1} + a_2 x_{il_2} \qquad (107)$$

The corresponding variance is given by

$$S_{e\left(M2\right)}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - Y_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - a_0 - a_1 x_{il_1} - a_2 x_{il_2}\right)^2 \qquad (108)$$

Then the variable

$$F_2 = \frac{\left(nS_{e\left(M1\right)}^2 - nS_{e\left(M2\right)}^2\right)/\left(\phi_{e\left(M1\right)} - \phi_{e\left(M2\right)}\right)}{nS_{e\left(M2\right)}^2/\phi_{e\left(M2\right)}} \qquad (109)$$

follows an $F$ distribution $F\left(\phi_{e\left(M1\right)} - \phi_{e\left(M2\right)}, \phi_{e\left(M2\right)}\right)$, where $\phi_{e\left(M1\right)}$ and $\phi_{e\left(M2\right)}$ are given by

$$\phi_{e\left(M1\right)} = n - 2 \qquad (110)$$

$$\phi_{e\left(M2\right)} = n - 3 \qquad (111)$$

We can judge the validity of the explanatory variable as

$$\begin{cases} F_2 \ge F\left(\phi_{e\left(M1\right)} - \phi_{e\left(M2\right)}, \phi_{e\left(M2\right)}\right) & \text{valid}\\ F_2 < F\left(\phi_{e\left(M1\right)} - \phi_{e\left(M2\right)}, \phi_{e\left(M2\right)}\right) & \text{invalid} \end{cases} \qquad (112)$$

If $F_2$ is invalid, we use model 1 and the process ends. If $F_2$ is valid, we use the two variables as the explanatory ones.
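The comparison of Eqs. (101)–(104) is a one-line F test once the two models are fitted. A minimal sketch (the helper name `f_step` and the SciPy dependency are our own choices); the same helper applies unchanged to the model-1 versus model-2 comparison of Eqs. (109)–(112):

```python
import numpy as np
from scipy import stats

def f_step(y, Y0, Y1, phi0, phi1, P=0.95):
    """Sketch of Eqs. (101)-(104): compare a smaller model (fitted values Y0,
    freedom phi0) with a larger one (Y1, phi1); names are our own."""
    nS0 = np.sum((y - Y0) ** 2)          # n * S_e(M0)^2
    nS1 = np.sum((y - Y1) ** 2)          # n * S_e(M1)^2
    F = ((nS0 - nS1) / (phi0 - phi1)) / (nS1 / phi1)
    return F, F >= stats.f.ppf(P, phi0 - phi1, phi1)
```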
3. REGRESSION WITH MORE THAN TWO EXPLANATORY VARIABLES

We move on to an analysis where the number of explanatory variables is larger than two; we denote the number by $m$. The forms for the multiple regression are basically derived as in the previous section, so we omit some parts of the derivation in this section.

3.1. Derivation of Regression Line

We pick up $n$ sets of data from the population, and want to predict the characteristics of the population. We can decide the sample factors $a_0, a_1, a_2, \ldots, a_m$ from the sample data. They are not the same as the ones for the population, and hence we denote them as $\hat a_0, \hat a_1, \ldots, \hat a_m$. The sample data can be expressed by

$$\begin{aligned}
y_1 &= \hat a_0 + \hat a_1 x_{11} + \hat a_2 x_{12} + \cdots + \hat a_m x_{1m} + e_1\\
y_2 &= \hat a_0 + \hat a_1 x_{21} + \hat a_2 x_{22} + \cdots + \hat a_m x_{2m} + e_2\\
&\vdots\\
y_n &= \hat a_0 + \hat a_1 x_{n1} + \hat a_2 x_{n2} + \cdots + \hat a_m x_{nm} + e_n
\end{aligned} \qquad (113)$$

where $e_i$ is a residual error. A sample regression value $Y_i$ is expressed by

$$Y_i = \hat a_0 + \hat a_1 x_{i1} + \hat a_2 x_{i2} + \cdots + \hat a_m x_{im} \qquad (114)$$

$e_i$ can then be expressed by

$$e_i = y_i - Y_i = y_i - \left(\hat a_0 + \hat a_1 x_{i1} + \hat a_2 x_{i2} + \cdots + \hat a_m x_{im}\right) \qquad (115)$$
The variance of $e_i$ is denoted as $S_e^2$, and is given by

$$S_e^2 = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - \left(\hat a_0 + \hat a_1 x_{i1} + \hat a_2 x_{i2} + \cdots + \hat a_m x_{im}\right)\right]^2 \qquad (116)$$

We decide $\hat a_0, \hat a_1, \ldots, \hat a_m$ so that $S_e^2$ has the minimum value. Partially differentiating $S_e^2$ with respect to $\hat a_0, \hat a_1, \ldots, \hat a_m$ and setting the derivatives to zero, we obtain

$$\begin{pmatrix}
S_{11}^{(2)} & S_{12}^{(2)} & \cdots & S_{1m}^{(2)}\\
S_{21}^{(2)} & S_{22}^{(2)} & \cdots & S_{2m}^{(2)}\\
\vdots & \vdots & \ddots & \vdots\\
S_{m1}^{(2)} & S_{m2}^{(2)} & \cdots & S_{mm}^{(2)}
\end{pmatrix}\begin{pmatrix} \hat a_1\\ \hat a_2\\ \vdots\\ \hat a_m \end{pmatrix} = \begin{pmatrix} S_{1y}^{(2)}\\ S_{2y}^{(2)}\\ \vdots\\ S_{my}^{(2)} \end{pmatrix} \qquad (117)$$

Therefore, we can obtain the coefficients as

$$\begin{pmatrix} \hat a_1\\ \hat a_2\\ \vdots\\ \hat a_m \end{pmatrix} = \begin{pmatrix}
S_{11}^{(2)} & \cdots & S_{1m}^{(2)}\\
\vdots & \ddots & \vdots\\
S_{m1}^{(2)} & \cdots & S_{mm}^{(2)}
\end{pmatrix}^{-1}\begin{pmatrix} S_{1y}^{(2)}\\ \vdots\\ S_{my}^{(2)} \end{pmatrix} = \begin{pmatrix}
S^{11(2)} & \cdots & S^{1m(2)}\\
\vdots & \ddots & \vdots\\
S^{m1(2)} & \cdots & S^{mm(2)}
\end{pmatrix}\begin{pmatrix} S_{1y}^{(2)}\\ \vdots\\ S_{my}^{(2)} \end{pmatrix} \qquad (118)$$

where

$$\begin{pmatrix}
S^{11(2)} & \cdots & S^{1m(2)}\\
\vdots & \ddots & \vdots\\
S^{m1(2)} & \cdots & S^{mm(2)}
\end{pmatrix} \equiv \begin{pmatrix}
S_{11}^{(2)} & \cdots & S_{1m}^{(2)}\\
\vdots & \ddots & \vdots\\
S_{m1}^{(2)} & \cdots & S_{mm}^{(2)}
\end{pmatrix}^{-1} \qquad (119)$$

After obtaining $\hat a_1, \ldots, \hat a_m$, we can decide $\hat a_0$ as

$$\hat a_0 = \bar y - \hat a_1 \bar x_1 - \hat a_2 \bar x_2 - \cdots - \hat a_m \bar x_m \qquad (120)$$
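Eqs. (117)–(120) translate into one linear solve; a minimal sketch (the function name is our own):

```python
import numpy as np

def fit_coefficients(x, y):
    """Sketch of Eqs. (117)-(120): solve the normal equations written in
    covariance form."""
    S = np.cov(x, rowvar=False, bias=True)        # S_jl^(2), Eq. (117)
    Sxy = np.array([np.mean((x[:, j] - x[:, j].mean()) * (y - y.mean()))
                    for j in range(x.shape[1])])  # S_jy^(2)
    a = np.linalg.solve(S, Sxy)                   # Eq. (118)
    a0 = y.mean() - a @ x.mean(axis=0)            # Eq. (120)
    return a0, a
```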
3.2. Accuracy of the Regression

The accuracy of the regression can be evaluated, similarly to the case of two explanatory variables, by the determination coefficient given by

$$R^2 = \frac{S_r^2}{S_{yy}^2} \qquad (121)$$

where $S_r^2$ is the variance of $Y$ given by

$$S_r^2 = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \bar y\right)^2 \qquad (122)$$

The variance of $y$ has the same decomposition as the one with two explanatory variables:

$$S_{yy}^2 = S_e^2 + S_r^2 \qquad (123)$$

Therefore, the determination coefficient is reduced to

$$R^2 = \frac{S_r^2}{S_e^2 + S_r^2} = 1 - \frac{S_e^2}{S_{yy}^2} \qquad (124)$$
Consequently, the determination coefficient is the square of a correlation factor, as shown below, and has a value range between 0 and 1. The multiple correlation factor $r_{\text{mult}}$ is defined as the correlation between $y_i$ and $Y_i$, and is given by

$$r_{\text{mult}} = \frac{\sum_{i=1}^{n}\left(y_i - \bar y\right)\left(Y_i - \bar y\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i - \bar y\right)^2}\sqrt{\sum_{i=1}^{n}\left(Y_i - \bar y\right)^2}} \qquad (125)$$

We can modify the numerator of Eq. (125) as follows. First,

$$\begin{aligned}
S_e^2 &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - Y_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left[\left(y_i - \bar y\right) - \left(Y_i - \bar y\right)\right]^2\\
&= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar y\right)^2 + \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \bar y\right)^2 - \frac{2}{n}\sum_{i=1}^{n}\left(y_i - \bar y\right)\left(Y_i - \bar y\right)\\
&= S_{yy}^2 + S_r^2 - \frac{2}{n}\sum_{i=1}^{n}\left(y_i - \bar y\right)\left(Y_i - \bar y\right)
\end{aligned} \qquad (126)$$

The cross term in Eq. (126) is therefore

$$\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar y\right)\left(Y_i - \bar y\right) = \frac{1}{2}\left(S_{yy}^2 + S_r^2 - S_e^2\right) = \frac{1}{2}\left(S_{yy}^2 + S_r^2 - S_{yy}^2 + S_r^2\right) = S_r^2 \qquad (127)$$

Then, the multiple correlation factor takes the same form as with two explanatory variables:

$$r_{\text{mult}} = \frac{S_r^2}{\sqrt{S_{yy}^2 S_r^2}} = \sqrt{\frac{S_r^2}{S_{yy}^2}} \qquad (128)$$

The square of $r_{\text{mult}}$ is the determination coefficient of the multiple regression, denoted as $R^2$ and given by

$$R^2 = \frac{S_r^2}{S_{yy}^2} = \frac{S_{yy}^2 - S_e^2}{S_{yy}^2} = 1 - \frac{S_e^2}{S_{yy}^2} \qquad (129)$$

Comparing Eq. (128) with Eq. (129), we can express the determination coefficient with the multiple correlation factor as

$$R^2 = r_{\text{mult}}^2 \qquad (130)$$
The variances used in Eq. (129) are related to the samples; a determination coefficient with unbiased variances is therefore also defined as

$$R^{*2} = 1 - \frac{\dfrac{n}{\phi_e}S_e^2}{\dfrac{n}{\phi_T}S_{yy}^2} \qquad (131)$$

where

$$\phi_e = n - m - 1 \qquad (132)$$

and

$$\phi_T = n - 1 \qquad (133)$$

The validity of the regression can be evaluated with the parameter

$$F = \frac{s_r^2}{s_e^2} \qquad (134)$$

where

$$s_r^2 = \frac{nS_r^2}{m} \qquad (135)$$

$$s_e^2 = \frac{nS_e^2}{n - m - 1} \qquad (136)$$

We also evaluate the critical $F$ value $F_p\left(m, n - m - 1\right)$, and judge as

$$\begin{cases} F \le F_p\left(m, n - m - 1\right) & \text{invalid}\\ F > F_p\left(m, n - m - 1\right) & \text{valid} \end{cases} \qquad (137)$$
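A minimal sketch collecting Eqs. (121)–(137) (the function name is our own):

```python
import numpy as np
from scipy import stats

def regression_validity(y, Y, m, P=0.95):
    """Sketch of Eqs. (121)-(137): determination coefficient, its unbiased
    version, and the F test of the regression."""
    n = len(y)
    Se2 = np.mean((y - Y) ** 2)                  # residual variance
    Syy2 = np.mean((y - y.mean()) ** 2)
    Sr2 = Syy2 - Se2                             # Eq. (123)
    R2 = Sr2 / Syy2                              # Eq. (121)
    R2_adj = 1 - (n * Se2 / (n - m - 1)) / (n * Syy2 / (n - 1))  # Eq. (131)
    F = (n * Sr2 / m) / (n * Se2 / (n - m - 1))  # Eq. (134)
    return R2, R2_adj, F, F > stats.f.ppf(P, m, n - m - 1)  # Eq. (137)
```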
3.3. Residual Error and Leverage Ratio

We evaluate the accuracy of the regression. The residual error can be evaluated as

$$e_k = y_k - Y_k \qquad (138)$$

This is then normalized by the unbiased variance as

$$z_{ek} = \frac{e_k}{\sqrt{s_e^2}} \qquad (139)$$

where $s_e^2$ is given by

$$s_e^2 = \frac{n}{\phi_e}S_e^2 \qquad (140)$$

This follows a standard normal distribution $N\left(0, 1^2\right)$. Further, it is modified as

$$t_k = \frac{z_{ek}}{\sqrt{1 - h_{kk}}} \qquad (141)$$

This can be used to judge whether a data point is an outlier or not. We should consider the leverage ratio $h_{kk}$ for a multiple regression. The regression value can be expressed with the leverage ratios as

$$Y_k = h_{k1} y_1 + h_{k2} y_2 + \cdots + h_{kk} y_k + \cdots + h_{kn} y_n \qquad (142)$$

On the other hand, $Y_k$ is given by

$$\begin{aligned}
Y_k &= \hat a_0 + \hat a_1 x_{k1} + \hat a_2 x_{k2} + \cdots + \hat a_m x_{km}\\
&= \bar y - \hat a_1 \bar x_1 - \cdots - \hat a_m \bar x_m + \hat a_1 x_{k1} + \cdots + \hat a_m x_{km}\\
&= \bar y + \hat a_1\left(x_{k1} - \bar x_1\right) + \hat a_2\left(x_{k2} - \bar x_2\right) + \cdots + \hat a_m\left(x_{km} - \bar x_m\right)\\
&= \bar y + \begin{pmatrix} x_{k1} - \bar x_1 & x_{k2} - \bar x_2 & \cdots & x_{km} - \bar x_m \end{pmatrix}\begin{pmatrix} \hat a_1\\ \hat a_2\\ \vdots\\ \hat a_m \end{pmatrix}
\end{aligned} \qquad (143)$$
Each term is evaluated below. First,

$$\bar y = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad (144)$$

which contains $y_k$ with the coefficient $1/n$. Next, using Eq. (118),

$$\begin{aligned}
&\begin{pmatrix} x_{k1} - \bar x_1 & \cdots & x_{km} - \bar x_m \end{pmatrix}\begin{pmatrix} \hat a_1\\ \vdots\\ \hat a_m \end{pmatrix}\\
&= \begin{pmatrix} x_{k1} - \bar x_1 & \cdots & x_{km} - \bar x_m \end{pmatrix}\begin{pmatrix}
S^{11(2)} & \cdots & S^{1m(2)}\\
\vdots & \ddots & \vdots\\
S^{m1(2)} & \cdots & S^{mm(2)}
\end{pmatrix}\begin{pmatrix} S_{1y}^{(2)}\\ \vdots\\ S_{my}^{(2)} \end{pmatrix}\\
&= \begin{pmatrix} x_{k1} - \bar x_1 & \cdots & x_{km} - \bar x_m \end{pmatrix}\begin{pmatrix}
S^{11(2)} & \cdots & S^{1m(2)}\\
\vdots & \ddots & \vdots\\
S^{m1(2)} & \cdots & S^{mm(2)}
\end{pmatrix}\begin{pmatrix} \dfrac{1}{n}\sum_{i=1}^{n}\left(x_{i1} - \bar x_1\right)y_i\\ \vdots\\ \dfrac{1}{n}\sum_{i=1}^{n}\left(x_{im} - \bar x_m\right)y_i \end{pmatrix}
\end{aligned} \qquad (145)$$

where we use the relationship $\frac{1}{n}\sum_i\left(x_{ij} - \bar x_j\right)\left(y_i - \bar y\right) = \frac{1}{n}\sum_i\left(x_{ij} - \bar x_j\right)y_i$.
Focusing on the coefficient of $y_k$, we obtain

$$h_{kk} = \frac{1}{n} + \frac{D_k^2}{n} \qquad (146)$$

where

$$D_k^2 = \begin{pmatrix} x_{k1} - \bar x_1 & x_{k2} - \bar x_2 & \cdots & x_{km} - \bar x_m \end{pmatrix}\begin{pmatrix}
S^{11(2)} & \cdots & S^{1m(2)}\\
\vdots & \ddots & \vdots\\
S^{m1(2)} & \cdots & S^{mm(2)}
\end{pmatrix}\begin{pmatrix} x_{k1} - \bar x_1\\ x_{k2} - \bar x_2\\ \vdots\\ x_{km} - \bar x_m \end{pmatrix} \qquad (147)$$

This $D_k^2$ is called the square of the Mahalanobis distance. The sum of the leverage ratios is given by

$$\sum_{k=1}^{n} h_{kk} = \sum_{k=1}^{n}\frac{1}{n} + \frac{1}{n}\sum_{k=1}^{n} D_k^2 = 1 + \frac{1}{n}\sum_{k=1}^{n}\sum_{i=1}^{m}\sum_{j=1}^{m}\left(x_{ki} - \bar x_i\right)\left(x_{kj} - \bar x_j\right)S^{ij(2)} = 1 + \sum_{i=1}^{m}\sum_{j=1}^{m} S_{ij}^{(2)} S^{ij(2)} \qquad (148)$$

Since

$$\left(S_{ij}^{(2)}\right)\left(S^{ij(2)}\right) = \begin{pmatrix} 1 & 0 & \cdots & 0\\ 0 & 1 & & \vdots\\ \vdots & & \ddots & 0\\ 0 & \cdots & 0 & 1 \end{pmatrix} \qquad (149)$$

we obtain

$$\sum_{k=1}^{n} h_{kk} = 1 + \sum_{i=1}^{m}\sum_{j=1}^{m} S_{ij}^{(2)} S^{ij(2)} = 1 + m \qquad (150)$$

The average leverage ratio is then given by

$$\bar h_{kk} = \frac{m + 1}{n} \qquad (151)$$

Therefore, we use a critical value for $h_{kk}$ as

$$h_{kk} > \alpha\,\bar h_{kk} \qquad (152)$$

where $\alpha = 2$ is frequently used, and evaluate whether the data point is an outlier or not.
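A minimal sketch of Eqs. (138)–(152) (the function name and the NumPy route are our own choices):

```python
import numpy as np

def leverage_and_t(x, y, Y, phi_e):
    """Sketch of Eqs. (138)-(152): leverage ratios via the Mahalanobis
    distance and studentized residuals."""
    n, m = x.shape
    S_inv = np.linalg.inv(np.cov(x, rowvar=False, bias=True))
    d = x - x.mean(axis=0)
    D2 = np.einsum('ki,ij,kj->k', d, S_inv, d)   # Eq. (147), one D_k^2 per row
    h = (1 + D2) / n                             # Eq. (146)
    se2 = np.sum((y - Y) ** 2) / phi_e           # Eq. (140)
    t = (y - Y) / np.sqrt(se2 * (1 - h))         # Eqs. (139), (141)
    return h, t, h > 2 * (m + 1) / n             # Eq. (152) with alpha = 2
```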
3.4. Prediction of Coefficients of a Regression Line for Multi Variables

We can simply extend the two-variable forms to the variables $\hat a_0, \hat a_1, \ldots, \hat a_m$. We assume that the population has a regression line given by

$$y_i = a_0 + a_1 x_{i1} + a_2 x_{i2} + \cdots + a_m x_{im} + e_i \quad \text{for } i = 1, 2, \ldots, n \qquad (153)$$

where $x_{i1}, x_{i2}, \ldots, x_{im}$ are the given variables, and $a_0, a_1, a_2, \ldots, a_m$ are established values for the population. Each $e_i$ is independent of the others and follows a normal distribution $N\left(0, \sigma_e^2\right)$. The only probability variable in Eq. (153) is then $e_i$.

Extending the forms for two variables, we can evaluate the averages and variances of the factors as follows. The averages of the factors are given by

$$E\left(\hat a_k\right) = a_k \quad \text{for } k = 0, 1, 2, \ldots, m \qquad (154)$$

The variances of the factors are given by
$$V\left(\hat a_k\right) = \frac{S^{kk(2)}}{n}s_e^2 \quad \text{for } k = 1, 2, \ldots, m \qquad (155)$$

$$V\left(\hat a_0\right) = \frac{1}{n}\left[1 + \begin{pmatrix} \bar x_1 & \bar x_2 & \cdots & \bar x_m \end{pmatrix}\begin{pmatrix}
S^{11(2)} & \cdots & S^{1m(2)}\\
\vdots & \ddots & \vdots\\
S^{m1(2)} & \cdots & S^{mm(2)}
\end{pmatrix}\begin{pmatrix} \bar x_1\\ \bar x_2\\ \vdots\\ \bar x_m \end{pmatrix}\right]s_e^2 \qquad (156)$$

The covariance between $\hat a_j$ and $\hat a_l$ is given by

$$\mathrm{Cov}\left(\hat a_j, \hat a_l\right) = \frac{1}{n}S^{jl(2)} s_e^2 \quad \text{for } j, l = 1, 2, \ldots, m \qquad (157)$$

The covariance between $\hat a_0$ and $\hat a_j$ is given by

$$\mathrm{Cov}\left(\hat a_0, \hat a_j\right) = -\frac{1}{n}\begin{pmatrix} S^{j1(2)} & S^{j2(2)} & \cdots & S^{jm(2)} \end{pmatrix}\begin{pmatrix} \bar x_1\\ \bar x_2\\ \vdots\\ \bar x_m \end{pmatrix}s_e^2 \qquad (158)$$

We assume that $\hat a_0, \hat a_1, \hat a_2, \ldots, \hat a_m$ follow normal distributions, and hence assume that the normalized variable

$$t = \frac{\hat a_i - a_i}{\sqrt{V\left(\hat a_i\right)}} \quad \text{for } i = 0, 1, 2, \ldots, m \qquad (159)$$

follows a $t$ distribution with a degree of freedom $\phi_e$, where

$$\phi_e = n - m - 1 \qquad (160)$$

Therefore, the predictive range is given by

$$\hat a_i - t_p\left(\phi_e; P\right)\sqrt{V\left(\hat a_i\right)} \le a_i \le \hat a_i + t_p\left(\phi_e; P\right)\sqrt{V\left(\hat a_i\right)} \qquad (161)$$
3.5. Estimation of a Regression Line

We assume that we obtain $n$ sets of data $\left(x_{i1}, x_{i2}, \ldots, x_{im}, y_i\right)$. The corresponding regression value is given by

$$Y = \hat a_0 + \hat a_1 x_1 + \hat a_2 x_2 + \cdots + \hat a_m x_m \qquad (162)$$

The expected value is given by

$$E\left(Y\right) = E\left(\hat a_0\right) + E\left(\hat a_1\right)x_1 + \cdots + E\left(\hat a_m\right)x_m = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_m x_m \qquad (163)$$

The variance of $Y$ is given by

$$V\left(Y\right) = V\left(\hat a_0 + \hat a_1 x_1 + \cdots + \hat a_m x_m\right) = \frac{1}{n}\left(1 + D^2\right)s_e^2 \qquad (164)$$

where

$$D^2 = \begin{pmatrix} x_1 - \bar x_1 & x_2 - \bar x_2 & \cdots & x_m - \bar x_m \end{pmatrix}\begin{pmatrix}
S^{11(2)} & \cdots & S^{1m(2)}\\
\vdots & \ddots & \vdots\\
S^{m1(2)} & \cdots & S^{mm(2)}
\end{pmatrix}\begin{pmatrix} x_1 - \bar x_1\\ x_2 - \bar x_2\\ \vdots\\ x_m - \bar x_m \end{pmatrix} \qquad (165)$$

We can then estimate the value range of the regression. The regression range is given by

$$\hat a_0 + \hat a_1 x_1 + \cdots + \hat a_m x_m - t_p\left(\phi_e; P\right)\sqrt{\frac{1 + D^2}{n}s_e^2} \le Y \le \hat a_0 + \hat a_1 x_1 + \cdots + \hat a_m x_m + t_p\left(\phi_e; P\right)\sqrt{\frac{1 + D^2}{n}s_e^2} \qquad (166)$$

The objective variable is given by

$$y = \hat a_0 + \hat a_1 x_1 + \hat a_2 x_2 + \cdots + \hat a_m x_m + e \qquad (167)$$

and the corresponding range is given by

$$\hat a_0 + \hat a_1 x_1 + \cdots + \hat a_m x_m - t\left(\phi_e; P\right)\sqrt{\left(1 + \frac{1 + D^2}{n}\right)s_e^2} \le y \le \hat a_0 + \hat a_1 x_1 + \cdots + \hat a_m x_m + t\left(\phi_e; P\right)\sqrt{\left(1 + \frac{1 + D^2}{n}\right)s_e^2} \qquad (168)$$
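Since the `regression_bands` sketch given after Eq. (96) is written entirely in matrix form, it carries over to $m$ explanatory variables without change; only the shape of the data array differs. For example (still assuming that hypothetical helper):

```python
# x has shape (n, m); the same helper returns the Eq. (166)/(168) bands.
Y_hat, reg_band, obj_band = regression_bands(x, y, x_new=x.mean(axis=0))
```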
3.6. The Relationship between Multiple Coefficients and Correlation Factors

We study the relationship between the multiple regression coefficients and the correlation factors. We focus on the two variables $x_1$ and $x_2$ for simplicity, and also assume that the variables are normalized. The coefficients $a_1, a_2$ are related to the correlation factors as

$$\begin{pmatrix} 1 & r_{12}\\ r_{12} & 1 \end{pmatrix}\begin{pmatrix} a_1\\ a_2 \end{pmatrix} = \begin{pmatrix} r_1\\ r_2 \end{pmatrix} \qquad (169)$$

Therefore, the coefficients are solved as

$$\begin{pmatrix} a_1\\ a_2 \end{pmatrix} = \begin{pmatrix} 1 & r_{12}\\ r_{12} & 1 \end{pmatrix}^{-1}\begin{pmatrix} r_1\\ r_2 \end{pmatrix} = \frac{1}{1 - r_{12}^2}\begin{pmatrix} 1 & -r_{12}\\ -r_{12} & 1 \end{pmatrix}\begin{pmatrix} r_1\\ r_2 \end{pmatrix} = \frac{1}{1 - r_{12}^2}\begin{pmatrix} r_1 - r_{12}r_2\\ r_2 - r_{12}r_1 \end{pmatrix} \qquad (170)$$

We then obtain

$$a_1 = \frac{r_1 - r_{12}r_2}{1 - r_{12}^2} \qquad (171)$$

$$a_2 = \frac{r_2 - r_{12}r_1}{1 - r_{12}^2} \qquad (172)$$

We assume that $r_1 \ge r_2$, which does not lose generality since the form is symmetrical. If there is no correlation between the two variables, that is, $r_{12} = 0$, we obtain

$$a_1 = r_1 \qquad (173)$$

$$a_2 = r_2 \qquad (174)$$

The coefficients are equal to the corresponding correlation factors, as is expected.
[Figure 1 plots the coefficients $a_1$ and $a_2$ against $r_{12}$ for $r_1 = 0.8$ and $r_2 = 0.5$.]

Figure 1. Relationship between multiple factor and the correlation factors.
Figure 1 shows the dependence of the coefficients on the correlation factor $r_{12}$ between the two variables. The coefficient $a_1$, associated with the larger $r_1$, decreases slightly with increasing $r_{12}$ at first, and then increases significantly for larger $r_{12}$. On the other hand, the coefficient $a_2$, associated with the smaller $r_2$, decreases slowly for small $r_{12}$, and decreases significantly for larger $r_{12}$.

Differentiating $a_1$ with respect to $r_{12}$, we obtain

$$\frac{\partial a_1}{\partial r_{12}} = \frac{-r_2\left(1 - r_{12}^2\right) + 2r_{12}\left(r_1 - r_{12}r_2\right)}{\left(1 - r_{12}^2\right)^2} = -\frac{r_2\left(r_{12}^2 - 2\dfrac{r_1}{r_2}r_{12} + 1\right)}{\left(1 - r_{12}^2\right)^2} \qquad (175)$$

Setting Eq. (175) to zero, we obtain

$$r_{12}^2 - 2\frac{r_1}{r_2}r_{12} + 1 = 0 \qquad (176)$$

Solving this equation, we obtain

$$r_{12} = \frac{r_1}{r_2} \pm \sqrt{\left(\frac{r_1}{r_2}\right)^2 - 1} \qquad (177)$$

Since $r_{12}$ is smaller than 1, we select the root with the negative sign before the square root, and it is reduced to

$$r_{12} = \frac{r_1}{r_2} - \sqrt{\left(\frac{r_1}{r_2}\right)^2 - 1} = \frac{1}{\dfrac{r_1}{r_2} + \sqrt{\left(\dfrac{r_1}{r_2}\right)^2 - 1}} \qquad (178)$$

$a_1$ has its minimum value at this $r_{12}$, and then increases with further increase of $r_{12}$. On the other hand, differentiating $a_2$ with respect to $r_{12}$ and setting the derivative to zero, we obtain

$$r_{12}^2 - 2\frac{r_2}{r_1}r_{12} + 1 = 0 \qquad (179)$$

Solving this with respect to $r_{12}$, we obtain

$$r_{12} = \frac{r_2}{r_1} \pm \sqrt{\left(\frac{r_2}{r_1}\right)^2 - 1} \qquad (180)$$

Since $r_2 < r_1$, the term in the square root becomes negative, and hence there is no real root. This means that the coefficient $a_2$ decreases monotonically, as is expected from the figure.

Inspecting the above, we can expect the multiple regression coefficients not to be far from the correlation factors as long as $r_{12}$ stays below the minimum point given by Eq. (178). In this region, we can roughly picture both coefficients as somewhat smaller than the independent correlation factors, due to the interaction between the explanatory variables. However, once $r_{12}$ exceeds this value, the coefficients become significantly different from the correlation factors, and the behavior of the multiple regression coefficients becomes complicated. We cannot form a clear picture of the coefficient values under such a strong interaction between explanatory variables. It should therefore be kept in mind that we should use explanatory variables that are as independent of each other as possible.
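The curves in Figure 1 can be reproduced directly from Eqs. (171) and (172); a minimal sketch with the figure's values:

```python
import numpy as np

# Sketch of Eqs. (171)-(172): normalized two-variable regression
# coefficients as a function of the inter-variable correlation r12.
r1, r2 = 0.8, 0.5                    # values used in Figure 1
for r12 in np.linspace(0.0, 0.95, 5):
    a1 = (r1 - r12 * r2) / (1 - r12 ** 2)
    a2 = (r2 - r12 * r1) / (1 - r12 ** 2)
    print(f"r12={r12:.2f}  a1={a1:+.3f}  a2={a2:+.3f}")
```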
3.7. Partial Correlation Factor

We can extend the partial correlation factor, which was discussed in the previous chapter, to multiple variables. We consider an objective variable $y$ and $m$ kinds of explanatory variables $x_1, x_2, \ldots, x_m$.

Among the explanatory variables, we consider the correlation between $y$ and $x_1$. In this case, both $y$ and $x_1$ have relationships with the other explanatory variables. Excluding the variable $x_1$, we can evaluate the regression relationship of $y$ as

$$Y_k = c_0 + c_2 x_{k2} + c_3 x_{k3} + \cdots + c_m x_{km} \qquad (181)$$

where

$$\begin{pmatrix} c_2\\ c_3\\ \vdots\\ c_m \end{pmatrix} = \begin{pmatrix}
S_{22}^{(2)} & S_{23}^{(2)} & \cdots & S_{2m}^{(2)}\\
S_{32}^{(2)} & S_{33}^{(2)} & \cdots & S_{3m}^{(2)}\\
\vdots & \vdots & \ddots & \vdots\\
S_{m2}^{(2)} & S_{m3}^{(2)} & \cdots & S_{mm}^{(2)}
\end{pmatrix}^{-1}\begin{pmatrix} S_{2y}^{(2)}\\ S_{3y}^{(2)}\\ \vdots\\ S_{my}^{(2)} \end{pmatrix} \qquad (182)$$

We can decide $c_2, c_3, \ldots, c_m$ using Eq. (182). After obtaining them, we can decide $c_0$ from the equation below:

$$\bar y = c_0 + c_2 \bar x_2 + c_3 \bar x_3 + \cdots + c_m \bar x_m \qquad (183)$$

On the other hand, we regard $x_1$ as an objective variable, and it is expressed by

$$X_{k1} = d_0 + d_2 x_{k2} + d_3 x_{k3} + \cdots + d_m x_{km} \qquad (184)$$

where

$$\begin{pmatrix} d_2\\ d_3\\ \vdots\\ d_m \end{pmatrix} = \begin{pmatrix}
S_{22}^{(2)} & S_{23}^{(2)} & \cdots & S_{2m}^{(2)}\\
S_{32}^{(2)} & S_{33}^{(2)} & \cdots & S_{3m}^{(2)}\\
\vdots & \vdots & \ddots & \vdots\\
S_{m2}^{(2)} & S_{m3}^{(2)} & \cdots & S_{mm}^{(2)}
\end{pmatrix}^{-1}\begin{pmatrix} S_{21}^{(2)}\\ S_{31}^{(2)}\\ \vdots\\ S_{m1}^{(2)} \end{pmatrix} \qquad (185)$$

$d_2, d_3, \ldots, d_m$ can be decided from this equation, and $d_0$ can be evaluated as

$$\bar x_1 = d_0 + d_2 \bar x_2 + d_3 \bar x_3 + \cdots + d_m \bar x_m \qquad (186)$$

We then introduce the following two variables:

$$u_k = y_k - Y_k \qquad (187)$$

$$v_{k1} = x_{k1} - X_{k1} \qquad (188)$$
These variables can be regarded as the ones after eliminating the influence of the other explanatory variables. We can then evaluate the following:

$$\bar u = \frac{1}{n}\sum_{k=1}^{n} u_k \qquad (189)$$

$$\bar v_1 = \frac{1}{n}\sum_{k=1}^{n} v_{k1} \qquad (190)$$

$$S_{uu}^{(2)} = \frac{1}{n}\sum_{k=1}^{n}\left(u_k - \bar u\right)^2 \qquad (191)$$

$$S_{v_1 v_1}^{(2)} = \frac{1}{n}\sum_{k=1}^{n}\left(v_{k1} - \bar v_1\right)^2 \qquad (192)$$

$$S_{uv_1}^{(2)} = \frac{1}{n}\sum_{k=1}^{n}\left(u_k - \bar u\right)\left(v_{k1} - \bar v_1\right) \qquad (193)$$

Therefore, the corresponding partial correlation factor $r_{1y \cdot 23\cdots m}$ can be evaluated as

$$r_{1y \cdot 23\cdots m} = \frac{S_{uv_1}^{(2)}}{\sqrt{S_{v_1 v_1}^{(2)} S_{uu}^{(2)}}} \qquad (194)$$

We can evaluate the partial correlation factors for the other explanatory variables in the same way.
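A minimal sketch of the residual-regression route of Eqs. (181)–(194) (the function name is our own; the two auxiliary regressions are solved by least squares):

```python
import numpy as np

def partial_corr(x, y, j=0):
    """Sketch of Eqs. (181)-(194): partial correlation between y and x[:, j]
    after regressing both on the remaining columns."""
    n, m = x.shape
    Z = np.hstack([np.ones((n, 1)), np.delete(x, j, axis=1)])
    u = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]              # Eq. (187)
    v = x[:, j] - Z @ np.linalg.lstsq(Z, x[:, j], rcond=None)[0]  # Eq. (188)
    du, dv = u - u.mean(), v - v.mean()
    return np.mean(du * dv) / np.sqrt(np.mean(du**2) * np.mean(dv**2))  # (194)
```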
3.8. Selection of Explanatory Variables

We have used many explanatory variables. However, some of them may not contribute to expressing the objective variable. Furthermore, if the interaction between explanatory variables is significant, the determinant of the matrix becomes close to zero, and the related elements of the inverse matrix become unstable, or we cannot obtain results due to the related numerical error; this is called multicollinearity. A further difficulty is that the determination coefficient increases with an increasing number of explanatory variables. Therefore, we cannot use the determination coefficient itself to evaluate the validity of the regression.

We start with a regression without any explanatory variable, and denote it as model 0. The regression is given by

$$\text{Model 0}: \quad Y_i = \bar y \qquad (195)$$

The corresponding variance $S_{e\left(M0\right)}^2$ is given by

$$S_{e\left(M0\right)}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - Y_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar y\right)^2 = S_{yy}^2 \qquad (196)$$

In the next step, we evaluate the validity of $x_1, x_2, \ldots, x_m$, and denote the model as model 1. The regression using one explanatory variable $x_{l_1}$ is given by

$$\text{Model 1}: \quad Y_i = a_0 + a_1 x_{il_1} \qquad (197)$$

The corresponding variance $S_{e\left(M1\right)}^2$ is given by

$$S_{e\left(M1\right)}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - Y_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - a_0 - a_1 x_{il_1}\right)^2 \qquad (198)$$
Then the variable

$$F_1 = \frac{\left(nS_{e\left(M0\right)}^2 - nS_{e\left(M1\right)}^2\right)/\left(\phi_{e\left(M0\right)} - \phi_{e\left(M1\right)}\right)}{nS_{e\left(M1\right)}^2/\phi_{e\left(M1\right)}} \qquad (199)$$

follows an $F$ distribution $F\left(\phi_{e\left(M0\right)} - \phi_{e\left(M1\right)}, \phi_{e\left(M1\right)}\right)$, where $\phi_{e\left(M0\right)}$ and $\phi_{e\left(M1\right)}$ are given by

$$\phi_{e\left(M0\right)} = n - 1 \qquad (200)$$

$$\phi_{e\left(M1\right)} = n - 2 \qquad (201)$$

We can judge the validity of the explanatory variable as

$$\begin{cases} F_1 \ge F\left(\phi_{e\left(M0\right)} - \phi_{e\left(M1\right)}, \phi_{e\left(M1\right)}\right) & \text{valid}\\ F_1 < F\left(\phi_{e\left(M0\right)} - \phi_{e\left(M1\right)}, \phi_{e\left(M1\right)}\right) & \text{invalid} \end{cases} \qquad (202)$$

We evaluate $F_1$ for each candidate variable, that is, for $l_1 = 1, 2, \ldots, m$. If all values of $F_1$ are invalid, we use model 0 and the process ends. If at least one $F_1$ is valid, we select the maximum one, and the model 1 regression is determined as

$$Y_i = a_0 + a_1 x_{il_1} \qquad (203)$$
where $l_1$ corresponds to the variable selected in the above step. We then evaluate a second variable and denote the model as model 2. The second-variable candidates are the remaining ones after the first variable was selected. The corresponding regression is given by

$$Y_i = a_0 + a_1 x_{il_1} + a_2 x_{il_2} \qquad (204)$$

The related variance is given by

$$S_{e\left(M2\right)}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - Y_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - a_0 - a_1 x_{il_1} - a_2 x_{il_2}\right)^2 \qquad (205)$$

Then the variable

$$F_2 = \frac{\left(nS_{e\left(M1\right)}^2 - nS_{e\left(M2\right)}^2\right)/\left(\phi_{e\left(M1\right)} - \phi_{e\left(M2\right)}\right)}{nS_{e\left(M2\right)}^2/\phi_{e\left(M2\right)}} \qquad (206)$$

follows an $F$ distribution $F\left(\phi_{e\left(M1\right)} - \phi_{e\left(M2\right)}, \phi_{e\left(M2\right)}\right)$, where $\phi_{e\left(M1\right)}$ and $\phi_{e\left(M2\right)}$ are given by

$$\phi_{e\left(M1\right)} = n - 2 \qquad (207)$$

$$\phi_{e\left(M2\right)} = n - 3 \qquad (208)$$

We can judge the validity of the explanatory variable as

$$\begin{cases} F_2 \ge F\left(\phi_{e\left(M1\right)} - \phi_{e\left(M2\right)}, \phi_{e\left(M2\right)}\right) & \text{valid}\\ F_2 < F\left(\phi_{e\left(M1\right)} - \phi_{e\left(M2\right)}, \phi_{e\left(M2\right)}\right) & \text{invalid} \end{cases} \qquad (209)$$

If all values of $F_2$ are invalid, we use model 1 and the process ends. If at least one $F_2$ is valid, we select the maximum one, and the model 2 regression is determined as

$$Y_i = a_0 + a_1 x_{il_1} + a_2 x_{il_2} \qquad (210)$$

where $l_2$ corresponds to the variable selected in the above step. We repeat the process until all $F$ values are invalid or all variables have been checked, as in the sketch below.
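The whole selection procedure is a short loop; a minimal sketch (the function name is our own, and residual sums of squares stand in for $nS_e^2$):

```python
import numpy as np
from scipy import stats

def forward_selection(x, y, P=0.95):
    """Sketch of Section 3.8: forward selection by the F test of
    Eqs. (199)-(210)."""
    n, m = x.shape
    chosen = []
    while len(chosen) < m:
        base = np.hstack([np.ones((n, 1)), x[:, chosen]])
        r0 = y - base @ np.linalg.lstsq(base, y, rcond=None)[0]
        phi0 = n - 1 - len(chosen)         # freedom of the current model
        best, best_F = None, -np.inf
        for j in set(range(m)) - set(chosen):
            X = np.hstack([base, x[:, [j]]])
            r1 = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
            F = (r0 @ r0 - r1 @ r1) / (r1 @ r1 / (phi0 - 1))
            if F > best_F:
                best, best_F = j, F
        if best_F < stats.f.ppf(P, 1, phi0 - 1):
            break                          # all remaining variables invalid
        chosen.append(best)
    return chosen
```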
4. MATRIX EXPRESSION FOR MULTIPLE REGRESSION

The multiple regression can be expressed in a matrix form. This gives no new information; however, the expression is simple, and one can use the method alternatively. The objective variable and $p$ kinds of explanatory variables are related as follows:

$$y_i = a_0 + a_1\left(x_{i1} - \bar x_1\right) + a_2\left(x_{i2} - \bar x_2\right) + \cdots + a_p\left(x_{ip} - \bar x_p\right) + e_i \qquad (211)$$

where we have $n$ samples. Therefore, all the data are related as

$$\begin{aligned}
y_1 &= a_0 + a_1\left(x_{11} - \bar x_1\right) + a_2\left(x_{12} - \bar x_2\right) + \cdots + a_p\left(x_{1p} - \bar x_p\right) + e_1\\
y_2 &= a_0 + a_1\left(x_{21} - \bar x_1\right) + a_2\left(x_{22} - \bar x_2\right) + \cdots + a_p\left(x_{2p} - \bar x_p\right) + e_2\\
&\vdots\\
y_n &= a_0 + a_1\left(x_{n1} - \bar x_1\right) + a_2\left(x_{n2} - \bar x_2\right) + \cdots + a_p\left(x_{np} - \bar x_p\right) + e_n
\end{aligned} \qquad (212)$$

We introduce matrices and vectors as

$$Y = \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{pmatrix} \qquad (213)$$

$$X = \begin{pmatrix}
1 & x_{11} - \bar x_1 & x_{12} - \bar x_2 & \cdots & x_{1p} - \bar x_p\\
1 & x_{21} - \bar x_1 & x_{22} - \bar x_2 & \cdots & x_{2p} - \bar x_p\\
\vdots & \vdots & \vdots & & \vdots\\
1 & x_{n1} - \bar x_1 & x_{n2} - \bar x_2 & \cdots & x_{np} - \bar x_p
\end{pmatrix} \qquad (214)$$

$$\boldsymbol{\beta} = \begin{pmatrix} a_0\\ a_1\\ \vdots\\ a_p \end{pmatrix} \qquad (215)$$

$$\mathbf{e} = \begin{pmatrix} e_1\\ e_2\\ \vdots\\ e_n \end{pmatrix} \qquad (216)$$

Using these matrices and vectors, the corresponding equation is expressed by

$$Y = X\boldsymbol{\beta} + \mathbf{e} \qquad (217)$$

The total sum of squares of the errors, $Q$, is then expressed by

$$\begin{aligned}
Q &= \mathbf{e}^T\mathbf{e} = \left(Y - X\boldsymbol{\beta}\right)^T\left(Y - X\boldsymbol{\beta}\right)\\
&= Y^T Y - Y^T X\boldsymbol{\beta} - \boldsymbol{\beta}^T X^T Y + \boldsymbol{\beta}^T X^T X\boldsymbol{\beta}\\
&= Y^T Y - 2\boldsymbol{\beta}^T X^T Y + \boldsymbol{\beta}^T X^T X\boldsymbol{\beta}
\end{aligned} \qquad (218)$$

We obtain the optimum value by minimizing $Q$. This can be done by vector differentiation (see Appendix 1-10 of volume 3) with respect to $\boldsymbol{\beta}$:

$$\frac{\partial Q}{\partial \boldsymbol{\beta}} = -2\frac{\partial}{\partial \boldsymbol{\beta}}\left(\boldsymbol{\beta}^T X^T Y\right) + \frac{\partial}{\partial \boldsymbol{\beta}}\left(\boldsymbol{\beta}^T X^T X\boldsymbol{\beta}\right) = -2X^T Y + 2X^T X\boldsymbol{\beta} = 0 \qquad (219)$$

Therefore, we obtain

$$\boldsymbol{\beta} = \left(X^T X\right)^{-1}X^T Y \qquad (220)$$
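A minimal sketch of Eq. (220) (the function name is our own); solving the normal equations directly is fine at this scale, though a QR-based solver such as numpy.linalg.lstsq is numerically preferable:

```python
import numpy as np

def fit_matrix(x, y):
    """Sketch of Eq. (220): least squares in matrix form; x is (n, p)."""
    X = np.hstack([np.ones((len(y), 1)), x - x.mean(axis=0)])  # Eq. (214)
    return np.linalg.solve(X.T @ X, X.T @ y)                   # Eq. (220)
```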
SUMMARY

To summarize the results in this chapter—

We assume that the regression is expressed with $m$ explanatory variables as

$$Y = a_0 + a_1 X_1 + a_2 X_2 + \cdots + a_m X_m$$

The coefficients are given by

$$\begin{pmatrix} \hat a_1\\ \hat a_2\\ \vdots\\ \hat a_m \end{pmatrix} = \begin{pmatrix}
S_{11}^{(2)} & S_{12}^{(2)} & \cdots & S_{1m}^{(2)}\\
S_{21}^{(2)} & S_{22}^{(2)} & \cdots & S_{2m}^{(2)}\\
\vdots & \vdots & \ddots & \vdots\\
S_{m1}^{(2)} & S_{m2}^{(2)} & \cdots & S_{mm}^{(2)}
\end{pmatrix}^{-1}\begin{pmatrix} S_{1y}^{(2)}\\ S_{2y}^{(2)}\\ \vdots\\ S_{my}^{(2)} \end{pmatrix}$$

The accuracy of the regression can be evaluated with the determination coefficient

$$R^2 = \frac{S_r^2}{S_{yy}^2}$$

where $S_{yy}^2$ and $S_r^2$, the variance of $Y$, are given by

$$S_{yy}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar y\right)^2, \qquad S_r^2 = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \bar y\right)^2$$

The validity of the regression can be evaluated with a parameter $F$ given by

$$F = \frac{s_r^2}{s_e^2}$$

where

$$s_r^2 = \frac{nS_r^2}{m}, \qquad s_e^2 = \frac{nS_e^2}{n - m - 1}$$

The variance $S_e^2$ is given by

$$S_e^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - Y_i\right)^2$$

We also evaluate the critical $F$ value $F_p\left(m, n - m - 1\right)$, and judge as

$$\begin{cases} F \le F_p\left(m, n - m - 1\right) & \text{invalid}\\ F > F_p\left(m, n - m - 1\right) & \text{valid} \end{cases}$$

The residual error for each datum can be evaluated as

$$t_k = \frac{z_{ek}}{\sqrt{1 - h_{kk}}}$$

where

$$z_{ek} = \frac{e_k}{\sqrt{s_e^2}}, \qquad h_{kk} = \frac{1}{n} + \frac{D_k^2}{n}$$

and

$$D_k^2 = \begin{pmatrix} x_{k1} - \bar x_1 & \cdots & x_{km} - \bar x_m \end{pmatrix}\begin{pmatrix}
S^{11(2)} & \cdots & S^{1m(2)}\\
\vdots & \ddots & \vdots\\
S^{m1(2)} & \cdots & S^{mm(2)}
\end{pmatrix}\begin{pmatrix} x_{k1} - \bar x_1\\ \vdots\\ x_{km} - \bar x_m \end{pmatrix}$$

This $D_k^2$ is called the square of the Mahalanobis distance.

The error range of the regression line is given by

$$\hat a_0 + \hat a_1 x_1 + \cdots + \hat a_m x_m - t_p\left(\phi_e; P\right)\sqrt{\frac{1 + D^2}{n}s_e^2} \le Y \le \hat a_0 + \hat a_1 x_1 + \cdots + \hat a_m x_m + t_p\left(\phi_e; P\right)\sqrt{\frac{1 + D^2}{n}s_e^2}$$

The error range of the objective values is given by

$$\hat a_0 + \hat a_1 x_1 + \cdots + \hat a_m x_m - t\left(\phi_e; P\right)\sqrt{\left(1 + \frac{1 + D^2}{n}\right)s_e^2} \le y \le \hat a_0 + \hat a_1 x_1 + \cdots + \hat a_m x_m + t\left(\phi_e; P\right)\sqrt{\left(1 + \frac{1 + D^2}{n}\right)s_e^2}$$

The partial correlation factor $r_{1y \cdot 23\cdots m}$, which is the pure correlation factor between the variable $x_1$ and the objective variable $y$, can be evaluated as

$$r_{1y \cdot 23\cdots m} = \frac{S_{uv_1}^{(2)}}{\sqrt{S_{v_1 v_1}^{(2)} S_{uu}^{(2)}}}$$

where the variables $u$ and $v_1$ are the objective and explanatory variables after eliminating the influence of the other variables:

$$u_k = y_k - Y_k, \qquad v_{k1} = x_{k1} - X_{k1}$$

$Y_k$ and $X_{k1}$ are regression values expressed by the other explanatory variables:

$$Y_k = c_0 + c_2 x_{k2} + c_3 x_{k3} + \cdots + c_m x_{km}$$

$$X_{k1} = d_0 + d_2 x_{k2} + d_3 x_{k3} + \cdots + d_m x_{km}$$

The corresponding coefficients are determined by the standard regression analysis:

$$\begin{pmatrix} c_2\\ \vdots\\ c_m \end{pmatrix} = \begin{pmatrix}
S_{22}^{(2)} & \cdots & S_{2m}^{(2)}\\
\vdots & \ddots & \vdots\\
S_{m2}^{(2)} & \cdots & S_{mm}^{(2)}
\end{pmatrix}^{-1}\begin{pmatrix} S_{2y}^{(2)}\\ \vdots\\ S_{my}^{(2)} \end{pmatrix}, \qquad
\begin{pmatrix} d_2\\ \vdots\\ d_m \end{pmatrix} = \begin{pmatrix}
S_{22}^{(2)} & \cdots & S_{2m}^{(2)}\\
\vdots & \ddots & \vdots\\
S_{m2}^{(2)} & \cdots & S_{mm}^{(2)}
\end{pmatrix}^{-1}\begin{pmatrix} S_{21}^{(2)}\\ \vdots\\ S_{m1}^{(2)} \end{pmatrix}$$

We can then decide $c_0$ and $d_0$ from

$$\bar y = c_0 + c_2 \bar x_2 + \cdots + c_m \bar x_m, \qquad \bar x_1 = d_0 + d_2 \bar x_2 + \cdots + d_m \bar x_m$$

For variable selection, we start with the regression without any explanatory variable, and denote it as model 0:

$$\text{Model 0}: \quad Y_i = \bar y, \qquad S_{e\left(M0\right)}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar y\right)^2 = S_{yy}^2$$

In the next step, we pick up one explanatory variable and denote the model as model 1:

$$Y_i = a_0 + a_1 x_{il_1}, \qquad S_{e\left(M1\right)}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - a_0 - a_1 x_{il_1}\right)^2$$

Then the variable

$$F_1 = \frac{\left(nS_{e\left(M0\right)}^2 - nS_{e\left(M1\right)}^2\right)/\left(\phi_{e\left(M0\right)} - \phi_{e\left(M1\right)}\right)}{nS_{e\left(M1\right)}^2/\phi_{e\left(M1\right)}}$$

follows an $F$ distribution $F\left(\phi_{e\left(M0\right)} - \phi_{e\left(M1\right)}, \phi_{e\left(M1\right)}\right)$, where $\phi_{e\left(M0\right)} = n - 1$ and $\phi_{e\left(M1\right)} = n - 2$. We judge the validity of the explanatory variable as

$$\begin{cases} F_1 \ge F\left(\phi_{e\left(M0\right)} - \phi_{e\left(M1\right)}, \phi_{e\left(M1\right)}\right) & \text{valid}\\ F_1 < F\left(\phi_{e\left(M0\right)} - \phi_{e\left(M1\right)}, \phi_{e\left(M1\right)}\right) & \text{invalid} \end{cases}$$

We select the variable that passes this evaluation with the maximum value, and repeat the process to validate further explanatory variables:

$$F_{i+1} = \frac{\left(nS_{e\left(Mi\right)}^2 - nS_{e\left(Mi+1\right)}^2\right)/\left(\phi_{e\left(Mi\right)} - \phi_{e\left(Mi+1\right)}\right)}{nS_{e\left(Mi+1\right)}^2/\phi_{e\left(Mi+1\right)}}$$

follows an $F$ distribution $F\left(\phi_{e\left(Mi\right)} - \phi_{e\left(Mi+1\right)}, \phi_{e\left(Mi+1\right)}\right)$, where $\phi_{e\left(Mi\right)} = n - i - 1$ and $\phi_{e\left(Mi+1\right)} = n - i - 2$. At each step we select the variable with the maximum value among those satisfying the criterion. When all evaluations are invalid, the selection of variables is finished.

The multiple regression can be expressed using a matrix form as

$$\boldsymbol{\beta} = \left(X^T X\right)^{-1}X^T Y$$

where

$$y_i = a_0 + a_1\left(x_{i1} - \bar x_1\right) + a_2\left(x_{i2} - \bar x_2\right) + \cdots + a_p\left(x_{ip} - \bar x_p\right) + e_i$$

and the matrices and vectors are defined as

$$Y = \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix}
1 & x_{11} - \bar x_1 & \cdots & x_{1p} - \bar x_p\\
1 & x_{21} - \bar x_1 & \cdots & x_{2p} - \bar x_p\\
\vdots & \vdots & & \vdots\\
1 & x_{n1} - \bar x_1 & \cdots & x_{np} - \bar x_p
\end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} a_0\\ a_1\\ \vdots\\ a_p \end{pmatrix}, \quad
\mathbf{e} = \begin{pmatrix} e_1\\ e_2\\ \vdots\\ e_n \end{pmatrix}$$
Chapter 5
STRUCTURAL EQUATION MODELING

ABSTRACT

In structural equation modeling (SEM), latent variables are introduced, and the relationships between these variables and the observed variable data are quantified. SEM therefore enables us to draw a systematic total relationship structure of the subject. We start with k kinds of observed variables, and propose a path diagram which shows the relationships of the observed variables and their causes or results by introducing latent variables. We evaluate the variances and covariances using the observed variable data, and they are also expressed with parameters related to the latent variables. We then determine the parameters so that the deviation between the two kinds of expressions for the variances and covariances is minimized.
Keywords: structural equation modeling, observed variable, latent variable, path diagram
1. INTRODUCTION

Structural equation modeling (SEM) models the causes of obtained data by introducing variables that depend on the analyst's conception. It is therefore vital for searching for the causes of obtained data, and SEM has been investigated intensively in recent years. It was developed in the field of psychology, where a doctor must search for the origin of a patient's behavior from observation or medical examination. The technology has also been applied to the marketing field, and the fields it accommodates continue to extend. On the other hand, the technology depends on the analyst's conception, and we do not obtain unique results with it. We need to understand the meaning of the analytical process, although it is quite easy to handle.
In this chapter, we focus on the basis of the technology. For applications, the reader should refer to other books.
2. PATH DIAGRAM

A path diagram is used in structural equation modeling. Figure 1 shows the graphical elements used in a path diagram: correlation, an arrow from cause to result, an observed variable, a latent variable, and an error. The data structure is described using these elements.
Figure 1. Figure elements used in a path diagram.
Figure 2. Example of a path diagram.
3. BASIC EQUATION

We form a path diagram using the elements, as shown in Figure 2, and investigate the corresponding basic equations. We have the following equations from the path diagram:

$$x_{1i} = \alpha_1 f_i + e_{1i} \qquad (1)$$

$$x_{2i} = \alpha_2 f_i + e_{2i} \qquad (2)$$

$$x_{3i} = \alpha_3 g_i + e_{3i} \qquad (3)$$

$$x_{4i} = \alpha_4 g_i + e_{4i} \qquad (4)$$

$$g_i = \beta f_i + e_{gi} \qquad (5)$$

where $i$ is the data ID and we assume $n$ data, that is, $i = 1, 2, \ldots, n$. The variance of the variable $x_{1i}$ is defined as

$$S_{11}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_{1i} - \bar x_1\right)^2 \qquad (6)$$

We can evaluate this variance using the observed data. The variance can also be expressed with the latent variable as

$$S_{11}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_{1i} - \bar x_1\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\alpha_1 f_i + e_{1i} - \bar x_1\right)^2 \qquad (7)$$

Further, we obtain from Eq. (1)

$$\bar x_1 = \frac{1}{n}\sum_{i=1}^{n} x_{1i} = \frac{\alpha_1}{n}\sum_{i=1}^{n} f_i + \frac{1}{n}\sum_{i=1}^{n} e_{1i} = \alpha_1 \bar f + \bar e_1 \qquad (8)$$

where

$$\bar f = \frac{1}{n}\sum_{i=1}^{n} f_i \qquad (9)$$

$$\bar e_1 = \frac{1}{n}\sum_{i=1}^{n} e_{1i} \qquad (10)$$

Substituting these into Eq. (7), we obtain

$$\begin{aligned}
S_{11}^2 &= \frac{1}{n}\sum_{i=1}^{n}\left[\alpha_1\left(f_i - \bar f\right) + \left(e_{1i} - \bar e_1\right)\right]^2\\
&= \frac{\alpha_1^2}{n}\sum_{i=1}^{n}\left(f_i - \bar f\right)^2 + \frac{2\alpha_1}{n}\sum_{i=1}^{n}\left(f_i - \bar f\right)\left(e_{1i} - \bar e_1\right) + \frac{1}{n}\sum_{i=1}^{n}\left(e_{1i} - \bar e_1\right)^2\\
&= \alpha_1^2 S_{ff}^2 + S_{e1}^2
\end{aligned} \qquad (11)$$

where we assume that the correlation between the latent variable and the error is zero. We can obtain the other variances in a similar way:

$$S_{22}^2 = \alpha_2^2 S_{ff}^2 + S_{e2}^2 \qquad (12)$$

$$S_{33}^2 = \alpha_3^2 S_{gg}^2 + S_{e3}^2 \qquad (13)$$

$$S_{44}^2 = \alpha_4^2 S_{gg}^2 + S_{e4}^2 \qquad (14)$$

$$S_{gg}^2 = \beta^2 S_{ff}^2 + S_{eg}^2 \qquad (15)$$
The latent variable $g$ can be expressed with the other latent variable $f$. Therefore, we eliminate $S_{gg}^2$ from $S_{33}^2$ and $S_{44}^2$ as

$$S_{33}^2 = \alpha_3^2 S_{gg}^2 + S_{e3}^2 = \alpha_3^2\left(\beta^2 S_{ff}^2 + S_{eg}^2\right) + S_{e3}^2 = \alpha_3^2\beta^2 S_{ff}^2 + \alpha_3^2 S_{eg}^2 + S_{e3}^2 \qquad (16)$$

$$S_{44}^2 = \alpha_4^2 S_{gg}^2 + S_{e4}^2 = \alpha_4^2\left(\beta^2 S_{ff}^2 + S_{eg}^2\right) + S_{e4}^2 = \alpha_4^2\beta^2 S_{ff}^2 + \alpha_4^2 S_{eg}^2 + S_{e4}^2 \qquad (17)$$
Next, we consider the covariances:

$$S_{12}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_{1i} - \bar x_1\right)\left(x_{2i} - \bar x_2\right) = \frac{1}{n}\sum_{i=1}^{n}\alpha_1\left(f_i - \bar f\right)\alpha_2\left(f_i - \bar f\right) = \alpha_1\alpha_2 S_{ff}^2 \qquad (18)$$

$$S_{13}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_{1i} - \bar x_1\right)\left(x_{3i} - \bar x_3\right) = \alpha_1\alpha_3\frac{1}{n}\sum_{i=1}^{n}\left(f_i - \bar f\right)\left(g_i - \bar g\right) \qquad (19)$$

Substituting $g_i$ of Eq. (5) into Eq. (19), we further obtain

$$S_{13}^2 = \alpha_1\alpha_3\frac{1}{n}\sum_{i=1}^{n}\left(f_i - \bar f\right)\left[\beta\left(f_i - \bar f\right) + \left(e_{gi} - \bar e_g\right)\right] = \alpha_1\alpha_3\beta S_{ff}^2 \qquad (20)$$

Similarly,

$$S_{14}^2 = \alpha_1\alpha_4\beta S_{ff}^2 \qquad (21)$$

$$S_{23}^2 = \alpha_2\alpha_3\beta S_{ff}^2 \qquad (22)$$

$$S_{24}^2 = \alpha_2\alpha_4\beta S_{ff}^2 \qquad (23)$$

$$S_{34}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_{3i} - \bar x_3\right)\left(x_{4i} - \bar x_4\right) = \alpha_3\alpha_4 S_{gg}^2 = \alpha_3\alpha_4\left(\beta^2 S_{ff}^2 + S_{eg}^2\right) \qquad (24)$$
We can evaluate all $S_{ij}^2$ using the observed data, and they are also expressed with the latent variables and the related parameters. The deviation between the observed expressions and the theoretical ones is denoted as $Q$ and is given by

$$\begin{aligned}
Q ={}& \left(S_{11}^2 - \alpha_1^2 S_{ff}^2 - S_{e1}^2\right)^2 + \left(S_{12}^2 - \alpha_1\alpha_2 S_{ff}^2\right)^2 + \left(S_{13}^2 - \alpha_1\alpha_3\beta S_{ff}^2\right)^2 + \left(S_{14}^2 - \alpha_1\alpha_4\beta S_{ff}^2\right)^2\\
&+ \left(S_{22}^2 - \alpha_2^2 S_{ff}^2 - S_{e2}^2\right)^2 + \left(S_{23}^2 - \alpha_2\alpha_3\beta S_{ff}^2\right)^2 + \left(S_{24}^2 - \alpha_2\alpha_4\beta S_{ff}^2\right)^2\\
&+ \left(S_{33}^2 - \alpha_3^2\beta^2 S_{ff}^2 - \alpha_3^2 S_{eg}^2 - S_{e3}^2\right)^2 + \left(S_{34}^2 - \alpha_3\alpha_4\beta^2 S_{ff}^2 - \alpha_3\alpha_4 S_{eg}^2\right)^2\\
&+ \left(S_{44}^2 - \alpha_4^2\beta^2 S_{ff}^2 - \alpha_4^2 S_{eg}^2 - S_{e4}^2\right)^2
\end{aligned} \qquad (25)$$

We impose that the parameters should be determined so that $Q$ is the minimum. The parameters we should decide are

$$\alpha_1, \alpha_2, \alpha_3, \alpha_4, \beta, S_{e1}^2, S_{e2}^2, S_{e3}^2, S_{e4}^2, S_{eg}^2, S_{ff}^2 \qquad (26)$$

We express the parameters symbolically as $\theta$. Partially differentiating $Q$ with respect to $\theta$, we impose that the derivative is zero:

$$\frac{\partial Q}{\partial \theta} = 0 \qquad (27)$$

Therefore, we obtain 11 equations for 11 parameters. Whether the parameters can be determined can be checked simply as follows. We denote the number of kinds of observed variables as $k$. In this case, we have $x_1, x_2, x_3, x_4$, and $k$ is 4. The number of variances and covariances is

$$\frac{1}{2}k\left(k + 1\right) \qquad (28)$$

In this case we have

$$\frac{1}{2}k\left(k + 1\right) = \frac{1}{2} \times 4 \times 5 = 10 \qquad (29)$$

We express the number of parameters as $p$. The following condition should hold to decide the parameters uniquely:

$$p \le \frac{1}{2}k\left(k + 1\right) \qquad (30)$$

In this case, $p = 11$, and hence the condition is not satisfied, and we cannot decide the parameters in the above case. When the imposed condition is not satisfied, we have two ways. One way is to fix some parameter values so that the condition becomes satisfied, and decide the other parameters. The other is to change the model.
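In practice, the $Q$ of Eq. (25) is minimized numerically rather than by solving Eq. (27) by hand. A minimal sketch under the symbols chosen above (the function names and the use of scipy.optimize are our own choices; as noted, with 11 parameters and 10 moments some parameters must be fixed before the fit is meaningful):

```python
import numpy as np
from scipy.optimize import minimize

def Q(theta, S):
    """Sketch of Eqs. (25)-(27): squared deviation between the observed
    4x4 moment matrix S and the model-implied one."""
    a1, a2, a3, a4, b, e1, e2, e3, e4, eg, ff = theta
    Sgg = b ** 2 * ff + eg                        # Eq. (15)
    lam = np.array([a1, a2, a3 * b, a4 * b])      # loadings on f
    M = np.outer(lam, lam) * ff                   # f-induced (co)variances
    M[2, 3] = M[3, 2] = a3 * a4 * Sgg             # Eq. (24)
    M += np.diag([e1, e2, a3**2 * eg + e3, a4**2 * eg + e4])
    return np.sum((S - M) ** 2)                   # Eq. (25)

# res = minimize(Q, x0=np.ones(11), args=(S_obs,))  # fix parameters first
```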
SUMMARY

To summarize the results in this chapter—

We evaluate the variances and covariances of $k$ kinds of observed data. That is, we obtain $S_{ij}^2$ based on the observed data. We also express the $S_{ij}^2$ in terms of the parameters assumed in the path diagram, and denote these expressions as $a_{ij}$. We then evaluate a parameter associated with the deviation as

$$Q = \sum_{i,j}\left(S_{ij}^2 - a_{ij}\right)^2$$

Denoting the parameters included in $a_{ij}$ as $\theta$, we impose that

$$\frac{\partial Q}{\partial \theta} = 0$$

This gives the same number of equations, $n$, as the number of parameters in the assumed path diagram. Setting the number of observed variables as $k$, the following condition must hold to decide the parameters:

$$n \le \frac{1}{2}k\left(k + 1\right)$$

When the imposed condition is not satisfied, we have two ways. One way is to fix some parameter values so that the condition becomes satisfied, and decide the other parameters. The other is to change the model.
Chapter 6
FUNDAMENTALS OF A BAYES' THEOREM

ABSTRACT

We assume that an event is related to some causes. When we obtain an event, the Bayes' theorem enables us to evaluate the probability of each cause of the event. The real cause is a definite one, and hence its probability must be 1. However, we do not know which is the real cause, and we predict it based on the obtained data. We can make the value of the probability close to 1 or 0 using Bayesian updating based on the obtained data. Many events are related to each other in many cases. The events are expressed by nodes, and the relationships are expressed by a network connected by lines with arrows, which is called a Bayesian network. All nodes have their conditional probabilities. The relationship between any two nodes in the network can be evaluated quantitatively using the conditional probabilities.
Keywords: Bayes’ theorem, conditional probability, Bayesian updating, priori probability, posterior probability, Bayesian network, parent node, child node
1. INTRODUCTION

A Bayes' theorem is based on conditional probability, which is well established in standard statistics. Therefore, the Bayes' theorem is not so different from standard statistics from that standpoint. In standard statistics, we basically obtain data associated with causes and predict results based on the data. The Bayes' theorem discusses subjects in the opposite direction, with an incomplete data set: we obtain the results and predict the related causes. Further, even when the prediction is not so accurate at the beginning, it is improved by updating with related data. We can obtain and accumulate vast data easily these days. Therefore, the Bayesian theory utilizing updated data plays an important role nowadays. We further treat the case where many events are related to each other, which can be analyzed with a Bayesian network.
2. A BAYES' THEOREM

Let us consider an example of a medical check, as shown in Figure 1. When we check sick people, we obtain a positive reaction with a ratio of 98%. The event of the positive reaction is denoted as $E$, and the corresponding probability is denoted as

$$P\left(E\right) = 0.98 \qquad (1)$$

Note that we consider only sick people here. The high percentage is good in itself, but it is not enough for the medical check. It is also important that the check does not react to non-sick people. We assume the probability that the non-sick people react to the medical check as

$$P\left(E\right) = 0.05 \qquad (2)$$

Figure 1. Reaction ratio of sick and non-sick people.

The sick people are denoted as $A$ and the non-sick people as $B$. The probability of 98% in Eq. (1) is the one among the sick people, and hence the notation should be modified to express the situation clearly. We then denote Eq. (1) as

$$P\left(E \mid A\right) = 0.98 \qquad (3)$$

The $A$ on the right side in Eq. (3) expresses the population within which the probability of the event $E$ is given. The probability for the non-sick people is then denoted as $P\left(E \mid B\right)$ and is expressed as

$$P\left(E \mid B\right) = 0.05 \qquad (4)$$

This expresses that the event $E$ occurs for the non-sick people $B$. The medical check is required to have a high $P\left(E \mid A\right)$ and a low $P\left(E \mid B\right)$. Therefore, this example is a reasonable one.

We further need the ratio of the sick people to the total, which is denoted as $P\left(A\right)$, and we assume it to be 5% here. Therefore, we have

$$P\left(A\right) = 0.05 \qquad (5)$$

There are only sick or non-sick people, and hence the ratio of the non-sick people is given by

$$P\left(B\right) = 1 - P\left(A\right) = 0.95 \qquad (6)$$

Strictly speaking, $P\left(A\right)$ means the probability that the event $A$ occurs among the whole set of events, which is $A$ and $B$ in this case. We denote the whole set as $\Omega$. Therefore, $P\left(A\right)$ should be expressed as

$$P\left(A\right) = P\left(A \mid \Omega\right) \qquad (7)$$

We frequently omit this $\Omega$. This is an implicit assumption and simplifies the expression. We then have all the information associated with this subject. The ratio of people who obtain a positive reaction in the medical check is given by

$$P\left(E\right) = P\left(E \mid A\right)P\left(A\right) + P\left(E \mid B\right)P\left(B\right) = 0.98 \times 0.05 + 0.05 \times 0.95 = 0.0965 \qquad (8)$$
Note that this total ratio is as low as about 10%. However, this ratio is not so interesting in itself; it simply means that the ratio of sick people is low.

We now study the probability that the Bayes' theorem treats. We select one person from the set, perform the medical check, and obtain a positive reaction. What is the probability that the person is really sick? The probability is denoted as $P\left(A \mid E\right)$. The corresponding situation is shown in Figure 2. The probability is the ratio of the positively reacting people in $A$ to the total reacting people; we do not care about the non-reacting people, since we discuss the situation under the condition that we obtained a positive reaction. The ratio of the positively reacting people in $A$ is given by

$$P\left(E \mid A\right)P\left(A\right) \qquad (9)$$

The ratio of the positively reacting people in $B$ is given by

$$P\left(E \mid B\right)P\left(B\right) \qquad (10)$$

Therefore, the target probability is evaluated as

$$P\left(A \mid E\right) = \frac{P\left(E \mid A\right)P\left(A\right)}{P\left(E \mid A\right)P\left(A\right) + P\left(E \mid B\right)P\left(B\right)} = \frac{0.98 \times 0.05}{0.98 \times 0.05 + 0.05 \times 0.95} \simeq 0.51 \qquad (11)$$

This is called the Bayes' probability, which expresses the probability of a cause given the obtained result. This may be a rather lower value than we expect, since $P\left(E \mid A\right) = 0.98$. The interpretation of the result is as follows. The constitution of the positively reacting people is schematically shown in Figure 2. The evaluated ratio is that of the reacting sick people to the total reacting people. Although the reaction ratio for the non-sick people is low, the number of non-sick people is much larger than that of sick people. Therefore, the number of reacting people among the non-sick is comparable with the number of reacting people among the sick. Consequently, we obtain a rather low ratio.

Figure 2. Reaction ratio of sick and non-sick people.

We can easily extend this theory to a general case where the causes associated with the obtained event are many. The $m$ kinds of causes are denoted as $A_i$ ($i = 1, 2, \ldots, m$). When we have an event $E$, the probability that the cause of the event is $A_i$ is expressed by

$$P\left(A_i \mid E\right) = \frac{P\left(E \mid A_i\right)P\left(A_i\right)}{P\left(E\right)} = \frac{P\left(E \mid A_i\right)P\left(A_i\right)}{\sum_{j=1}^{m}P\left(E \mid A_j\right)P\left(A_j\right)} \qquad (12)$$

This is called a Bayes' theorem.
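Eq. (12) is a few lines in any language. A minimal Python sketch (the helper name `posterior` is our own choice); it reproduces Eq. (11) for the medical-check numbers:

```python
def posterior(prior, likelihood):
    """Sketch of Eq. (12): posterior probabilities of causes A_i given an
    event E; prior[i] = P(A_i), likelihood[i] = P(E | A_i)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)                 # P(E), the denominator of Eq. (12)
    return [j / total for j in joint]

# Medical-check example, Eq. (11):
print(posterior([0.05, 0.95], [0.98, 0.05]))   # -> [0.51..., 0.49...]
```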
3. SIMILAR FAMOUS EXAMPLES

There are two other famous examples of the Bayes' theorem, which we introduce here briefly.

3.1. Three Prisoners

There are three prisoners, A, B, and C. One of them has been chosen to be pardoned; the selection is random and we do not know beforehand who it is. After the decision, the jailer knows who is pardoned. The prisoner A asks the jailer to name the unpardoned one among the prisoners B and C. The jailer answers the prisoner A that the prisoner B is unpardoned.

The prisoner A thought that the probability that he is pardoned was $1/3$ before he heard the jailer's answer. The prisoner A then heard that the prisoner B is unpardoned, and hence either prisoner A or C is pardoned. Therefore, the prisoner A thought that the probability that he (the prisoner A) is pardoned increased from $1/3$ to $1/2$. In this problem, we assume that the jailer is honest and names the prisoner B or C with the same probability of $1/2$ when the prisoner A is pardoned. We study whether the above discussion is valid or not.

The events are denoted as follows:

A: The prisoner A is pardoned.
B: The prisoner B is pardoned.
C: The prisoner C is pardoned.
$S_A$: The jailer says that the prisoner A is not pardoned.
$S_B$: The jailer says that the prisoner B is not pardoned.
$S_C$: The jailer says that the prisoner C is not pardoned.

The probability that each member is pardoned is the same and is given by

$$P\left(A\right) = P\left(B\right) = P\left(C\right) = \frac{1}{3} \qquad (13)$$

We assume that the jailer is honest and does not tell a lie, as mentioned before, and hence

$$P\left(S_B \mid B\right) = 0 \qquad (14)$$

$$P\left(S_B \mid C\right) = 1 \qquad (15)$$

Let us consider the case where the prisoner A is pardoned. We then have

$$P\left(S_B \mid A\right) = \frac{1}{2} \qquad (16)$$

$$P\left(S_C \mid A\right) = \frac{1}{2} \qquad (17)$$

The probability $P\left(A \mid S_B\right)$ is given by

$$P\left(A \mid S_B\right) = \frac{P\left(S_B \mid A\right)P\left(A\right)}{P\left(S_B \mid A\right)P\left(A\right) + P\left(S_B \mid B\right)P\left(B\right) + P\left(S_B \mid C\right)P\left(C\right)} = \frac{\dfrac{1}{2} \times \dfrac{1}{3}}{\dfrac{1}{2} \times \dfrac{1}{3} + 0 \times \dfrac{1}{3} + 1 \times \dfrac{1}{3}} = \frac{1}{3} \qquad (18)$$

which is not $1/2$. Therefore, the probability that the prisoner A is pardoned is unchanged.
3.2. A Monty Hall Problem

We treated the Monty Hall problem in Chapter 2 of volume 1, and we study the problem again from the standpoint of the Bayes' theorem. We explain the problem again. There are three rooms, where a car is in one room and goats are in the others. The doors are shut at the beginning. We call the room and the door where the car is the winning ones. The guest selects one room first and does not check whether it is a winning one or not, that is, the door is not opened. Monty Hall checks the other rooms. There must be at least one non-winning room among them, that is, a room where a goat is. Monty Hall deletes a non-winning room, that is, he opens a door where a goat is. The problem is whether the guest should change the selected room or keep his first selection.

The door that the guest selects is denoted as A, the door that Monty Hall opens is denoted as C, and the remaining door is denoted as B. The probability that the door A is the winning one is denoted as $P\left(A\right)$, that for the door B as $P\left(B\right)$, and that for the door C as $P\left(C\right)$. The event that Monty Hall opens the door C is denoted as $D$. The important probabilities are $P\left(D \mid A\right)$ and $P\left(D \mid B\right)$.

$P\left(D \mid A\right)$ is the probability that Monty Hall opens the door C under the condition that the door A is the winning one. In this case, the non-winning doors are B and C, and hence the probability that Monty Hall opens the door C is $1/2$. Therefore, we obtain

$$P\left(D \mid A\right) = \frac{1}{2} \qquad (19)$$

$P\left(D \mid B\right)$ is the probability that Monty Hall opens the door C under the condition that the door B is the winning one. In this case, the only door he can open is C, and hence the probability that he opens the door C is 1. Therefore, we obtain

$$P\left(D \mid B\right) = 1 \qquad (20)$$

Finally, we can evaluate the winning probability when the guest does not change the room as

$$P\left(A \mid D\right) = \frac{P\left(D \mid A\right)P\left(A\right)}{P\left(D \mid A\right)P\left(A\right) + P\left(D \mid B\right)P\left(B\right)} = \frac{\dfrac{1}{2} \times \dfrac{1}{3}}{\dfrac{1}{2} \times \dfrac{1}{3} + 1 \times \dfrac{1}{3}} = \frac{1}{3} \qquad (21)$$

If the guest changes the selected room from A to B, the corresponding probability is given by

$$P\left(B \mid D\right) = \frac{P\left(D \mid B\right)P\left(B\right)}{P\left(D \mid A\right)P\left(A\right) + P\left(D \mid B\right)P\left(B\right)} = \frac{1 \times \dfrac{1}{3}}{\dfrac{1}{2} \times \dfrac{1}{3} + 1 \times \dfrac{1}{3}} = \frac{2}{3} \qquad (22)$$

Therefore, the guest should change the room from A to B.

What is the difference between the three-prisoner problem and the Monty Hall problem? One can change the decision in the Monty Hall problem after getting the information, while one cannot change it in the three-prisoner problem. If the prisoner A could be exchanged with the prisoner B, the situation would become identical to the Monty Hall problem.
4. BAYESIAN UPDATING

The probabilities shown up to here may go against our intuition. However, one of our targets is to specify the cause related to an event. For example, the resulting probability in the medical check was about 0.5 even when we obtained a positive reaction. The value of the probability may be interesting in itself; however, what can we do with a probability of 0.5? We need probability values close to 1 or 0 to take further action. Bayesian updating is a procedure for driving the probability values toward such extremes.

4.1. Medical Check

Let us consider the situation of the sick-people example in more detail. We obtain a positive reaction $E$ in the medical check, and the probability for the person to be sick is given by

$$P\left(A \mid E\right) = \frac{P\left(E \mid A\right)P\left(A\right)}{P\left(E \mid A\right)P\left(A\right) + P\left(E \mid B\right)P\left(B\right)} = \frac{0.98 \times 0.05}{0.98 \times 0.05 + 0.05 \times 0.95} \simeq 0.51 \qquad (23)$$

We can also evaluate the probability for the person to be non-sick after we obtain the event $E$ as

$$P\left(B \mid E\right) = \frac{P\left(E \mid B\right)P\left(B\right)}{P\left(E \mid A\right)P\left(A\right) + P\left(E \mid B\right)P\left(B\right)} = \frac{0.05 \times 0.95}{0.98 \times 0.05 + 0.05 \times 0.95} \simeq 0.49 \qquad (24)$$

We do not know the details of the checked person before the medical check. Therefore, we use general probability values for $P\left(A\right)$ and $P\left(B\right)$. However, we now know that his medical check result is positive. Therefore, we can regard $P\left(A \mid E\right)$ and $P\left(B \mid E\right)$ as the $P\left(A\right)$ and $P\left(B\right)$ for this person, that is,

$$P\left(A \mid E\right) \rightarrow P\left(A\right) \qquad (25)$$

$$P\left(B \mid E\right) \rightarrow P\left(B\right) \qquad (26)$$

The probabilities $P\left(A\right) = 0.05$ and $P\left(B\right) = 0.95$ before the medical check are called priori probabilities, and the probabilities $P\left(A \mid E\right) \rightarrow P\left(A\right) = 0.51$ and $P\left(B \mid E\right) \rightarrow P\left(B\right) = 0.49$ are called posterior probabilities, as shown in Figure 3.

Figure 3. Priori and posterior probability where we have a positive reaction in the medical check.
4.2. Double Medical Checks

We cannot act decisively with the posterior probabilities of Eqs. (25) and (26), since they are not extreme values. We try the same medical check again, and assume that we again obtain a positive reaction. This is not the same situation as the first time: we now use the posterior probabilities as the priori ones. We obtain a positive reaction $E$ in the second medical check, and the probability for the person to be sick is given by

$$P\left(A \mid E\right) = \frac{P\left(E \mid A\right)P\left(A\right)}{P\left(E \mid A\right)P\left(A\right) + P\left(E \mid B\right)P\left(B\right)} = \frac{0.98 \times 0.51}{0.98 \times 0.51 + 0.05 \times 0.49} \simeq 0.95 \qquad (27)$$

We can also evaluate the probability for the person to be non-sick after we obtain the event $E$ as

$$P\left(B \mid E\right) = \frac{P\left(E \mid B\right)P\left(B\right)}{P\left(E \mid A\right)P\left(A\right) + P\left(E \mid B\right)P\left(B\right)} = \frac{0.05 \times 0.49}{0.98 \times 0.51 + 0.05 \times 0.49} \simeq 0.05 \qquad (28)$$

Therefore, the person should proceed to further treatment, as shown in Figure 4.

Figure 4. Priori and posterior probability after we have a positive reaction in the second medical check.
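The updating procedure is Eq. (12) applied repeatedly, with each posterior fed back as the next prior. A minimal sketch, reusing the hypothetical `posterior` helper defined after Eq. (12):

```python
p = [0.05, 0.95]                        # priori: P(A), P(B)
for _ in range(2):                      # two positive reactions in a row
    p = posterior(p, [0.98, 0.05])      # P(E|A), P(E|B); Eqs. (23)-(28)
print(p)                                # -> approx. [0.95, 0.05]
```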
4.3. Two Kinds of Medical Checks

We consider two kinds of medical checks here. The first one is a relatively rough estimation, and the second an accurate one, as shown in Figure 5. The reaction probabilities for the test 1 are denoted as $P_1\left(E \mid A\right) = 0.8$ and $P_1\left(E \mid B\right) = 0.1$, and those for the test 2 as $P_2\left(E \mid A\right) = 0.98$ and $P_2\left(E \mid B\right) = 0.03$. We assume the prior probabilities of $P\left(A\right) = 0.05$ and $P\left(B\right) = 0.95$.

Figure 5. Two kinds of medical check. The test 1 is a relatively rough check compared with the test 2, since the related probabilities are not so extreme.

We perform the test 1 medical check first. If we obtain a negative reaction $\bar E$, we can evaluate the posterior probabilities as

$$P_1\left(A \mid \bar E\right) = \frac{P_1\left(\bar E \mid A\right)P\left(A\right)}{P_1\left(\bar E \mid A\right)P\left(A\right) + P_1\left(\bar E \mid B\right)P\left(B\right)} = \frac{0.2 \times 0.05}{0.2 \times 0.05 + 0.90 \times 0.95} \simeq 0.01 \qquad (29)$$

$$P_1\left(B \mid \bar E\right) = \frac{P_1\left(\bar E \mid B\right)P\left(B\right)}{P_1\left(\bar E \mid A\right)P\left(A\right) + P_1\left(\bar E \mid B\right)P\left(B\right)} = \frac{0.90 \times 0.95}{0.2 \times 0.05 + 0.90 \times 0.95} \simeq 0.99 \qquad (30)$$

Therefore, we can convincingly judge that the person is not sick, and finish the medical check. The corresponding probability update is shown in Figure 6.

Figure 6. Priori and posterior probability after we have a negative reaction in the test 1 medical check.

We then consider a case where we obtain a positive reaction in the test 1 medical check. The corresponding posterior probabilities are given by

$$P_1\left(A \mid E\right) = \frac{P_1\left(E \mid A\right)P\left(A\right)}{P_1\left(E \mid A\right)P\left(A\right) + P_1\left(E \mid B\right)P\left(B\right)} = \frac{0.80 \times 0.05}{0.80 \times 0.05 + 0.10 \times 0.95} \simeq 0.30 \qquad (31)$$

$$P_1\left(B \mid E\right) = \frac{P_1\left(E \mid B\right)P\left(B\right)}{P_1\left(E \mid A\right)P\left(A\right) + P_1\left(E \mid B\right)P\left(B\right)} = \frac{0.10 \times 0.95}{0.80 \times 0.05 + 0.10 \times 0.95} \simeq 0.70 \qquad (32)$$

The corresponding Bayesian updating process is shown in Figure 7. We cannot judge clearly with these posterior probabilities, and hence perform the test 2 medical check.

Figure 7. Priori and posterior probability after we have a positive reaction in the test 1 medical check.

We use Eqs. (31) and (32) as the priori probabilities for the test 2. If we obtain a positive reaction in this medical check, the posterior probabilities are evaluated as follows:

$$P_2\left(A \mid E\right) = \frac{P_2\left(E \mid A\right)P\left(A\right)}{P_2\left(E \mid A\right)P\left(A\right) + P_2\left(E \mid B\right)P\left(B\right)} = \frac{0.98 \times 0.30}{0.98 \times 0.30 + 0.03 \times 0.70} \simeq 0.93 \qquad (33)$$

$$P_2\left(B \mid E\right) = \frac{P_2\left(E \mid B\right)P\left(B\right)}{P_2\left(E \mid A\right)P\left(A\right) + P_2\left(E \mid B\right)P\left(B\right)} = \frac{0.03 \times 0.70}{0.98 \times 0.30 + 0.03 \times 0.70} \simeq 0.07 \qquad (34)$$

Therefore, the person is probably sick. If we obtain a negative reaction in this medical check, the posterior probabilities are evaluated as follows:

$$P_2\left(A \mid \bar E\right) = \frac{P_2\left(\bar E \mid A\right)P\left(A\right)}{P_2\left(\bar E \mid A\right)P\left(A\right) + P_2\left(\bar E \mid B\right)P\left(B\right)} = \frac{0.02 \times 0.30}{0.02 \times 0.30 + 0.97 \times 0.70} \simeq 0.01 \qquad (35)$$

$$P_2\left(B \mid \bar E\right) = \frac{P_2\left(\bar E \mid B\right)P\left(B\right)}{P_2\left(\bar E \mid A\right)P\left(A\right) + P_2\left(\bar E \mid B\right)P\left(B\right)} = \frac{0.97 \times 0.70}{0.02 \times 0.30 + 0.97 \times 0.70} \simeq 0.99 \qquad (36)$$

Therefore, the person is probably non-sick.
4.4. A Punctual Person Problem

We can treat some ambiguous problems using the procedure shown in the previous section. The theory is exactly the same as before, but we should define the ambiguous problem by making some assumptions.

Let us consider the case of a punctual person, where we want to decide whether a person is punctual or not. The ambiguous point in this problem is that a punctual person is not clearly defined in general. We feel that a punctual person is apt to be on time with high probability, and a non-punctual person is not apt to keep time. The probability is not 1 or 0, but some value between 0 and 1 in both cases. We need to define punctuality quantitatively, although it is rather difficult. We form two groups: one group consists of persons who are thought to be punctual, and the other of persons who are thought to be not punctual. We then gather data of being on time or off time, and finally obtain data such as shown in Table 1. This is our definition of punctual or not punctual. That is,

Punctual: on-time probability 4/5; off-time probability 1/5
Not punctual: on-time probability 1/4; off-time probability 3/4

Table 1. Definition of punctual or not punctual

Character            Off time ($D_1$)   On time ($D_2$)
Not punctual ($H_1$)    3/4                1/4
Punctual ($H_2$)        1/5                4/5

We investigate whether a person S is punctual or not. The not-punctual case is denoted as $H_1$, and the punctual case as $H_2$. Being off time is denoted as $D_1$ and on time as $D_2$, as also shown in Table 1. S's record in the recent five meetings is $D_2, D_2, D_1, D_1, D_1$.

Step 0. We set the initial values for the member S. We have an impression that he is not punctual, and hence set the initial values as

$$P\left(H_1\right) = 0.6 \qquad (37)$$

$$P\left(H_2\right) = 0.4 \qquad (38)$$

Step 1. Next we re-evaluate S considering the obtained data. In the first meeting, he was on time, and hence we have

$$P\left(H_1 \mid D_2\right) = \frac{P\left(D_2 \mid H_1\right)P\left(H_1\right)}{P\left(D_2 \mid H_1\right)P\left(H_1\right) + P\left(D_2 \mid H_2\right)P\left(H_2\right)} = \frac{\dfrac{1}{4} \times 0.6}{\dfrac{1}{4} \times 0.6 + \dfrac{4}{5} \times 0.4} \simeq 0.319 \qquad (39)$$

$$P\left(H_2 \mid D_2\right) = \frac{P\left(D_2 \mid H_2\right)P\left(H_2\right)}{P\left(D_2 \mid H_1\right)P\left(H_1\right) + P\left(D_2 \mid H_2\right)P\left(H_2\right)} = \frac{\dfrac{4}{5} \times 0.4}{\dfrac{1}{4} \times 0.6 + \dfrac{4}{5} \times 0.4} \simeq 0.681 \qquad (40)$$

Therefore, the updated $P\left(H_1\right)$ and $P\left(H_2\right)$ are

$$P\left(H_1\right) = 0.319 \qquad (41)$$

$$P\left(H_2\right) = 0.681 \qquad (42)$$

Step 2. He was also on time in the second meeting, and we have

$$P\left(H_1 \mid D_2\right) = \frac{\dfrac{1}{4} \times 0.319}{\dfrac{1}{4} \times 0.319 + \dfrac{4}{5} \times 0.681} \simeq 0.128 \qquad (43)$$

$$P\left(H_2 \mid D_2\right) = \frac{\dfrac{4}{5} \times 0.681}{\dfrac{1}{4} \times 0.319 + \dfrac{4}{5} \times 0.681} \simeq 0.872 \qquad (44)$$

Therefore, the updated probabilities are

$$P\left(H_1\right) = 0.128 \qquad (45)$$

$$P\left(H_2\right) = 0.872 \qquad (46)$$

Step 3. He was off time in the third meeting, and we have

$$P\left(H_1 \mid D_1\right) = \frac{\dfrac{3}{4} \times 0.128}{\dfrac{3}{4} \times 0.128 + \dfrac{1}{5} \times 0.872} \simeq 0.355 \qquad (47)$$

$$P\left(H_2 \mid D_1\right) = \frac{\dfrac{1}{5} \times 0.872}{\dfrac{3}{4} \times 0.128 + \dfrac{1}{5} \times 0.872} \simeq 0.645 \qquad (48)$$

Therefore, the updated probabilities are

$$P\left(H_1\right) = 0.355 \qquad (49)$$

$$P\left(H_2\right) = 0.645 \qquad (50)$$

Step 4. He was off time in the fourth meeting, and we have

$$P\left(H_1 \mid D_1\right) = \frac{\dfrac{3}{4} \times 0.355}{\dfrac{3}{4} \times 0.355 + \dfrac{1}{5} \times 0.645} \simeq 0.674 \qquad (51)$$

$$P\left(H_2 \mid D_1\right) = \frac{\dfrac{1}{5} \times 0.645}{\dfrac{3}{4} \times 0.355 + \dfrac{1}{5} \times 0.645} \simeq 0.326 \qquad (52)$$

Therefore, the updated probabilities are

$$P\left(H_1\right) = 0.674 \qquad (53)$$

$$P\left(H_2\right) = 0.326 \qquad (54)$$

Step 5. He was off time in the fifth meeting, and we have

$$P\left(H_1 \mid D_1\right) = \frac{\dfrac{3}{4} \times 0.674}{\dfrac{3}{4} \times 0.674 + \dfrac{1}{5} \times 0.326} \simeq 0.886 \qquad (55)$$

$$P\left(H_2 \mid D_1\right) = \frac{\dfrac{1}{5} \times 0.326}{\dfrac{3}{4} \times 0.674 + \dfrac{1}{5} \times 0.326} \simeq 0.114 \qquad (56)$$

Therefore, the updated probabilities are

$$P\left(H_1\right) = 0.886 \qquad (57)$$

$$P\left(H_2\right) = 0.114 \qquad (58)$$

Therefore, we are convinced that he is not punctual, based on the data. It should be noted that the results depend on the definition of punctual people. However, we can treat such ambiguous issues quantitatively using this procedure.
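The five steps above can be reproduced with the same updating loop; a sketch, again reusing the hypothetical `posterior` helper and the Table 1 probabilities:

```python
# Sketch of Section 4.4: five Bayesian updates with the Table 1 definition.
p = [0.6, 0.4]                                  # P(H1), P(H2), Eqs. (37)-(38)
like = {'D1': [3/4, 1/5], 'D2': [1/4, 4/5]}     # P(D|H1), P(D|H2)
for d in ['D2', 'D2', 'D1', 'D1', 'D1']:        # S's record
    p = posterior(p, like[d])                   # helper from Eq. (12)
    print(d, [round(v, 3) for v in p])
```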
4.5. A Spam Mail Problem

We receive many mails every day, and among them there are sometimes spam mails. We want to separate the spam mails by evaluating the words in each mail. This is one famous example that the Bayesian updating procedure accommodates. We make a dictionary in which keywords and the related probabilities are listed as word 1, 2, 3, ...; three of them are shown in Table 2.

Table 2. Dictionary for evaluating spam mails

Word   Spam   Non-spam
W01    rs1    rns1
W02    rs2    rns2
W03    rs3    rns3

We denote the event that a mail is spam as $S$, and that it is non-spam as $\bar S$. We assume that we find the words 1, 2, and 3 in a mail, and want to evaluate whether the mail is spam or not. We need to define the initial probabilities $P\left(S\right)$ and $P\left(\bar S\right)$. We assume that we know $P\left(S\right) = r_{s0}$, and therefore $P\left(\bar S\right) = 1 - r_{s0}$.

We find the word 1, and the corresponding probabilities are given by

$$P\left(S \mid W01\right) = \frac{P\left(W01 \mid S\right)P\left(S\right)}{P\left(W01 \mid S\right)P\left(S\right) + P\left(W01 \mid \bar S\right)P\left(\bar S\right)} = \frac{r_{s1}r_{s0}}{r_{s1}r_{s0} + r_{ns1}\left(1 - r_{s0}\right)} \rightarrow P\left(S\right) \qquad (59)$$

$$P\left(\bar S \mid W01\right) = \frac{P\left(W01 \mid \bar S\right)P\left(\bar S\right)}{P\left(W01 \mid S\right)P\left(S\right) + P\left(W01 \mid \bar S\right)P\left(\bar S\right)} = \frac{r_{ns1}\left(1 - r_{s0}\right)}{r_{s1}r_{s0} + r_{ns1}\left(1 - r_{s0}\right)} \rightarrow P\left(\bar S\right) \qquad (60)$$

We find the word 2, and the corresponding probabilities are given by

$$P\left(S \mid W02\right) = \frac{P\left(W02 \mid S\right)P\left(S\right)}{P\left(W02 \mid S\right)P\left(S\right) + P\left(W02 \mid \bar S\right)P\left(\bar S\right)} = \frac{r_{s2}r_{s1}r_{s0}}{r_{s2}r_{s1}r_{s0} + r_{ns2}r_{ns1}\left(1 - r_{s0}\right)} \rightarrow P\left(S\right) \qquad (61)$$

$$P\left(\bar S \mid W02\right) = \frac{r_{ns2}r_{ns1}\left(1 - r_{s0}\right)}{r_{s2}r_{s1}r_{s0} + r_{ns2}r_{ns1}\left(1 - r_{s0}\right)} \rightarrow P\left(\bar S\right) \qquad (62)$$

We finally find the word 3, and the corresponding probabilities are given by

$$P\left(S \mid W03\right) = \frac{r_{s3}r_{s2}r_{s1}r_{s0}}{r_{s3}r_{s2}r_{s1}r_{s0} + r_{ns3}r_{ns2}r_{ns1}\left(1 - r_{s0}\right)} \rightarrow P\left(S\right) \qquad (63)$$

$$P\left(\bar S \mid W03\right) = \frac{r_{ns3}r_{ns2}r_{ns1}\left(1 - r_{s0}\right)}{r_{s3}r_{s2}r_{s1}r_{s0} + r_{ns3}r_{ns2}r_{ns1}\left(1 - r_{s0}\right)} \rightarrow P\left(\bar S\right) \qquad (64)$$

We can easily extend this to the general case where we find $m$ kinds of keywords in a mail. The probabilities that the mail is spam or non-spam are given by

$$P\left(S\right) = \frac{r_{s0}\prod_{i=1}^{m}r_{si}}{r_{s0}\prod_{i=1}^{m}r_{si} + \left(1 - r_{s0}\right)\prod_{i=1}^{m}r_{nsi}} \qquad (65)$$

$$P\left(\bar S\right) = \frac{\left(1 - r_{s0}\right)\prod_{i=1}^{m}r_{nsi}}{r_{s0}\prod_{i=1}^{m}r_{si} + \left(1 - r_{s0}\right)\prod_{i=1}^{m}r_{nsi}} \qquad (66)$$

It should be noted that we neglect the correlation between the words in the above analysis.
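Eqs. (65)–(66) are usually evaluated in log space so that long products do not underflow; a minimal sketch with hypothetical dictionary rates (the function name and the example numbers are our own):

```python
import math

def spam_probability(rs0, rs, rns):
    """Sketch of Eqs. (65)-(66): naive-Bayes spam probability; rs[i] and
    rns[i] are the dictionary rates of word i in spam and non-spam mail."""
    log_s = math.log(rs0) + sum(math.log(r) for r in rs)
    log_ns = math.log(1 - rs0) + sum(math.log(r) for r in rns)
    return 1 / (1 + math.exp(log_ns - log_s))    # P(S) of Eq. (65)

# Hypothetical dictionary rates for three keywords found in a mail:
print(spam_probability(0.5, rs=[0.8, 0.6, 0.7], rns=[0.1, 0.2, 0.05]))
```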
4.6. A Vase Problem

We consider a vase problem, which is frequently treated and is convenient for extending the formula. There are two kinds of vases containing black and gray balls, and we know the ratio of the black ball number to the total for each. We select one vase, not knowing which one it is. We take a ball from the vase and return it. Gathering such data, we want to guess which vase we are using.

Figure 8. A vase problem with two kinds of balls.

Let us consider two vases A and B as shown in Figure 8. The vase A contains balls with a gray-to-black ratio of 4:1, and B with a ratio of 2:3. We express the event of obtaining a black ball as $E$ and a gray ball as $\bar E$, and hence the probabilities are expressed as

$$P\left(E \mid A\right) = \frac{1}{5} \qquad (67)$$

$$P\left(E \mid B\right) = \frac{3}{5} \qquad (68)$$

These probabilities characterize the vases. The probabilities for each vase when we obtain a black ball are given by

$$P\left(A \mid E\right) = \frac{P\left(E \mid A\right)P\left(A\right)}{P\left(E \mid A\right)P\left(A\right) + P\left(E \mid B\right)P\left(B\right)} \qquad (69)$$

$$P\left(B \mid E\right) = \frac{P\left(E \mid B\right)P\left(B\right)}{P\left(E \mid A\right)P\left(A\right) + P\left(E \mid B\right)P\left(B\right)} \qquad (70)$$

The probabilities when we obtain a gray ball are given by

$$P\left(A \mid \bar E\right) = \frac{P\left(\bar E \mid A\right)P\left(A\right)}{P\left(\bar E \mid A\right)P\left(A\right) + P\left(\bar E \mid B\right)P\left(B\right)} \qquad (71)$$

$$P\left(B \mid \bar E\right) = \frac{P\left(\bar E \mid B\right)P\left(B\right)}{P\left(\bar E \mid A\right)P\left(A\right) + P\left(\bar E \mid B\right)P\left(B\right)} \qquad (72)$$

We select one vase and cannot distinguish which vase it is by its appearance. Therefore, one of the probabilities $P\left(A\right)$ or $P\left(B\right)$ is really 1 and the other is 0, but we do not know which. We pick up one ball and return it to the vase, perform this five times, and obtain the results $\bar E, E, E, \bar E, E$. The number of black balls is larger, and hence we can guess that the vase B is more plausible. We want to evaluate the probability more quantitatively. The procedure is as follows.
Step 0. If we have some information associated with the numbers of vases A and B, and hence their ratio, we can use it as the initial probabilities $P\left(A\right)$ and $P\left(B\right)$. If we do not have any such information, we assume the probabilities

$$P\left(A\right) = P\left(B\right) = \frac{1}{2} \qquad (73)$$

Step 1. We obtain a gray ball $\bar E$ in the first trial, and hence the corresponding probabilities are given by

$$P\left(A \mid \bar E\right) = \frac{P\left(\bar E \mid A\right)P\left(A\right)}{P\left(\bar E \mid A\right)P\left(A\right) + P\left(\bar E \mid B\right)P\left(B\right)} = \frac{\dfrac{4}{5} \times \dfrac{1}{2}}{\dfrac{4}{5} \times \dfrac{1}{2} + \dfrac{2}{5} \times \dfrac{1}{2}} \simeq 0.667 \qquad (74)$$

$$P\left(B \mid \bar E\right) = \frac{P\left(\bar E \mid B\right)P\left(B\right)}{P\left(\bar E \mid A\right)P\left(A\right) + P\left(\bar E \mid B\right)P\left(B\right)} = \frac{\dfrac{2}{5} \times \dfrac{1}{2}}{\dfrac{4}{5} \times \dfrac{1}{2} + \dfrac{2}{5} \times \dfrac{1}{2}} \simeq 0.333 \qquad (75)$$

We regard these as the new $P\left(A\right)$ and $P\left(B\right)$, that is,

$$P\left(A\right) = P\left(A \mid \bar E\right) = 0.667, \qquad P\left(B\right) = P\left(B \mid \bar E\right) = 0.333 \qquad (76)$$

Step 2. We obtain a black ball $E$ in the second trial, and hence the corresponding probabilities are given by

$$P\left(A \mid E\right) = \frac{\dfrac{1}{5} \times 0.667}{\dfrac{1}{5} \times 0.667 + \dfrac{3}{5} \times 0.333} = 0.4 \qquad (77)$$

$$P\left(B \mid E\right) = \frac{\dfrac{3}{5} \times 0.333}{\dfrac{1}{5} \times 0.667 + \dfrac{3}{5} \times 0.333} = 0.6 \qquad (78)$$

We regard these as $P\left(A\right)$ and $P\left(B\right)$, that is,

$$P\left(A\right) = 0.4, \qquad P\left(B\right) = 0.6 \qquad (79)$$

Step 3. We obtain a black ball $E$ in the third trial, and hence the corresponding probabilities are given by

$$P\left(A \mid E\right) = \frac{\dfrac{1}{5} \times 0.4}{\dfrac{1}{5} \times 0.4 + \dfrac{3}{5} \times 0.6} \simeq 0.182 \qquad (80)$$

$$P\left(B \mid E\right) = \frac{\dfrac{3}{5} \times 0.6}{\dfrac{1}{5} \times 0.4 + \dfrac{3}{5} \times 0.6} \simeq 0.818 \qquad (81)$$

We regard these as $P\left(A\right)$ and $P\left(B\right)$, that is,

$$P\left(A\right) = 0.182, \qquad P\left(B\right) = 0.818 \qquad (82)$$

Step 4. We obtain a gray ball $\bar E$ in the fourth trial, and hence the corresponding probabilities are given by

$$P\left(A \mid \bar E\right) = \frac{\dfrac{4}{5} \times 0.182}{\dfrac{4}{5} \times 0.182 + \dfrac{2}{5} \times 0.818} \simeq 0.308 \qquad (83)$$

$$P\left(B \mid \bar E\right) = \frac{\dfrac{2}{5} \times 0.818}{\dfrac{4}{5} \times 0.182 + \dfrac{2}{5} \times 0.818} \simeq 0.692 \qquad (84)$$

We regard these as $P\left(A\right)$ and $P\left(B\right)$, that is,

$$P\left(A\right) = 0.308, \qquad P\left(B\right) = 0.692 \qquad (85)$$

Step 5. We obtain a black ball $E$ in the fifth trial, and hence the corresponding probabilities are given by

$$P\left(A \mid E\right) = \frac{\dfrac{1}{5} \times 0.308}{\dfrac{1}{5} \times 0.308 + \dfrac{3}{5} \times 0.692} \simeq 0.129 \qquad (86)$$

$$P\left(B \mid E\right) = \frac{\dfrac{3}{5} \times 0.692}{\dfrac{1}{5} \times 0.308 + \dfrac{3}{5} \times 0.692} \simeq 0.871 \qquad (87)$$

We regard these as $P\left(A\right)$ and $P\left(B\right)$, that is,

$$P\left(A\right) = 0.129, \qquad P\left(B\right) = 0.871 \qquad (88)$$
The final probability $P\left(A\right)$ is much smaller than $P\left(B\right)$, and hence we guess that the vase is of type B. As we update the probability, its accuracy improves, as is shown later.

We treated two types of vases A and B here. However, we can easily extend this to many kinds $A_i$, where $i$ is $1, 2, \ldots, m$. The events $E_j$ are expressed by the conditional probabilities, which we need to know:

$$P\left(E_j \mid A_i\right) \qquad (89)$$

In this analysis, $E_j$ is not limited to $E$ or $\bar E$ as in the example. We can use any number of kinds of events, but we need to know the corresponding conditional probabilities of Eq. (89). We can set the initial probabilities as

$$P\left(A_i\right) = \frac{1}{m} \qquad (90)$$

If we obtain an event $E_1$, the corresponding probabilities under the event $E_1$ are given by

$$P\left(A_i \mid E_1\right) = \frac{P\left(E_1 \mid A_i\right)P\left(A_i\right)}{\sum_{j=1}^{m}P\left(E_1 \mid A_j\right)P\left(A_j\right)} \qquad (91)$$

We can update the probabilities $P\left(A_i\right)$ using this equation.
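The five draws above follow the same loop; a sketch, reusing the hypothetical `posterior` helper with the likelihoods of Eqs. (67)–(68):

```python
# Sketch of Eq. (91) applied to the vase record of Section 4.6.
p = [0.5, 0.5]                                  # P(A), P(B), Eq. (73)
black = [1/5, 3/5]                              # P(E|A), P(E|B)
gray = [4/5, 2/5]                               # complementary probabilities
for ball in [gray, black, black, gray, black]:  # the five draws
    p = posterior(p, ball)                      # helper from Eq. (12)
print([round(v, 3) for v in p])                 # -> [0.129, 0.871]
```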
Figure 9. A vase problem with three kinds of balls.
4.7. Extension of the Vase Problem

We can easily extend the vase problem by increasing the number of vases and the kinds of balls. We increase the vase number from 2 to 3, denoted as $A_1, A_2, A_3$, and the kinds of balls from black and gray to black, gray, and white, denoted as $E_1, E_2, E_3$, as shown in Figure 9. We characterize the vases by the ratios of each ball:

$$P\left(E_1 \mid A_1\right) = \frac{1}{5}; \quad P\left(E_2 \mid A_1\right) = \frac{2}{5}; \quad P\left(E_3 \mid A_1\right) = \frac{2}{5} \qquad (92)$$

$$P\left(E_1 \mid A_2\right) = \frac{3}{5}; \quad P\left(E_2 \mid A_2\right) = \frac{1}{5}; \quad P\left(E_3 \mid A_2\right) = \frac{1}{5} \qquad (93)$$

$$P\left(E_1 \mid A_3\right) = \frac{2}{5}; \quad P\left(E_2 \mid A_3\right) = \frac{0}{5}; \quad P\left(E_3 \mid A_3\right) = \frac{3}{5} \qquad (94)$$

If we have some information associated with the initial condition for $P\left(A_i\right)$, we should use it. If we do not have any such information, we assume the identical probabilities

$$P\left(A_1\right) = P\left(A_2\right) = P\left(A_3\right) = \frac{1}{3} \qquad (95)$$

If we obtain a certain ball, expressed as $E_i$, we can evaluate the posterior probabilities as

$$P\left(A_1 \mid E_i\right) = \frac{P\left(E_i \mid A_1\right)P\left(A_1\right)}{P\left(E_i \mid A_1\right)P\left(A_1\right) + P\left(E_i \mid A_2\right)P\left(A_2\right) + P\left(E_i \mid A_3\right)P\left(A_3\right)} \qquad (96)$$

$$P\left(A_2 \mid E_i\right) = \frac{P\left(E_i \mid A_2\right)P\left(A_2\right)}{P\left(E_i \mid A_1\right)P\left(A_1\right) + P\left(E_i \mid A_2\right)P\left(A_2\right) + P\left(E_i \mid A_3\right)P\left(A_3\right)} \qquad (97)$$

$$P\left(A_3 \mid E_i\right) = \frac{P\left(E_i \mid A_3\right)P\left(A_3\right)}{P\left(E_i \mid A_1\right)P\left(A_1\right) + P\left(E_i \mid A_2\right)P\left(A_2\right) + P\left(E_i \mid A_3\right)P\left(A_3\right)} \qquad (98)$$

We can easily extend the number of vases to $m$ and the number of kinds of balls to $l$. We can then evaluate the posterior probabilities as

$$P\left(A_k \mid E_i\right) = \frac{P\left(E_i \mid A_k\right)P\left(A_k\right)}{\sum_{j=1}^{m}P\left(E_i \mid A_j\right)P\left(A_j\right)} \qquad (99)$$

where $i = 1, 2, \ldots, l$. Note that we need to know the initial priori probabilities $P\left(A_j\right)$.
4.8. Comments on Prior Probability

In Bayesian updating, we need to know the prior probabilities. Let us consider this using the medical check example with two kinds of tests. We use the same probabilities associated with the medical check, but two different sets of prior probabilities. We assume that we obtain positive reactions for test 1 and test 2. The related posterior probabilities are given by

$$P(A|E_2 E_1) = \frac{P_2(E|A)P_1(E|A)P_0(A)}{P_2(E|A)P_1(E|A)P_0(A) + P_2(E|B)P_1(E|B)P_0(B)} \qquad (100)$$

$$P(B|E_2 E_1) = \frac{P_2(E|B)P_1(E|B)P_0(B)}{P_2(E|A)P_1(E|A)P_0(A) + P_2(E|B)P_1(E|B)P_0(B)} \qquad (101)$$

We use two different prior probability sets: $P_0(A) = 0.05$ and $P_0(B) = 0.95$, and $P_0(A) = 0.5$ and $P_0(B) = 0.5$. The related posterior probabilities for being sick are 0.93 and 1.00, respectively. The values are different, but there is no difference from the standpoint of judging what to do next. Therefore, the initial prior probabilities are not so important if we repeat the check many times. If we know the prior probabilities, we use them; if we do not know them in detail, we use plausible ones; and if we know nothing about them, we use the identical probability $1/m$, where $m$ is the number of causes.
4.9. Mathematical Study of Bayesian Updating

We observed that we obtain the proper probability when we repeat Bayesian updating. We study the corresponding mathematics here. We consider the vase problem. A vase A contains black and gray balls, and the ratio of black balls to the total is $p_1$. We have many other vases whose ratios are different from $p_1$. We do not know from which vase we take balls, although it is in fact vase A. When we perform $N$ trials, we can expect $p_1 N$ black balls and $(1 - p_1)N$ gray balls. What we should do is to prove that the probability has its maximum value for the vase A. The probability that we take a black ball is denoted as $p$ in general. The probability $f$ after the $N$ trials is given by

$$f(p) = p^{Np_1}(1 - p)^{N(1 - p_1)} \qquad (102)$$

We differentiate this equation with respect to $p$ and set it to 0, and obtain

$$\frac{df}{dp} = Np_1 p^{Np_1 - 1}(1-p)^{N(1-p_1)} - N(1-p_1) p^{Np_1}(1-p)^{N(1-p_1)-1} = N p^{Np_1 - 1}(1-p)^{N(1-p_1)-1}\left[p_1(1-p) - (1-p_1)p\right] = N p^{Np_1 - 1}(1-p)^{N(1-p_1)-1}\left(p_1 - p\right) = 0 \qquad (103)$$

Therefore, we obtain the maximum value for

$$p = p_1 \qquad (104)$$
Figure 10 shows the dependence of the probability on $p$ for various trial numbers $N$ and for $p_1 = 0.2$ (a) and $p_1 = 0.5$ (b). The probability has its maximum at $p_1$, and it becomes sharply localized around $p_1$ with increasing $N$. The features are the same for $p_1 = 0.2$ and 0.5. Therefore, one probability value becomes dominant compared with the others, and we can state clearly which vase it is as the number of updates increases.

Figure 10. Dependence of the normalized probability $f(p)/f(p_1)$ on the probability $p$ of taking a black ball, for $N = 10, 50, 100$: (a) $p_1 = 0.2$, (b) $p_1 = 0.5$.
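A quick numerical check of Eqs. (102)-(104) is sketched below: the likelihood $f(p)$ peaks at $p_1$ and sharpens as $N$ grows. The grid and variable names are our own.

```python
# Verify that f(p) = p^(N p1) (1-p)^(N(1-p1)) peaks at p = p1.
import numpy as np

p = np.linspace(0.001, 0.999, 999)
for N in (10, 50, 100):
    for p1 in (0.2, 0.5):
        f = p ** (N * p1) * (1 - p) ** (N * (1 - p1))
        print(f"N={N:3d} p1={p1}: argmax f(p) = {p[np.argmax(f)]:.3f}")
# Each argmax sits (up to grid resolution) at p1, as Eq. (104) states.
```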
5. A BAYESIAN NETWORK

In the real world, many events are related to each other, and some events are causes of other events. Further, the cause events are sometimes themselves the results of other events. Consequently, the related events can be expressed as a network. An event in the network is denoted as a node, and a conditional probability is assigned to each node; this is called a Bayesian network. Using the Bayesian network, we can predict the quantitative probability relating any two events in the network.
Figure 11. Schematic figure for medical check.
5.1. A Child Node and a Parent Node

Let us consider first the previous example of a medical check, as shown in Figure 11. We denote sick or non-sick by the event $A_i$: if the person is sick, we write $A_1 = A$, and if the person is non-sick, we write $A_0 = \bar{A}$. We denote a positive or negative reaction by the event $C_i$: if the person obtains a positive reaction, we write $C_1 = C$, and if the person obtains a negative reaction, we write $C_0 = \bar{C}$.

The event $A_i$ can be regarded as the cause of the event $C_i$, and the event $C_i$ is the result of the event $A_i$. The event $A_i$ is also called the parent event for the event $C_i$, and the event $C_i$ is also called the child event for the event $A_i$. The events $A_i$ and $C_i$ are called nodes in the network.
We also need the related probabilities of the nodes, which are given by

$$P(A) = 0.05 \qquad (105)$$

$$P(\bar{A}) = 1 - P(A) = 0.95 \qquad (106)$$

$$P(C|A) = 0.98 \qquad (107)$$

$$P(C|\bar{A}) = 0.05 \qquad (108)$$

We regard this as a Bayesian network. We regard $A_i$ as a cause of the event $C_i$. Therefore, the corresponding network is expressed as shown in Figure 12. Note that the subscript $i$ in $A_i$ and the $i$ in $C_i$ are independent of each other; we use this expression for simplicity, and it also applies to the other figures.
Figure 12. A Bayesian network for the medical check.
$P(C_i)$ are not described in the figure. However, we can evaluate them as follows.

$$P(C) = P(C|A)P(A) + P(C|\bar{A})P(\bar{A}) = P(C|A)P(A) + P(C|\bar{A})\left[1 - P(A)\right] = 0.98 \times 0.05 + 0.05 \times (1 - 0.05) = 0.0965 \qquad (109)$$

$$P(\bar{C}) = 1 - P(C) = 1 - 0.0965 = 0.9035 \qquad (110)$$

We want to obtain the conditional probabilities $P(A|C)$ and $P(A|\bar{C})$. $P(A|C)$ is given by

$$P(A|C) = \frac{P(C|A)P(A)}{P(C)} = \frac{0.98 \times 0.05}{0.0965} = 0.508 \qquad (111)$$

Figure 13. A Bayesian network with one hierarchy: one child node and two parent nodes.
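The single-parent computation of Eqs. (105)-(111) can be checked with a few lines; this is a minimal sketch whose variable names are ours.

```python
# Medical-check network: P(C), P(~C), and P(A|C).
p_A = 0.05                 # P(A): prior probability of being sick
p_C_given_A = 0.98         # P(C|A): positive reaction when sick
p_C_given_notA = 0.05      # P(C|~A): false-positive rate

p_C = p_C_given_A * p_A + p_C_given_notA * (1 - p_A)   # Eq. (109)
p_A_given_C = p_C_given_A * p_A / p_C                  # Eq. (111)
print(round(p_C, 4), round(1 - p_C, 4), round(p_A_given_C, 3))
# -> 0.0965 0.9035 0.508
```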
We extend the Bayesian network to one hierarchy with one child node and two parent nodes, as shown in Figure 13. The nodes are denoted as $A_i$, $B_i$, and $C_i$. $A_i$ and $B_i$ are causes of $C_i$. $C_i$ is called a child node of the nodes $A_i$ and $B_i$, and the nodes $A_i$ and $B_i$ are called parent nodes of the node $C_i$. When the event related to $A_i$ occurs, we express it as $A_1 = A$; when it does not occur, we express it as $A_0 = \bar{A}$. Summarizing the above, we can express the status of each node as

$$A_i = \begin{cases} A & (i = 1) \\ \bar{A} & (i = 0) \end{cases} \qquad (112)$$

$$B_i = \begin{cases} B & (i = 1) \\ \bar{B} & (i = 0) \end{cases} \qquad (113)$$

$$C_i = \begin{cases} C & (i = 1) \\ \bar{C} & (i = 0) \end{cases} \qquad (114)$$

We assign probabilities to each node as follows. The data associated with the node $A_i$ are given by

$$P(A) = 0.05 \qquad (115)$$

and

$$P(\bar{A}) = 1 - P(A) = 0.95 \qquad (116)$$

The data associated with the node $B_i$ are given by

$$P(B) = 0.03 \qquad (117)$$

and

$$P(\bar{B}) = 1 - P(B) = 0.97 \qquad (118)$$

The data associated with the node $C_i$ should be $P(C|A_i \cap B_i)$, which is related to the events of $A_i$ and $B_i$. $P(\bar{C}|A_i \cap B_i)$ is then given by

$$P(\bar{C}|A_i \cap B_i) = 1 - P(C|A_i \cap B_i) \qquad (119)$$

The above is shown in Figure 13. Using these fundamental data, we then evaluate the probabilities below.

$$P(C) = P(C|\bar{A} \cap \bar{B})P(\bar{A})P(\bar{B}) + P(C|\bar{A} \cap B)P(\bar{A})P(B) + P(C|A \cap \bar{B})P(A)P(\bar{B}) + P(C|A \cap B)P(A)P(B) = 0.01 \times (1-0.05)(1-0.03) + 0.30 \times (1-0.05) \times 0.03 + 0.80 \times 0.05 \times (1-0.03) + 0.95 \times 0.05 \times 0.03 = 0.058 \qquad (120)$$

We implicitly assume that the events $A_i$ and $B_j$ are independent and use

$$P(A_i \cap B_j) = P(A_i)P(B_j) \qquad (121)$$

We further obtain

$$P(\bar{C}) = 1 - P(C) = 0.942 \qquad (122)$$

We also evaluate conditional probabilities. We first evaluate $P(A|C)$, which is given by

$$P(A|C) = \frac{P(C|A)P(A)}{P(C)} \qquad (123)$$
We know $P(C)$ and $P(A)$, but we do not know $P(C|A)$, which can be expanded as

$$P(C|A) = P(C|A \cap \bar{B})P(\bar{B}) + P(C|A \cap B)P(B) = 0.80 \times (1-0.03) + 0.95 \times 0.03 = 0.805 \qquad (124)$$

Therefore, we obtain

$$P(A|C) = \frac{P(C|A)P(A)}{P(C)} = \frac{0.805 \times 0.05}{0.058} = 0.694 \qquad (125)$$

We can obtain the other conditional probability in a similar way as

$$P(B|C) = \frac{P(C|B)P(B)}{P(C)} = \frac{\left[P(C|\bar{A} \cap B)P(\bar{A}) + P(C|A \cap B)P(A)\right]P(B)}{P(C)} = \frac{(0.30 \times 0.95 + 0.95 \times 0.05) \times 0.03}{0.058} = 0.172 \qquad (126)$$

We then automatically obtain

$$P(\bar{A}|C) = 1 - P(A|C) = 0.306 \qquad (127)$$

$$P(\bar{B}|C) = 1 - P(B|C) = 0.828 \qquad (128)$$
We can further obtain

$$P(A|\bar{C}) = \frac{P(\bar{C}|A)P(A)}{P(\bar{C})} = \frac{\left[1 - P(C|A)\right]P(A)}{P(\bar{C})} = \frac{(1 - 0.805) \times 0.05}{0.942} = 0.010 \qquad (129)$$

$$P(B|\bar{C}) = \frac{P(\bar{C}|B)P(B)}{P(\bar{C})} = \frac{\left[1 - P(C|B)\right]P(B)}{P(\bar{C})} = \frac{\left[1 - (0.30 \times 0.95 + 0.95 \times 0.05)\right] \times 0.03}{0.942} = 0.021 \qquad (130)$$

We can then automatically obtain

$$P(\bar{A}|\bar{C}) = 1 - P(A|\bar{C}) = 0.990 \qquad (131)$$

$$P(\bar{B}|\bar{C}) = 1 - P(B|\bar{C}) = 0.979 \qquad (132)$$
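A minimal sketch of the two-parent network of Figure 13 follows; the conditional table `p_C_tab[a, b]` = $P(C|A_a, B_b)$ is the one used in Eq. (120), and the helper names are ours.

```python
# Two-parent network: P(C) and the conditional probabilities of Eqs. (120)-(129).
p_A, p_B = 0.05, 0.03
p_C_tab = {(0, 0): 0.01, (0, 1): 0.30, (1, 0): 0.80, (1, 1): 0.95}

def p(a):            # P(A_a) for a in {0, 1}
    return p_A if a else 1 - p_A

def q(b):            # P(B_b) for b in {0, 1}
    return p_B if b else 1 - p_B

p_C = sum(p_C_tab[a, b] * p(a) * q(b) for a in (0, 1) for b in (0, 1))  # Eq. (120)
p_C_given_A = sum(p_C_tab[1, b] * q(b) for b in (0, 1))                 # Eq. (124)
p_C_given_B = sum(p_C_tab[a, 1] * p(a) for a in (0, 1))                 # inside Eq. (126)
print(round(p_C, 3))                                  # 0.058
print(round(p_C_given_A * p_A / p_C, 3))              # P(A|C)  = 0.694
print(round(p_C_given_B * p_B / p_C, 3))              # P(B|C)  = 0.172
print(round((1 - p_C_given_A) * p_A / (1 - p_C), 3))  # P(A|~C) = 0.010
```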
Figure 14. A Bayesian network with node $D_i$ added to the one in Figure 13.
5.2. A Bayesian Network with Multiple Hierarchies

We considered one hierarchy in the previous section, where there is one child node with two parent nodes, that is, there is only one child node. We now extend the network by adding a node $D_i$, as shown in Figure 14. Node $C_i$ is a child node of nodes $A_i$ and $B_i$, but it is also a parent node of node $D_i$.

Since we add the node $D_i$, we need to know the related probabilities, which we assume are given by

$$P(D|\bar{C}) = 0.01 \qquad (133)$$

$$P(D|C) = 0.70 \qquad (134)$$

We can easily evaluate $P(D)$ as

$$P(D) = P(D|\bar{C})P(\bar{C}) + P(D|C)P(C) = 0.01 \times 0.942 + 0.70 \times 0.058 = 0.050 \qquad (135)$$

We also need conditional probabilities such as $P(A|D)$, which can be evaluated as

$$P(A|D) = \frac{P(D|A)P(A)}{P(D)} \qquad (136)$$

In this equation, the only unknown term is $P(D|A)$. The route is $A_i \to C_i \to D_i$. The problem is how we treat the intermediate node $C_i$, which can be $C$ or $\bar{C}$. Therefore, $P(D|A)$ can be expanded as

$$P(D|A) = P(D|C)P(C|A) + P(D|\bar{C})P(\bar{C}|A) = P(D|C)P(C|A) + P(D|\bar{C})\left[1 - P(C|A)\right] \qquad (137)$$

Since $P(C|A)$ also depends on $B_i$, we do not know its explicit expression, but we can obtain it as follows:

$$P(C|A) = P(C|A \cap \bar{B})P(\bar{B}) + P(C|A \cap B)P(B) = 0.80 \times (1 - 0.03) + 0.95 \times 0.03 = 0.8045 \qquad (138)$$

Therefore, we obtain $P(D|A)$ as

$$P(D|A) = P(D|C)P(C|A) + P(D|\bar{C})\left[1 - P(C|A)\right] = 0.70 \times 0.8045 + 0.01 \times (1 - 0.8045) = 0.565 \qquad (139)$$

This can be generally expressed as

$$P(D|A) = \sum_i P(D|C_i)P(C_i|A) \qquad (140)$$

We can easily extend this procedure to any network system.
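The chain $A \to C \to D$ of Eqs. (133)-(140) extends the previous sketch directly; the snippet below reuses `p_A`, `p_C`, and `p_C_given_A` from the two-parent example above.

```python
# Marginalize the intermediate node C along the route A -> C -> D.
p_D_given_C, p_D_given_notC = 0.70, 0.01

p_D = p_D_given_C * p_C + p_D_given_notC * (1 - p_C)            # Eq. (135)
p_D_given_A = (p_D_given_C * p_C_given_A
               + p_D_given_notC * (1 - p_C_given_A))            # Eqs. (137)-(139)
p_A_given_D = p_D_given_A * p_A / p_D                           # Eq. (136)
print(round(p_D, 3), round(p_D_given_A, 3), round(p_A_given_D, 3))
# -> 0.050 0.565 0.565
```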
We consider the network shown in Figure 15, where we add a node $F_i$. The conditional probability $P(A|F)$ can be evaluated as

$$P(A|F) = \frac{P(F|A)P(A)}{P(F)} \qquad (141)$$

where

$$P(F|A) = \sum_{i,j} P(F|D_i)P(D_i|C_j)P(C_j|A) \qquad (142)$$

One problem with a long Bayesian network is that the number of terms increases significantly, as shown in Eq. (142).

Figure 15. A Bayesian network with node $F_i$ added to the one in Figure 14.
We treat one more example, shown in Figure 16. The target conditional probability is $P(A|G)$. The corresponding routes are $A \to C \to D \to F \to G$ and $A \to C \to E \to F \to G$. We can then evaluate it as

$$P(A|G) = \frac{P(G|A)P(A)}{P(G)}, \qquad P(G|A) = \sum_{i,j,k} P(G|F_i)P(F_i|D_j)P(D_j|C_k)P(C_k|A) \qquad (143)$$

where $F$ depends on both $D$ and $E$, and hence the related terms are expanded as

$$P(F_i|D_j) = P(F_i|D_j \cap E)P(E) + P(F_i|D_j \cap \bar{E})P(\bar{E}) \qquad (144)$$

$$P(F_i|E_j) = P(F_i|E_j \cap D)P(D) + P(F_i|E_j \cap \bar{D})P(\bar{D}) \qquad (145)$$

$C$ depends on both $A$ and $B$, and hence the related term is expanded as

$$P(C_k|A) = P(C_k|A \cap B)P(B) + P(C_k|A \cap \bar{B})P(\bar{B}) \qquad (146)$$

Figure 16. A Bayesian network that has various routes.
SUMMARY

To summarize the results in this chapter:

We consider $m$ kinds of events $A_i$ ($i = 1, 2, \dots, m$). When we have an event $E$, the probability for the related cause $A_i$ is expressed by

$$P(A_i|E) = \frac{P(A_i)P(E|A_i)}{P(A_1)P(E|A_1) + P(A_2)P(E|A_2) + \cdots + P(A_m)P(E|A_m)}$$

One cause or member $k$ is selected; the corresponding probability $P(A_k)$ is 1 and the other probabilities are 0. However, we do not know the cause, that is, we do not know which $i$ is $k$. We update the $P(A_i)$ by obtaining results $E$. We have some kinds of $E$, denoted as $E_a$ and $E_b$ for example, and we know the conditional probabilities $P(E_a|A_i)$ and $P(E_b|A_i)$. The resultant events $E_a, E_b, \dots$ reflect $A_k$. We assume that we obtain the events $E_a, E_a, E_b$. Based on the results, we update the $P(A_i)$ as

$$P(A_i|E_a) = \frac{P(A_i)P(E_a|A_i)}{P(A_1)P(E_a|A_1) + \cdots + P(A_m)P(E_a|A_m)} \to P(A_i)$$

$$P(A_i|E_a) = \frac{P(A_i)P(E_a|A_i)}{P(A_1)P(E_a|A_1) + \cdots + P(A_m)P(E_a|A_m)} \to P(A_i)$$

$$P(A_i|E_b) = \frac{P(A_i)P(E_b|A_i)}{P(A_1)P(E_b|A_1) + \cdots + P(A_m)P(E_b|A_m)} \to P(A_i)$$

We can expect that the predominant $i$ is the required $k$.

A Bayesian network expresses the relationship between causes and results graphically. The probability for an event is related to the events in the previous step, where the previous events may be one or many. We can evaluate the probabilities relating any two events in the network.
Chapter 7
A BAYES' THEOREM FOR PREDICTING POPULATION PARAMETERS

ABSTRACT

We predict probability distributions associated with macro parameters of a set, using prior/likelihood/posterior distributions. We first form the likelihood distribution and multiply it by a prior distribution; the resultant posterior distribution gives the information associated with the macro parameters of the population set. We show an analytical approach first, where conjugate distributions are recommended. We then show a numerical approach, the Markov chain Monte Carlo (MCMC) method. In the MCMC method, there are two prominent approaches: Gibbs sampling and the Metropolis-Hastings algorithm. This process is extended to procedures with various parameters in the likelihood function, constrained and controlled by the prior distribution function, which is called a hierarchical Bayesian theory.

Keywords: Bayes' theorem, conditional probability, prior probability, likelihood function, posterior probability, binomial distribution, beta distribution, beta function, multinomial distribution, Dirichlet distribution, normal function, inverse gamma distribution, gamma function, Markov chain Monte Carlo, Gibbs sampling, Metropolis-Hastings algorithm, hierarchical Bayesian theory
1. INTRODUCTION

Figure 1 shows the schematic targets of the Bayesian theory up to this chapter and the target of this chapter. In Bayesian updating, we must know the probability of each candidate cause; for example, we need to specify the ratio of black balls in a vase. However, we may want to determine the ratio itself without assuming it. In general, we want to predict macro parameters of a population set. We show analytical and numerical models to decide the parameters in this chapter.
Figure 1. The target of Bayesian updating is to identify the vase kinds. The target of the theory in this chapter is to predict the black ball ratio in the vase.
2. A MAXIMUM LIKELIHOOD METHOD

A method to predict the macro parameters of a population set consists of likelihood and prior distributions. We first discuss the likelihood function. The maximum likelihood method is generally utilized to predict the parameters of a population set. We treat the macro parameter of the population set as a variable and evaluate the probability of the obtained data; we then impose that the parameter takes the value giving the maximum probability. We start with this maximum likelihood method.
2.1. Example for a Maximum Likelihood Method

We show two examples for a maximum likelihood method here.

2.2. Number of Rabbits on a Hill

We want to know the total number of rabbits on a hill. We assume that the total number of rabbits is $\nu$, which is unknown and should be evaluated as follows. We catch 15 rabbits, mark them, and release them. After a while, we catch $n = 30$ rabbits and find $x = 5$ marked ones among them.

The corresponding probability that we get $x$ marked rabbits can be evaluated as

$$f(x) = \frac{{}_{15}C_x \; {}_{\nu-15}C_{30-x}}{{}_{\nu}C_{30}} \qquad (1)$$

This is the probability distribution for a variable $x$. However, we obtained $x = 5$, and the unknown variable is the total number of rabbits $\nu$. Therefore, we regard the function of Eq. (1) as expressing a probability with respect to $\nu$, and hence denote it as

$$L(\nu) = \frac{{}_{15}C_5 \; {}_{\nu-15}C_{25}}{{}_{\nu}C_{30}} \qquad (2)$$

Figure 2 shows the dependence of this probability on $\nu$. It has a peak at $\nu = 89.5$. We regard that the observed situation occurs with the maximum probability, and hence we guess that there are about 89.5 rabbits on the hill. It should be noted that the form of Eq. (2) is the one for $x$, and hence it is normalized with respect to $x$, not to $\nu$; the integration of $L$ in Figure 2 therefore deviates from 1. The rigorous normalized expression is obtained by dividing Eq. (2) by its sum over $\nu$:

$$L(\nu) = \frac{{}_{15}C_5 \; {}_{\nu-15}C_{25} \,/\, {}_{\nu}C_{30}}{\sum_{\nu'} {}_{15}C_5 \; {}_{\nu'-15}C_{25} \,/\, {}_{\nu'}C_{30}} \qquad (3)$$

If we do not care about the normalization and focus on the peak position, Eq. (2) is sufficient.

Figure 2. Dependence of $L$ associated with the rabbit number on $\nu$.
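The peak of Eq. (2) can be located by a direct scan; below is a minimal sketch with our own scan range.

```python
# Mark-recapture likelihood: 15 marked, 30 recaptured, 5 of them marked.
from math import comb

def L(nu):
    return comb(15, 5) * comb(nu - 15, 25) / comb(nu, 30)

best = max(range(40, 151), key=L)
print(best, round(L(best), 4))
# The maximum sits at nu = 89 or 90 (the two are tied, hence "~89.5").
```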
2.3. Decompression Sickness

Decompression sickness is a serious problem for divers. We want to know the probability that a diver feels a decompression problem as a function of water depth. The data we can get are water depths and the related feelings associated with decompression. If the diver feels a problem, the datum is YES, and NO otherwise. We then obtain data such as in Table 1.

Table 1. Data of water depth and the divers' feelings associated with decompression

Member ID   Depth (m)   Result
1           34          YES
2           34          NO
3           34          NO
4           34          NO
5           35          NO
6           35          NO
7           36          YES
8           36          NO
9           37          YES
10          37          YES
11          38          YES
12          38          YES

Next, we should define the probability that a diver feels a decompression problem. The probability should be small at small water depths, increase with increasing depth, and then saturate. We therefore assume the form of the probability as

$$f(x, \theta) = \frac{x}{x + \theta} \qquad (4)$$

where $x$ is the water depth and $\theta$ corresponds to the depth at which 50% of divers feel decompression problems. On the other hand, the probability that a diver does not feel a decompression problem is given by

$$1 - f(x, \theta) \qquad (5)$$

That is, the data YES are related to $f(x, \theta)$, and the data NO are related to $1 - f(x, \theta)$. The probability that we obtain the data in Table 1 is denoted as $L$ and is given by

$$L(\theta) = f(34, \theta)\left[1 - f(34, \theta)\right]^3\left[1 - f(35, \theta)\right]^2 f(36, \theta)\left[1 - f(36, \theta)\right] f(37, \theta)^2 f(38, \theta)^2 \qquad (6)$$

Figure 3 shows the dependence of $L$ on $\theta$. $L$ has a peak at $\theta = 35.5$. Therefore, the probability is given by

$$f(x) = \frac{x}{x + 35.5} \qquad (7)$$

Figure 3. Dependence of $L$ associated with decompression sickness on $\theta$ (m).
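A sketch of the scan behind Eqs. (4)-(7) follows; the data pairs encode Table 1, and the grid is our own choice.

```python
# Maximum likelihood for the decompression model f(x, theta) = x / (x + theta).
import numpy as np

data = [(34, 1), (34, 0), (34, 0), (34, 0), (35, 0), (35, 0),
        (36, 1), (36, 0), (37, 1), (37, 1), (38, 1), (38, 1)]

def likelihood(theta):
    f = lambda x: x / (x + theta)            # Eq. (4)
    p = 1.0
    for x, yes in data:
        p *= f(x) if yes else 1 - f(x)       # Eq. (6)
    return p

thetas = np.linspace(30, 50, 2001)
print(round(thetas[np.argmax([likelihood(t) for t in thetas])], 1))  # ~35.5
```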
3. PREDICTION OF MACRO PARAMETERS OF A SET WITH CONJUGATE PROBABILITY FUNCTIONS 3.1. Theoretical Framework We studied Bayesian updating in the previous chapter, where we evaluate the probability for the subject, and eventually decide which is the cause for the result. We treat a different situation in this chapter. We want to know macro parameters of a set based on extracted data and our experience. The prominent macro parameters for the set are average, variance, or ratio. We show how we can evaluate them in this chapter.
The procedure is similar to the Bayesian updating discussed in the previous chapter. However, we use probability distributions instead of probabilities. We re-consider the Bayes' theorem, which is given by

$$P(H|D) = \frac{P(D|H)P(H)}{P(D)} \qquad (8)$$

where $H$ is the cause (hypothesis) and $D$ is the result (data) caused by $H$. We regard that $H$ is decided by a macroscopic parameter of the set, such as the average. Therefore, $H$ can be regarded as the macroscopic parameter of the set, and we denote it as $\theta$. We modify the Bayes' theorem as

$$\pi(\theta|D) = \frac{f(D|\theta)\pi(\theta)}{P(D)} \qquad (9)$$

We call $\pi(\theta)$ the prior probability distribution, which expresses the probability that we obtain $\theta$. $f(D|\theta)$ is the likelihood probability distribution, with which we have the data $D$ under the macroscopic value of $\theta$. $\pi(\theta|D)$ is the posterior probability distribution, which is the product of $f(D|\theta)$ and $\pi(\theta)$. $P(D)$ is independent of $\theta$; we neglect it and express Eq. (9) as

$$\pi(\theta|D) = K f(D|\theta)\pi(\theta) \qquad (10)$$

where we decide $K$ so that the integration of the probability with respect to $\theta$ is one.

In the above theory, there are some ambiguous points. We do not know accurately how to define $\pi(\theta)$, which depends on the person's image or experience. We use a simple one with which we can easily obtain $\pi(\theta|D)$ from $\pi(\theta)$ and $f(D|\theta)$, and impose that the forms of all the distributions $\pi(\theta)$, $f(D|\theta)$, and $\pi(\theta|D)$ are the same. That is, the product of two functions of the same form is again a function of that form. There are limited functions with such characteristics; we show such combinations in Table 2. We basically know the likelihood function, and require that $\pi(\theta)$ and $\pi(\theta|D)$ are the same type of distribution function, which is called a conjugate function. If we regard the likelihood function as a function of the variable $\theta$, the corresponding function type is different, which is shown in brackets in the table. In those cases, all the types of the distributions $\pi(\theta)$, $f(D|\theta)$, and $\pi(\theta|D)$ are the same.

Table 2. The combinations of prior, likelihood, and posterior functions

Prior $\pi(\theta)$        Likelihood $f(D|\theta)$                         Posterior $\pi(\theta|D)$
Beta function              Binomial function (Beta function)                Beta function
Dirichlet distribution     Multinomial distribution (Dirichlet)            Dirichlet distribution
Normal function            Normal function (Normal function)                Normal function
Inverse Gamma function     Normal function (Inverse Gamma function)         Inverse Gamma function
Gamma function             Poisson function (Gamma function)                Gamma function
3.2. Beta/Binomial (Beta)/Beta Distribution

We study an example of Bernoulli trials and evaluate the population ratio. We obtain two kinds of events, 1 or 0. $\theta$ corresponds to the probability that we obtain the event 1 in each trial.

We assume that we obtain the target event 1 $x$ times among $n$ trials. The corresponding probability with respect to $x$ is a binomial function and is given by

$$f(x) = {}_{n}C_x\, \theta^x (1-\theta)^{n-x} \qquad (11)$$

We then convert this probability distribution function to the one for $\theta$ under the data $D = x$, which is the likelihood function denoted as $f(D|\theta)$. Since the factor ${}_{n}C_x$ does not include the variable parameter $\theta$, it should be changed so that the total probability with respect to $\theta$ is one; the factor is denoted as $K$. We then obtain the likelihood function as

$$f(D|\theta) = K\theta^x(1-\theta)^{n-x} \qquad (12)$$

We can decide $K$ by imposing that the total integration of the probability is one, which is given by

$$\int_0^1 f(D|\theta)\,d\theta = K\int_0^1 \theta^x(1-\theta)^{n-x}\,d\theta = 1 \qquad (13)$$

We then obtain the corresponding likelihood function, which is the Beta distribution $Be(\alpha_L, \beta_L)$ and is given by

$$f(D|\theta) = Be(\alpha_L, \beta_L) = \frac{\theta^{\alpha_L - 1}(1-\theta)^{\beta_L - 1}}{B(\alpha_L, \beta_L)} \qquad (14)$$

where

$$\alpha_L = x + 1 \qquad (15)$$

$$\beta_L = n - x + 1 \qquad (16)$$

$B(\alpha, \beta)$ is a Beta function given by

$$B(\alpha, \beta) = \int_0^1 \theta^{\alpha - 1}(1-\theta)^{\beta - 1}\,d\theta \qquad (17)$$

The Beta distribution has parameters $\alpha$ and $\beta$. We use a Beta distribution function $Be(\alpha_0, \beta_0)$ as the prior probability distribution, which is given by

$$\pi(\theta) = Be(\alpha_0, \beta_0) = \frac{\theta^{\alpha_0 - 1}(1-\theta)^{\beta_0 - 1}}{B(\alpha_0, \beta_0)} \qquad (18)$$
There is no clear reason for choosing the parameters $\alpha_0$ and $\beta_0$, and it is not so easy to decide them directly. It is convenient to decide the corresponding average and standard deviation and convert them to $\alpha_0$ and $\beta_0$. The relationships between the average and variance and the parameters are given by

$$\mu_0 = \frac{\alpha_0}{\alpha_0 + \beta_0} \qquad (19)$$

$$\sigma_0^2 = \frac{\alpha_0\beta_0}{(\alpha_0 + \beta_0)^2(\alpha_0 + \beta_0 + 1)} \qquad (20)$$

We then obtain

$$\alpha_0 = \mu_0\left[\frac{\mu_0(1-\mu_0)}{\sigma_0^2} - 1\right] \qquad (21)$$

$$\beta_0 = (1-\mu_0)\left[\frac{\mu_0(1-\mu_0)}{\sigma_0^2} - 1\right] \qquad (22)$$

We obtain a posterior distribution as

$$\pi(\theta|D) \propto f(D|\theta)\pi(\theta) \propto \theta^{\alpha_L - 1}(1-\theta)^{\beta_L - 1}\,\theta^{\alpha_0 - 1}(1-\theta)^{\beta_0 - 1} = \theta^{\alpha_0 + \alpha_L - 2}(1-\theta)^{\beta_0 + \beta_L - 2} \qquad (23)$$

Finally, we obtain

$$\pi(\theta|D) = \frac{\theta^{\alpha^* - 1}(1-\theta)^{\beta^* - 1}}{B(\alpha^*, \beta^*)} \qquad (24)$$

where

$$\alpha^* = \alpha_0 + \alpha_L - 1 \qquad (25)$$

$$\beta^* = \beta_0 + \beta_L - 1 \qquad (26)$$

The average and variance of $\pi(\theta|D)$, denoted as $\mu$ and $\sigma^2$, are given by

$$\mu = \frac{\alpha^*}{\alpha^* + \beta^*} \qquad (27)$$

$$\sigma^2 = \frac{\alpha^*\beta^*}{(\alpha^* + \beta^*)^2(\alpha^* + \beta^* + 1)} \qquad (28)$$
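A minimal sketch of the Beta/Binomial update follows. The data ($x = 7$ successes in $n = 10$ trials) and the prior mean and standard deviation are our own illustrative numbers, converted via Eqs. (21)-(22).

```python
# Beta/Binomial conjugate update, Eqs. (14)-(28).
mu0, var0 = 0.5, 0.2 ** 2
c = mu0 * (1 - mu0) / var0 - 1
a0, b0 = mu0 * c, (1 - mu0) * c                  # Eqs. (21)-(22)

n, x = 10, 7
aL, bL = x + 1, n - x + 1                        # Eqs. (15)-(16)
a, b = a0 + aL - 1, b0 + bL - 1                  # Eqs. (25)-(26)

mean = a / (a + b)                               # Eq. (27)
var = a * b / ((a + b) ** 2 * (a + b + 1))       # Eq. (28)
print(round(a, 3), round(b, 3), round(mean, 3), round(var, 4))
```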
3.3. Dirichlet/Multinomial (Dirichlet)/Dirichlet Distribution

The binomial distribution is extended to a multinomial distribution when there are more than two kinds of events. In general, we need to consider multiple events of more than two kinds. Let us consider $m$ kinds of events; the corresponding probability for $P(X_1 = x_1, X_2 = x_2, \dots, X_m = x_m)$ is given by

$$f(x_1, x_2, \dots, x_m) = \frac{n!}{x_1!x_2!\cdots x_m!}\,p_1^{x_1}p_2^{x_2}\cdots p_m^{x_m} \qquad (29)$$

where

$$p_1 + p_2 + \cdots + p_m = 1 \qquad (30)$$

$$x_1 + x_2 + \cdots + x_m = n \qquad (31)$$

We then convert this probability distribution to the one for the probabilities $\theta_i$ under the data $D = (x_1, x_2, \dots, x_m)$, which is the likelihood function denoted as $f(\theta_1, \dots, \theta_m|D)$. We obtain the corresponding likelihood function, which is the Dirichlet distribution given by

$$f(\theta_1, \theta_2, \dots, \theta_m|D) = \frac{\Gamma(\gamma_1 + \gamma_2 + \cdots + \gamma_m)}{\Gamma(\gamma_1)\Gamma(\gamma_2)\cdots\Gamma(\gamma_m)}\,\theta_1^{\gamma_1 - 1}\theta_2^{\gamma_2 - 1}\cdots\theta_m^{\gamma_m - 1} \qquad (32)$$

where

$$\gamma_i = x_i + 1 \qquad (33)$$

We use a Dirichlet distribution as the prior probability distribution, which is given by

$$\pi(\theta_1, \theta_2, \dots, \theta_m) = \frac{\Gamma(\gamma_{10} + \gamma_{20} + \cdots + \gamma_{m0})}{\Gamma(\gamma_{10})\Gamma(\gamma_{20})\cdots\Gamma(\gamma_{m0})}\,\theta_1^{\gamma_{10} - 1}\theta_2^{\gamma_{20} - 1}\cdots\theta_m^{\gamma_{m0} - 1} \qquad (34)$$

We obtain a posterior distribution as

$$\pi(\theta_1, \theta_2, \dots, \theta_m|D) \propto \theta_1^{\gamma_1 - 1}\cdots\theta_m^{\gamma_m - 1}\,\theta_1^{\gamma_{10} - 1}\cdots\theta_m^{\gamma_{m0} - 1} = \theta_1^{\gamma_1 + \gamma_{10} - 2}\cdots\theta_m^{\gamma_m + \gamma_{m0} - 2} \qquad (35)$$

Finally, we obtain

$$\pi(\theta_1, \theta_2, \dots, \theta_m|D) = \frac{\Gamma(\gamma_1^* + \gamma_2^* + \cdots + \gamma_m^*)}{\Gamma(\gamma_1^*)\Gamma(\gamma_2^*)\cdots\Gamma(\gamma_m^*)}\,\theta_1^{\gamma_1^* - 1}\theta_2^{\gamma_2^* - 1}\cdots\theta_m^{\gamma_m^* - 1} \qquad (36)$$

where

$$\gamma_i^* = \gamma_i + \gamma_{i0} - 1 \qquad (37)$$
3.4. Gamma/Poisson (Gamma)/Gamma Distribution

We also study an example of Bernoulli trials, but here we do not evaluate the population ratio; instead we evaluate the average number of times the target event occurs, where the probability that one event occurs is quite small. In this situation, we can regard that the likelihood function follows the Poisson distribution. We assume that we obtain the target event $x$ times among $n$ trials, and the probability that we obtain the target event in one trial is as small as $p$. The corresponding probability $f(x)$ with respect to $x$ is a Poisson distribution given by

$$f(x) = \frac{\theta^x e^{-\theta}}{x!} \qquad (38)$$

where $\theta$ is the average target event number and is given by

$$\theta = np \qquad (39)$$

We then convert this probability distribution function to the one for the average target event number $\theta$ under the data $D = x$, which is the likelihood function denoted as $f(D|\theta)$. Since the factor $1/x!$ does not include the variable parameter $\theta$, it should be changed so that the total probability with respect to $\theta$ is one; we denote the factor as $K$. We then obtain the likelihood function as

$$f(D|\theta) = K\theta^x e^{-\theta} \qquad (40)$$

We can decide $K$ by imposing that the total integration of the probability is one, which is given by

$$\int_0^\infty f(D|\theta)\,d\theta = K\int_0^\infty \theta^x e^{-\theta}\,d\theta = 1 \qquad (41)$$

We then obtain the corresponding likelihood function. It is a Gamma distribution and is given by

$$f(D|\theta) = \frac{\lambda_L^{n_L}\,\theta^{n_L - 1}e^{-\lambda_L\theta}}{\Gamma(n_L)} \qquad (42)$$

where

$$n_L = x + 1 \qquad (43)$$

$$\lambda_L = 1 \qquad (44)$$

$\Gamma(n)$ is a Gamma function given by

$$\Gamma(n) = \int_0^\infty \theta^{n-1}e^{-\theta}\,d\theta \qquad (45)$$

We usually use a Gamma distribution as the prior probability distribution, which is given by

$$\pi(\theta) = \frac{\lambda_0^{n_0}\,\theta^{n_0 - 1}e^{-\lambda_0\theta}}{\Gamma(n_0)} \qquad (46)$$

It is not so easy to decide the prior probability parameters $n_0$ and $\lambda_0$. It is convenient to decide the corresponding average and standard deviation and convert them to $n_0$ and $\lambda_0$. The relationships between the average and variance and the parameters are given by

$$\mu_0 = \frac{n_0}{\lambda_0} \qquad (47)$$

$$\sigma_0^2 = \frac{n_0}{\lambda_0^2} \qquad (48)$$

We then obtain

$$\lambda_0 = \frac{\mu_0}{\sigma_0^2} \qquad (49)$$

$$n_0 = \frac{\mu_0^2}{\sigma_0^2} \qquad (50)$$

We obtain the posterior probability distribution as

$$\pi(\theta|D) \propto f(D|\theta)\pi(\theta) \propto \theta^{n_L - 1}e^{-\lambda_L\theta}\,\theta^{n_0 - 1}e^{-\lambda_0\theta} = \theta^{n_0 + n_L - 2}e^{-(\lambda_0 + \lambda_L)\theta} \qquad (51)$$

Finally, we obtain the posterior distribution as

$$\pi(\theta|D) = \frac{\lambda^{*n^*}\,\theta^{n^* - 1}e^{-\lambda^*\theta}}{\Gamma(n^*)} \qquad (52)$$

where

$$n^* = n_0 + n_L - 1 \qquad (53)$$

$$\lambda^* = \lambda_0 + \lambda_L \qquad (54)$$

The average and variance are given by

$$\mu = \frac{n^*}{\lambda^*} \qquad (55)$$

$$\sigma^2 = \frac{n^*}{\lambda^{*2}} \qquad (56)$$
3.5. Normal/Normal/Normal Distribution

If we obtain a data set $x_1, x_2, \dots, x_n$, we may want to predict the average of the set, knowing only that the data follow a normal distribution. The corresponding probability that we obtain $D = (x_1, x_2, \dots, x_n)$ is given by

$$f(D) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(x_1 - \mu)^2}{2\sigma^2}\right]\times\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(x_2 - \mu)^2}{2\sigma^2}\right]\times\cdots\times\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(x_n - \mu)^2}{2\sigma^2}\right] \qquad (57)$$

We assume that we know the variance $\sigma^2$ but do not know the average. We therefore replace $\mu$ with $\theta$. Then the probability is expressed as

$$f(D|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(x_1 - \theta)^2}{2\sigma^2}\right]\times\cdots\times\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(x_n - \theta)^2}{2\sigma^2}\right] \qquad (58)$$

We then convert this probability distribution function to the one for the average $\theta$ under the data $D$, which is the likelihood function denoted as $f(D|\theta)$. Since the factor $(2\pi\sigma^2)^{-n/2}$ does not include the variable parameter $\theta$, it should be changed so that the total probability with respect to $\theta$ is one; we denote the factor as $K$ and obtain the likelihood function as

$$f(D|\theta) = K\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \theta)^2\right] = K'\exp\left[-\frac{(\theta - \bar{x})^2}{2\sigma^2/n}\right] \qquad (59)$$

where

$$K' = K\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2\right]\exp\left[\frac{n\bar{x}^2}{2\sigma^2}\right] \qquad (60)$$

We can decide $K'$ by imposing that the total integration of the probability is one, which is given by

$$\int_{-\infty}^{\infty} f(D|\theta)\,d\theta = 1 \qquad (61)$$

Therefore, we obtain the corresponding likelihood function as

$$f(D|\theta) = \frac{1}{\sqrt{2\pi\sigma_L^2}}\exp\left[-\frac{(\theta - \mu_L)^2}{2\sigma_L^2}\right] \qquad (62)$$

where

$$\sigma_L^2 = \frac{\sigma^2}{n} \qquad (63)$$

$$\mu_L = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \qquad (64)$$

We then assume the prior distribution

$$\pi(\theta) = \frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left[-\frac{(\theta - \mu_0)^2}{2\sigma_0^2}\right] \qquad (65)$$

The corresponding posterior distribution is given by

$$\pi(\theta|D) \propto f(D|\theta)\pi(\theta) \propto \exp\left[-\frac{(\theta - \mu_0)^2}{2\sigma_0^2} - \frac{(\theta - \mu_L)^2}{2\sigma_L^2}\right] \qquad (66)$$

Therefore, we obtain the posterior distribution as

$$\pi(\theta|D) = \frac{1}{\sqrt{2\pi\sigma^{*2}}}\exp\left[-\frac{(\theta - \mu^*)^2}{2\sigma^{*2}}\right] \qquad (67)$$

where

$$\frac{1}{\sigma^{*2}} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma_L^2} \qquad (68)$$

$$\mu^* = \sigma^{*2}\left(\frac{\mu_0}{\sigma_0^2} + \frac{\mu_L}{\sigma_L^2}\right) \qquad (69)$$
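Eqs. (62)-(69) amount to a precision-weighted combination of the prior and the data; the following sketch illustrates this with synthetic data and prior numbers of our own choosing.

```python
# Normal/Normal conjugate update for the mean with known variance.
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0                                  # known variance of the data
x = rng.normal(5.0, np.sqrt(sigma2), 30)      # synthetic data set

mu_L, var_L = x.mean(), sigma2 / len(x)       # Eqs. (63)-(64)
mu0, var0 = 0.0, 10.0                         # a broad prior
var_post = 1 / (1 / var0 + 1 / var_L)         # Eq. (68)
mu_post = var_post * (mu0 / var0 + mu_L / var_L)   # Eq. (69)
print(round(mu_post, 3), round(var_post, 4))
```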
3.6. Inverse Gamma/Normal (Inverse Gamma)/Inverse Gamma Distribution

In the previous section, we studied how to predict the average of a set, assuming that we know its variance. We now study the case where we do not know the variance of the set but know the average $\mu$. If we obtain a data set $x_1, x_2, \dots, x_n$, we want to predict the variance of the set, knowing only that the data follow normal distributions. The corresponding probability that we obtain $D = (x_1, x_2, \dots, x_n)$ is given by

$$f(D) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(x_i - \mu)^2}{2\sigma^2}\right] \qquad (70)$$

We regard that we know the average $\mu$ but do not know the variance $\sigma^2$. We therefore replace $\sigma^2$ with $\theta$. Then the probability is expressed as

$$f(D|\theta) = \left(\frac{1}{2\pi\theta}\right)^{n/2}\exp\left[-\frac{1}{2\theta}\sum_{i=1}^n (x_i - \mu)^2\right] \propto \frac{1}{\theta^{n/2}}\exp\left[-\frac{1}{\theta}\left\{\frac{n(\bar{x} - \mu)^2}{2} + \frac{1}{2}\left(\sum_{i=1}^n x_i^2 - n\bar{x}^2\right)\right\}\right] \qquad (71)$$

where

$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \qquad (72)$$

Therefore, we obtain the corresponding likelihood function as

$$f(D|\theta) = \frac{\lambda_L^{n_L}}{\Gamma(n_L)}\frac{\exp(-\lambda_L/\theta)}{\theta^{n_L + 1}} \qquad (73)$$

where

$$n_L = \frac{n}{2} - 1 \qquad (74)$$

$$\lambda_L = \frac{n(\bar{x} - \mu)^2}{2} + \frac{1}{2}\left(\sum_{i=1}^n x_i^2 - n\bar{x}^2\right) \qquad (75)$$

This is the inverse Gamma distribution. We assume a prior distribution for the variance $\theta = \sigma^2$ given by

$$\pi(\theta) = \frac{\lambda_0^{n_0}}{\Gamma(n_0)}\frac{\exp(-\lambda_0/\theta)}{\theta^{n_0 + 1}} \qquad (76)$$

It is not so easy to decide the prior probability parameters $n_0$ and $\lambda_0$. It is convenient to decide the corresponding average and standard deviation and convert them to $n_0$ and $\lambda_0$. The relationships between the average and variance and the parameters are given by

$$\mu_0 = \frac{\lambda_0}{n_0 - 1} \qquad (77)$$

$$\sigma_0^2 = \frac{\lambda_0^2}{(n_0 - 1)^2(n_0 - 2)} \qquad (78)$$

We then obtain

$$n_0 = 2 + \frac{\mu_0^2}{\sigma_0^2} \qquad (79)$$

$$\lambda_0 = \mu_0\left(1 + \frac{\mu_0^2}{\sigma_0^2}\right) \qquad (80)$$

The corresponding posterior distribution is given by

$$\pi(\theta|D) \propto f(D|\theta)\pi(\theta) \propto \frac{\exp(-\lambda_L/\theta)}{\theta^{n_L + 1}}\frac{\exp(-\lambda_0/\theta)}{\theta^{n_0 + 1}} = \frac{\exp\left[-(\lambda_0 + \lambda_L)/\theta\right]}{\theta^{n_0 + n_L + 2}} \qquad (81)$$

Therefore, we obtain the posterior distribution as

$$\pi(\theta|D) = \frac{\lambda^{*n^*}}{\Gamma(n^*)}\frac{\exp(-\lambda^*/\theta)}{\theta^{n^* + 1}} \qquad (82)$$

where

$$n^* = n_0 + n_L + 1 \qquad (83)$$

$$\lambda^* = \lambda_0 + \lambda_L \qquad (84)$$

The average and variance are given by

$$\mu = \frac{\lambda^*}{n^* - 1} \qquad (85)$$

$$\sigma^2 = \frac{\lambda^{*2}}{(n^* - 1)^2(n^* - 2)} \qquad (86)$$
4. PROBABILITY DISTRIBUTION WITH TWO PARAMETERS

In the previous section, we showed how to decide one parameter in a posterior probability distribution. In that analysis, we select a convenient distribution (a conjugate distribution) to obtain a rather simple posterior probability distribution with one parameter. However, in general we want to treat distributions with many parameters, and we need a procedure to obtain them. We show the treatment for two parameters in this section, and extend it to more than two parameters in a later section.

We treat two parameters: an average $\theta_1$ and a variance $\theta_2$. The posterior distribution for the average is given by

$$\pi(\theta_1|D) = \frac{1}{\sqrt{2\pi\sigma^{*2}}}\exp\left[-\frac{(\theta_1 - \mu^*)^2}{2\sigma^{*2}}\right] \qquad (87)$$

where

$$\frac{1}{\sigma^{*2}} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma_L^2} \qquad (88)$$

$$\mu^* = \sigma^{*2}\left(\frac{\mu_0}{\sigma_0^2} + \frac{\mu_L}{\sigma_L^2}\right) \qquad (89)$$

$$\sigma_L^2 = \frac{\theta_2}{n} \qquad (90)$$

$$\mu_L = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \qquad (91)$$

The posterior distribution for the variance is given by

$$\pi(\theta_2|D) = \frac{\lambda^{*n^*}}{\Gamma(n^*)}\frac{\exp(-\lambda^*/\theta_2)}{\theta_2^{n^* + 1}} \qquad (92)$$

where

$$n^* = n_0 + n_L + 1 \qquad (93)$$

$$\lambda^* = \lambda_0 + \lambda_L \qquad (94)$$

$$n_L = \frac{n}{2} - 1 \qquad (95)$$

$$\lambda_L = \frac{n(\theta_1 - \bar{x})^2}{2} + \frac{1}{2}\left(\sum_{i=1}^n x_i^2 - n\bar{x}^2\right) \qquad (96)$$

The corresponding schematic distribution is shown in Figure 4.

Figure 4. Probability distribution with two parameters, $\theta_1$ and $\theta_2$.

We need to decide the parameters $\theta_1$ and $\theta_2$, where each appears as a constant in one equation although it is the variable in the other. We show various methods to decide them in the following sections. To perform a concrete calculation, we use the data set shown in Table 3. The data number is 30, and the sample average $\bar{x}$ and unbiased variance $s^2$ are given by

$$\bar{x} = \frac{1}{30}\sum_{i=1}^{30} x_i = 5.11 \qquad (97)$$

$$s^2 = \frac{1}{30 - 1}\sum_{i=1}^{30}(x_i - \bar{x})^2 = 4.77 \qquad (98)$$

where $x_i$ is a data value in the table.

Table 3. Data for evaluating two macro parameters of a set

Data ID  Data    Data ID  Data    Data ID  Data
1        6.0     11       3.3     21       3.4
2        10.0    12       7.6     22       3.8
3        7.6     13       5.8     23       3.3
4        3.5     14       6.7     24       5.7
5        1.4     15       2.8     25       6.3
6        2.5     16       4.8     26       8.4
7        5.6     17       6.3     27       4.6
8        3.0     18       5.3     28       2.8
9        2.2     19       5.4     29       7.9
10       5.0     20       3.3     30       8.9

Average 5.11, Variance 4.77
4.1. MCMC Method (A Gibbs Sampling Method)

A Gibbs sampling method gives us the parameter values by a numerical iterative method, generating random numbers that follow the posterior probability distributions. We write again the posterior distribution for the average:

$$\pi(\theta_1|D, \theta_2^{(0)}) = \frac{1}{\sqrt{2\pi\sigma^{*2}}}\exp\left[-\frac{(\theta_1 - \mu^*)^2}{2\sigma^{*2}}\right] \qquad (99)$$

where

$$\frac{1}{\sigma^{*2}} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma_L^2} \qquad (100)$$

$$\mu^* = \sigma^{*2}\left(\frac{\mu_0}{\sigma_0^2} + \frac{\mu_L}{\sigma_L^2}\right) \qquad (101)$$

$$\sigma_L^2 = \frac{\theta_2^{(0)}}{n} \qquad (102)$$

$$\mu_L = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \qquad (103)$$

We set the parameters as below:

$$\mu_0 = \bar{x} = 5.11 \qquad (104)$$

$$\sigma_0^2 = \frac{s^2}{n} = \frac{4.77}{30} = 0.16 \qquad (105)$$

$$\mu_L = \bar{x} = 5.11 \qquad (106)$$

We should also set $\theta_2$ for evaluating $\theta_1$. We set its initial value as

$$\theta_2^{(0)} = s^2 = 4.77 \qquad (107)$$

Using the distribution function, we generate a random number for $\theta_1$, which is denoted as $\theta_1^{(1)}$ and is expressed by

$$\theta_1^{(1)} = \mathrm{randInv}\left[\pi(\theta_1|D, \theta_2^{(0)})\right] = 5.00 \qquad (108)$$

Eq. (108) expresses a random number based on the probability function $\pi(\theta_1|D, \theta_2^{(0)})$. The procedure for obtaining a random number based on a given distribution function is described in Chapter 14 of Volume 3.
Figure 5. Probability distribution with two parameters, $\theta_1$ and $\theta_2$. (a) $\theta_2$ is set and $\theta_1$ is generated. (b) $\theta_1$ is set and $\theta_2$ is generated.
We can generate a random number associated with a certain probability distribution $P(\theta)$ using a uniform random number as follows. We integrate $P(\theta)$ and form a function $F(\theta)$ as

$$F(\theta) = \int_0^{\theta} P(\theta')\,d\theta' \qquad (109)$$

We generate a uniform random number and obtain the corresponding $\theta$ from $F(\theta)$, as shown in Figure 6.

Figure 6. Random number generation related to the probability distribution $P(\theta)$ using the inverse function method.
The posterior probability distribution for the variance is then given by

$$\pi(\theta_2|D, \theta_1^{(1)}) = \frac{\lambda^{*n^*}}{\Gamma(n^*)}\frac{\exp(-\lambda^*/\theta_2)}{\theta_2^{n^* + 1}} \qquad (110)$$

where

$$n^* = n_0 + n_L + 1 \qquad (111)$$

$$\lambda^* = \lambda_0 + \lambda_L \qquad (112)$$

$$n_L = \frac{n}{2} - 1 \qquad (113)$$

$$\lambda_L = \frac{n(\theta_1^{(1)} - \bar{x})^2}{2} + \frac{1}{2}\left(\sum_{i=1}^n x_i^2 - n\bar{x}^2\right) \qquad (114)$$

We should set $n_0$ and $\lambda_0$. We assume an average of the variance $\mu_0$ as

$$\mu_0 = s^2 \qquad (115)$$

and a variance of the variance as

$$\sigma_0^2 = s^4 \qquad (116)$$

Therefore, we obtain

$$n_0 = 2 + \frac{\mu_0^2}{\sigma_0^2} = 3 \qquad (117)$$

$$\lambda_0 = \mu_0\left(1 + \frac{\mu_0^2}{\sigma_0^2}\right) = 4.77 \times 2 = 9.54 \qquad (118)$$

We can further obtain

$$n_L = \frac{n}{2} - 1 = \frac{30}{2} - 1 = 14 \qquad (119)$$

$$n^* = n_0 + n_L + 1 = 3 + 14 + 1 = 18 \qquad (120)$$

$$\sum_{i=1}^n x_i^2 = 920.6 \qquad (121)$$

$$\lambda_L = \frac{30 \times (5.00 - 5.11)^2}{2} + \frac{1}{2}\left(920.6 - 30 \times 5.11^2\right) \approx 68.8 \qquad (122)$$

Using the distribution function, we generate a random number for $\theta_2$ based on this distribution function and denote it as $\theta_2^{(1)}$, which is given by

$$\theta_2^{(1)} = \mathrm{randInv}\left[\pi(\theta_2|D, \theta_1^{(1)})\right] = 4.19 \qquad (123)$$

We then move to the updated $\pi(\theta_1|D, \theta_2^{(1)})$, where $\theta_2^{(1)}$ is used, and repeat the above process many times. In the initial stage of the cycle, the results are influenced by the initial values; therefore, we ignore the first 1000 data, for example. The remaining data express the distributions of $\theta_1$ and $\theta_2$. We perform the Gibbs sampling for 20000 cycles, delete the data of the first 1000 cycles, and obtain the results shown in Figure 7. The corresponding average and variance values are given by

$$\theta_1 = 5.11, \qquad \theta_2 = 4.61 \qquad (124)$$
Figure 7. Gibbs sampling. (a) Average, (b) Variance.
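A minimal sketch of the Gibbs sampler of this section is shown below, alternating draws from Eqs. (99) and (110). We assume SciPy is available; `invgamma(a, scale=lam)` matches the inverse Gamma form used in the text, and the seed and variable names are ours.

```python
# Gibbs sampling for the mean (theta1) and variance (theta2) of Table 3.
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(1)
x = np.array([6.0, 10.0, 7.6, 3.5, 1.4, 2.5, 5.6, 3.0, 2.2, 5.0,
              3.3, 7.6, 5.8, 6.7, 2.8, 4.8, 6.3, 5.3, 5.4, 3.3,
              3.4, 3.8, 3.3, 5.7, 6.3, 8.4, 4.6, 2.8, 7.9, 8.9])
n, xbar, s2 = len(x), x.mean(), x.var(ddof=1)

mu0, var0 = xbar, s2 / n                # prior for the mean, Eqs. (104)-(105)
n0, lam0 = 3.0, 2 * s2                  # prior for the variance, Eqs. (117)-(118)

theta1, theta2 = xbar, s2               # initial values, Eq. (107)
draws = []
for _ in range(20000):
    # Draw the mean given the variance, Eqs. (99)-(103):
    varL = theta2 / n
    var_post = 1 / (1 / var0 + 1 / varL)
    mu_post = var_post * (mu0 / var0 + xbar / varL)
    theta1 = rng.normal(mu_post, np.sqrt(var_post))
    # Draw the variance given the mean, Eqs. (110)-(114); shape n* = n0 + n/2:
    lamL = 0.5 * n * (theta1 - xbar) ** 2 + 0.5 * ((x ** 2).sum() - n * xbar ** 2)
    theta2 = invgamma.rvs(n0 + n / 2, scale=lam0 + lamL, random_state=rng)
    draws.append((theta1, theta2))

burned = np.array(draws[1000:])         # discard the burn-in period
print(burned.mean(axis=0))              # ~ (5.11, 4.6), matching Eq. (124)
```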
4.2. MCMC Method (An MH Method)

In the previous section, we obtained the parameters of posterior probability distributions by generating random numbers that follow the posterior distributions. However, the posterior distribution changes every cycle, and we must evaluate functions to generate the random numbers that follow it. A Metropolis-Hastings (MH) algorithm gives us almost the same results without generating random numbers that directly follow the posterior probability distribution. We utilize the detailed balance condition as follows.

We define a probability distribution $P(x_i)$ where we have a status $x_i$. We also define a transition function $W(x_i \to x_{i+1})$ that expresses the transition probability with which the status changes from $x_i$ to $x_{i+1}$. The detailed balance condition requires the following:

$$P(x_i)W(x_i \to x_{i+1}) = P(x_{i+1})W(x_{i+1} \to x_i) \qquad (125)$$

If the probability $P(x_i)$ is larger than $P(x_{i+1})$, the transition from the status $x_i$ to the status $x_{i+1}$ is small, and vice versa. In the MH algorithm, this process is expressed simply: the transition occurs readily when $P(x_{i+1})$ is larger than $P(x_i)$, and rarely when $P(x_{i+1})$ is smaller than $P(x_i)$, which is shown below in more detail.

Figure 8. Transition probability $W$. The arrow length qualitatively expresses the strength of the transition. (a) The probability gradient is positive. (b) The probability gradient is negative.
Figure 8 shows the schematic transition probability. Two cases are shown: the probability gradient is positive in (a) and negative in (b). We select the next step using a random number, and hence $x_{i+1}$ has a chance of lying on either side of the location of $x_i$. We then decide whether the transition occurs or not. We define the ratio $r$ given by

$$r = \frac{P(x_{i+1})}{P(x_i)} \qquad (126)$$

When $r$ is larger than 1, we always change the status from $x_i$ to $x_{i+1}$. When $r$ is smaller than 1, we generate a random number $q$, uniform in the range $(0, 1)$, and compare $r$ with $q$: when $r$ is larger than $q$, we change the status from $x_i$ to $x_{i+1}$; when $r$ is smaller than $q$, we do not change the status. This process ensures that we stay long in states where the corresponding probability is high, and vice versa. We utilize this process in the MH algorithm shown below.

We use the same posterior probability distribution given by

$$\pi(\theta_1|D, \theta_2^{(0)}) = \frac{1}{\sqrt{2\pi\sigma^{*2}}}\exp\left[-\frac{(\theta_1 - \mu^*)^2}{2\sigma^{*2}}\right] \qquad (127)$$

where we set the initial values for $\theta_1^{(0)}$ and $\theta_2^{(0)}$ as

$$\theta_1^{(0)} = \bar{x} = 5.11 \qquad (128)$$

$$\theta_2^{(0)} = s^2 = 4.77 \qquad (129)$$

The related parameters are given by
$$\frac{1}{\sigma^{*2}} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma_L^2} \qquad (130)$$

$$\mu^* = \sigma^{*2}\left(\frac{\mu_0}{\sigma_0^2} + \frac{\mu_L}{\sigma_L^2}\right) \qquad (131)$$

$$\mu_0 = \bar{x} = 5.11 \qquad (132)$$

$$\sigma_0^2 = \frac{s^2}{n} = \frac{4.77}{30} = 0.16 \qquad (133)$$

$$\mu_L = \bar{x} = 5.11 \qquad (134)$$

$$\sigma_L^2 = \frac{\theta_2^{(0)}}{n} = \frac{\theta_2^{(0)}}{30} \qquad (135)$$

Using the parameters above, we can evaluate the posterior probability associated with the average, Eq. (127). The posterior distribution associated with the variance is given by

$$\pi(\theta_2|D, \theta_1^{(0)}) = \frac{\lambda^{*n^*}}{\Gamma(n^*)}\frac{\exp(-\lambda^*/\theta_2)}{\theta_2^{n^* + 1}} \qquad (136)$$

where

$$n^* = n_0 + n_L + 1 \qquad (137)$$

$$\lambda^* = \lambda_0 + \lambda_L \qquad (138)$$
$$n_L = \frac{n}{2} - 1 \qquad (139)$$

$$\lambda_L = \frac{n(\theta_1^{(0)} - \bar{x})^2}{2} + \frac{1}{2}\left(\sum_{i=1}^n x_i^2 - n\bar{x}^2\right) \qquad (140)$$

We should set $n_0$ and $\lambda_0$. We assume an average of the variance $\mu_0$ as

$$\mu_0 = s^2 \qquad (141)$$

and a variance of the variance as

$$\sigma_0^2 = s^4 \qquad (142)$$

Therefore, we obtain

$$n_0 = 2 + \frac{\mu_0^2}{\sigma_0^2} = 3 \qquad (143)$$

$$\lambda_0 = \mu_0\left(1 + \frac{\mu_0^2}{\sigma_0^2}\right) = 4.77 \times 2 = 9.54 \qquad (144)$$

We can further obtain

$$n_L = \frac{n}{2} - 1 = \frac{30}{2} - 1 = 14 \qquad (145)$$
$$n^* = n_0 + n_L + 1 = 3 + 14 + 1 = 18 \qquad (146)$$

$$\sum_{i=1}^n x_i^2 = 920.6 \qquad (147)$$

$$\lambda_L = \frac{30 \times (5.11 - 5.11)^2}{2} + \frac{1}{2}\left(920.6 - 30 \times 5.11^2\right) \approx 68.6 \qquad (148)$$

Using the parameters above, we can evaluate the distribution function associated with the variance.

We then generate increments of the parameters using random numbers related to normal distributions $N(0, \delta_1)$ and $N(0, \delta_2)$, where $\delta_1$ and $\delta_2$ are the step widths for the average and the variance, respectively. We set the steps as

$$\delta_1 = \frac{\theta_1^{(0)}}{\kappa_1} \qquad (149)$$

$$\delta_2 = \frac{\theta_2^{(0)}}{\kappa_2} \qquad (150)$$

$\kappa_1$ and $\kappa_2$ are constant numbers, and we use 10 for both in this case. We then generate the tentative next values as

$$\theta_1^t = \theta_1^{(0)} + \mathrm{randInv}\left[N(0, \delta_1)\right] \qquad (151)$$

$$\theta_2^t = \theta_2^{(0)} + \mathrm{randInv}\left[N(0, \delta_2)\right] \qquad (152)$$

We then evaluate the two ratios given by
$$p_1 = \frac{\pi(\theta_1^t|D, \theta_2^{(0)})}{\pi(\theta_1^{(0)}|D, \theta_2^{(0)})} \qquad (153)$$

$$p_2 = \frac{\pi(\theta_2^t|D, \theta_1^{(0)})}{\pi(\theta_2^{(0)}|D, \theta_1^{(0)})} \qquad (154)$$

If $p_1$ and $p_2$ are larger than 1, we set

$$\theta_1^{(1)} = \theta_1^t \qquad (155)$$

$$\theta_2^{(1)} = \theta_2^t \qquad (156)$$

When $p_1$ or $p_2$ is smaller than 1, we generate a uniform random number denoted as $q$, that is,

$$q = \mathrm{rand}(\,) \qquad (157)$$

where $\mathrm{rand}(\,)$ means a uniform random number between 0 and 1. We then compare $p_1$ and $p_2$ with $q$, and decide the next-step parameters as follows:

$$\theta_1^{(1)} = \begin{cases} \theta_1^t & \text{for } p_1 \ge q \\ \theta_1^{(0)} & \text{for } p_1 < q \end{cases} \qquad (158)$$

$$\theta_2^{(1)} = \begin{cases} \theta_2^t & \text{for } p_2 \ge q \\ \theta_2^{(0)} & \text{for } p_2 < q \end{cases} \qquad (159)$$
We repeat the process many times. Figure 9 shows the schematic parameter generation in the MH algorithm.

Figure 9. Parameter generation for an MH algorithm method.

In the initial stage of the cycle, the results are influenced by the initial values for $\theta_1$ and $\theta_2$. Therefore, we ignore the first 1000 data, for example. The remaining data express the distributions of $\theta_1$ and $\theta_2$. The corresponding average and variance values are given by

$$\theta_1 = 5.19, \qquad \theta_2 = 4.76 \qquad (160)$$

Figure 10. A Metropolis-Hastings (MH) algorithm method. (a) Average, (b) Variance.

Figure 11 shows the algorithm flow of the MH method for one parameter.

Figure 11. Metropolis-Hastings algorithm flow.
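A minimal sketch of the MH updates of Eqs. (149)-(159) follows; it reuses `x`, `n`, `xbar`, `s2` and the prior settings (`mu0`, `var0`, `n0`, `lam0`) from the Gibbs snippet above, and its proposal widths and `abs` guard for the variance are our own pragmatic choices.

```python
# Metropolis-Hastings updates for (theta1, theta2).
import numpy as np
from scipy.stats import norm, invgamma

rng = np.random.default_rng(2)

def post_mean(t1, t2):                   # pi(theta1 | D, theta2), Eq. (127)
    varL = t2 / n
    var_p = 1 / (1 / var0 + 1 / varL)
    return norm.pdf(t1, var_p * (mu0 / var0 + xbar / varL), np.sqrt(var_p))

def post_var(t2, t1):                    # pi(theta2 | D, theta1), Eq. (136)
    lamL = 0.5 * n * (t1 - xbar) ** 2 + 0.5 * ((x ** 2).sum() - n * xbar ** 2)
    return invgamma.pdf(t2, n0 + n / 2, scale=lam0 + lamL)

t1, t2 = xbar, s2
chain = []
for _ in range(20000):
    c1 = t1 + rng.normal(0, t1 / 10)     # tentative values, Eqs. (151)-(152)
    c2 = abs(t2 + rng.normal(0, t2 / 10))   # keep the variance positive
    if post_mean(c1, t2) / post_mean(t1, t2) >= rng.random():
        t1 = c1                          # accept/reject, Eq. (158)
    if post_var(c2, t1) / post_var(t2, t1) >= rng.random():
        t2 = c2                          # accept/reject, Eq. (159)
    chain.append((t1, t2))
print(np.array(chain[1000:]).mean(axis=0))   # ~ (5.1, 4.7), cf. Eq. (160)
```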
4.3. Posterior Probability Distribution with Many Parameters

We can easily extend the procedures described in the previous sections to more than two parameters. The $l$ parameters are denoted as $\theta_1, \theta_2, \dots, \theta_l$. We need to know the posterior probability distribution for each parameter, where the other parameters are included as constant numbers. The posterior distributions are then expressed as

$$\pi(\theta_1|D, \theta_2, \theta_3, \dots, \theta_l), \quad \pi(\theta_2|D, \theta_1, \theta_3, \dots, \theta_l), \quad \dots, \quad \pi(\theta_i|D, \theta_1, \dots, \theta_{i-1}, \theta_{i+1}, \dots, \theta_l), \quad \dots, \quad \pi(\theta_l|D, \theta_1, \theta_2, \dots, \theta_{l-1}) \qquad (161)$$

In the Gibbs sampling method, we use these posterior probability distributions and set initial values $\theta_2^{(0)}, \theta_3^{(0)}, \dots, \theta_l^{(0)}$. We then generate the random numbers for the parameter values as

$$\theta_1^{(1)} = \mathrm{randInv}\left[\pi(\theta_1|D, \theta_2^{(0)}, \theta_3^{(0)}, \dots, \theta_l^{(0)})\right]$$

$$\theta_2^{(1)} = \mathrm{randInv}\left[\pi(\theta_2|D, \theta_1^{(1)}, \theta_3^{(0)}, \dots, \theta_l^{(0)})\right]$$

$$\vdots$$

$$\theta_i^{(1)} = \mathrm{randInv}\left[\pi(\theta_i|D, \theta_1^{(1)}, \dots, \theta_{i-1}^{(1)}, \theta_{i+1}^{(0)}, \dots, \theta_l^{(0)})\right]$$

$$\vdots$$

$$\theta_l^{(1)} = \mathrm{randInv}\left[\pi(\theta_l|D, \theta_1^{(1)}, \theta_2^{(1)}, \dots, \theta_{l-1}^{(1)})\right] \qquad (162)$$

where $\mathrm{randInv}[\pi]$ means generating a random number for the parameter based on the function $\pi$. We then repeat the process.

In the MH MC method, we use the same posterior probability distributions, and also set initial values for $\theta_1, \theta_2, \theta_3, \dots, \theta_l$ as $\theta_1^{(0)}, \theta_2^{(0)}, \theta_3^{(0)}, \dots, \theta_l^{(0)}$. We should also set the steps $\delta_1, \delta_2, \dots, \delta_l$ for each parameter. We generate tentative parameters as

$$\theta_1^t = \theta_1^{(0)} + \mathrm{randInv}\left[N(0, \delta_1)\right] \qquad (163)$$

$$\theta_2^t = \theta_2^{(0)} + \mathrm{randInv}\left[N(0, \delta_2)\right] \qquad (164)$$

$$\vdots$$

$$\theta_l^t = \theta_l^{(0)} + \mathrm{randInv}\left[N(0, \delta_l)\right] \qquad (165)$$

We then evaluate the ratios given by

$$p_1 = \frac{\pi(\theta_1^t|D, \theta_2^{(0)}, \theta_3^{(0)}, \dots, \theta_l^{(0)})}{\pi(\theta_1^{(0)}|D, \theta_2^{(0)}, \theta_3^{(0)}, \dots, \theta_l^{(0)})} \qquad (166)$$

$$p_2 = \frac{\pi(\theta_2^t|D, \theta_1^{(0)}, \theta_3^{(0)}, \dots, \theta_l^{(0)})}{\pi(\theta_2^{(0)}|D, \theta_1^{(0)}, \theta_3^{(0)}, \dots, \theta_l^{(0)})} \qquad (167)$$

$$\vdots$$

$$p_l = \frac{\pi(\theta_l^t|D, \theta_1^{(0)}, \theta_2^{(0)}, \dots, \theta_{l-1}^{(0)})}{\pi(\theta_l^{(0)}|D, \theta_1^{(0)}, \theta_2^{(0)}, \dots, \theta_{l-1}^{(0)})} \qquad (168)$$

We further generate a uniform random number $q^{(1)}$ as

$$q^{(1)} = \mathrm{rand}(\,) \qquad (169)$$

which is a uniform random number between 0 and 1. We decide the updated parameters as below, considering the $i$-th parameter in general. If the ratio $p_i$ is larger than 1, we update the value as

$$\theta_i^{(1)} = \theta_i^t \qquad (170)$$

When $p_i$ is smaller than 1, we compare it with $q^{(1)}$, and set $\theta_i^{(1)} = \theta_i^t$ when it is larger than $q^{(1)}$ and $\theta_i^{(1)} = \theta_i^{(0)}$ when it is smaller than $q^{(1)}$. That is, this is summarized as

$$\theta_i^{(1)} = \begin{cases} \theta_i^t & \text{for } p_i \ge 1 \text{ or } p_i \ge q^{(1)} \\ \theta_i^{(0)} & \text{for } p_i < q^{(1)} \end{cases} \qquad (171)$$

We repeat the above process and obtain $\theta_1, \theta_2, \theta_3, \dots, \theta_l$.

In the MH MC method, we do not need the posterior probability distribution of each parameter; it is sufficient that we have one posterior probability distribution that includes the whole set of parameters $\theta_1, \theta_2, \theta_3, \dots, \theta_l$ as

$$g(D; \theta_1, \theta_2, \theta_3, \dots, \theta_l) \qquad (172)$$

We can then regard the posterior probability distribution for each parameter as

$$\pi(\theta_1|D, \theta_2, \theta_3, \dots, \theta_l) = K_1\,g(D; \theta_1, \theta_2, \theta_3, \dots, \theta_l)$$

$$\pi(\theta_2|D, \theta_1, \theta_3, \dots, \theta_l) = K_2\,g(D; \theta_1, \theta_2, \theta_3, \dots, \theta_l)$$

$$\vdots$$

$$\pi(\theta_l|D, \theta_1, \theta_2, \dots, \theta_{l-1}) = K_l\,g(D; \theta_1, \theta_2, \theta_3, \dots, \theta_l) \qquad (173)$$

where the factor $K_i$ is the normalization factor, decided when we regard the parameter $\theta_i$ as a variable and the others as constants. However, we do not care about its value when we evaluate the ratio. We then apply the same procedures.
4.4. Hierarchical Bayesian Theory

The shape of the posterior distribution is limited by the functions we use for the likelihood and prior distributions. We can improve the flexibility of the shape using a hierarchical Bayesian theory.

We consider a case where 20 members take an examination, as shown in Figure 12. The examination consists of 10 problems, and each member selects an answer for each problem. The score of each member is the number of correct answers, denoted as $x_i$ for member $i$; the value of $x_i$ is therefore between 0 and 10. Figure 12 shows the assumed obtained data. The average $\mu_0$ and the standard deviation $\sigma_0$ of the examination are given by

$$\mu_0 = 5.85 \qquad (174)$$

$$\sigma_0 = 3.81 \qquad (175)$$

Figure 12. Score distribution and expected values (Data, Normal, Binomial, HBayes).

If we assume that the number of members at each score follows the normal distribution, the number of members with score $k$ is given by

$$N(k) = \frac{20}{\sqrt{2\pi\sigma_0^2}}\exp\left[-\frac{(k - \mu_0)^2}{2\sigma_0^2}\right] \qquad (176)$$

This can be modified as

$$N(k) = 20\,\frac{\exp\left[-\frac{(k - \mu_0)^2}{2\sigma_0^2}\right]}{\sum_{i=0}^{10}\exp\left[-\frac{(i - \mu_0)^2}{2\sigma_0^2}\right]} \qquad (177)$$

The score distribution is far from a bell shape and cannot be reproduced with the normal distribution, as shown in Figure 12. We try to explain the distribution with a hierarchical Bayes theory using a binomial distribution.
The probability that a member obtains a correct answer for each problem is assumed to be $\theta$, and we have the corresponding likelihood function given by

$$f(D|\theta) = {}_{n}C_{x_1}\theta^{x_1}(1-\theta)^{n-x_1}\times{}_{n}C_{x_2}\theta^{x_2}(1-\theta)^{n-x_2}\times\cdots\times{}_{n}C_{x_{20}}\theta^{x_{20}}(1-\theta)^{n-x_{20}} \propto \theta^{X}(1-\theta)^{10 \times 20 - X} \qquad (178)$$

where $n = 10$ and

$$X = \sum_{i=1}^{20} x_i = 117 \qquad (179)$$

The peak value of the likelihood function can be evaluated from

$$\frac{\partial f(D|\theta)}{\partial\theta} \propto X\theta^{X-1}(1-\theta)^{200-X} - (200 - X)\theta^{X}(1-\theta)^{200-X-1} = 0 \qquad (180)$$

We therefore obtain the peak of the likelihood function at

$$\theta_0 = \frac{X}{200} = 0.585 \qquad (181)$$

The expected number of members with score $k$ is then given by

$$N(k) = 20\,{}_{10}C_k\,\theta_0^k(1-\theta_0)^{10-k} \qquad (182)$$

Figure 12 compares the data with the evaluated values. The expected values deviate significantly. This is because we apply the same probability $\theta_0$ to every member. Such a treatment is valid for something like coin tossing with similar coins, but it cannot explain data such as those in the figure.
We assume that the peculiar shape of the distribution in Figure 12 comes from the various skills of the members. How can we include the different skills? We regard that each member comes from his own group with a different macro parameter $\theta_i$, which is shown schematically in Figure 13.

Figure 13. Schematic treatment for a hierarchical Bayes theory. We regard that each datum comes from a different group, and each group is controlled by the prior distribution.

We can add variety to the shape of the distribution by changing $\theta$ in each trial. Therefore, we modify the likelihood function as

$$f(D|\theta_1, \theta_2, \dots, \theta_{20}) = {}_{n}C_{x_1}\theta_1^{x_1}(1-\theta_1)^{n-x_1}\times{}_{n}C_{x_2}\theta_2^{x_2}(1-\theta_2)^{n-x_2}\times\cdots\times{}_{n}C_{x_{20}}\theta_{20}^{x_{20}}(1-\theta_{20})^{n-x_{20}} \qquad (183)$$

It should be noted that the number of undecided parameters is now 20 instead of 1. We could fit anything with this freedom; however, we add a constraint to the functions: the distribution of the $\theta_i$ is controlled by the prior distribution. A parameter $\theta_i$ is defined in the range between 0 and 1, and it is frequently converted as

$$\gamma_i = \ln\frac{\theta_i}{1-\theta_i} - \mu \qquad (184)$$

We then have

$$\theta_i = \frac{1}{1 + \exp\left[-(\gamma_i + \mu)\right]} \qquad (185)$$

Figure 14 shows the relationship between $\gamma$ and $\theta$ for various $\mu$. $\theta$ increases monotonically with increasing $\gamma$ and decreases with decreasing $\mu$. Note that the definition range for $\gamma$ is $(-\infty, \infty)$, which is suitable for a normal distribution.

Figure 14. Relationship between $\gamma$ and $\theta$ for various $\mu$ ($\mu = -1, 0, 1$).

We assume the normal distribution for $\gamma_i$ as

$$\pi(\gamma_i) = \frac{1}{\sqrt{2\pi\tau^2}}\exp\left[-\frac{\gamma_i^2}{2\tau^2}\right] \qquad (186)$$

where $\tau$ is also an undecided parameter. We introduced a parameter $\mu$, and it is assumed to be constant although its value is not decided at present; therefore, we do not define a probability distribution for $\mu$. We assume the prior distribution for the $\gamma_i$ as

$$\pi(\gamma_1, \gamma_2, \dots, \gamma_{20}) = \frac{1}{\sqrt{2\pi\tau^2}}\exp\left[-\frac{\gamma_1^2}{2\tau^2}\right]\times\frac{1}{\sqrt{2\pi\tau^2}}\exp\left[-\frac{\gamma_2^2}{2\tau^2}\right]\times\cdots\times\frac{1}{\sqrt{2\pi\tau^2}}\exp\left[-\frac{\gamma_{20}^2}{2\tau^2}\right] \qquad (187)$$

The posterior distribution is given by
$$\pi(\gamma_1, \gamma_2, \dots, \gamma_{20}|D) \propto \prod_{i=1}^{20}{}_{10}C_{x_i}\,\theta_i^{x_i}(1-\theta_i)^{10-x_i}\,\frac{1}{\sqrt{2\pi\tau^2}}\exp\left[-\frac{\gamma_i^2}{2\tau^2}\right] \qquad (188)$$

The probability density associated with the score $k$ is evaluated as

$$P(k) = {}_{10}C_k\,\theta^k(1-\theta)^{10-k}\,\frac{1}{\sqrt{2\pi\tau^2}}\exp\left[-\frac{\gamma^2}{2\tau^2}\right], \qquad \theta = \frac{1}{1 + e^{-(\gamma + \mu)}} \qquad (189)$$

Figure 15 (a) shows the dependence of $P(k)$ on $\gamma$ for two standard deviations $\tau$. The peak position increases with increasing $k$, and the profile becomes broad with increasing $\tau$. Figure 15 (b) shows the dependence of $P(k)$ on $\gamma$ for various $\mu$. The profile is symmetrical with respect to $k$ for $\mu = 0$; the probability for higher $k$ is larger for positive $\mu$, and vice versa.

Figure 15. Dependence of the probability for score $k$ on $\gamma$ ($k$ from 0 to 10 in steps of 2). (a) Two standard deviations, $\tau = 3$ and $6$, at $\mu = 0$. (b) Various $\mu$ ($\mu = 0, 1, -1$) at $\tau = 3$.
Neglecting the factors that are not related to the parameters, we obtain the function given by

$$L(\mu, \tau) = \prod_{i=1}^{20} L_i(\gamma_i, \mu, \tau) \qquad (190)$$

where

$$L_i(\gamma_i, \mu, \tau) = \theta_i^{x_i}(1-\theta_i)^{n-x_i}\,\frac{1}{\sqrt{2\pi\tau^2}}\exp\left[-\frac{\gamma_i^2}{2\tau^2}\right] = \left[\frac{1}{1+e^{-(\gamma_i+\mu)}}\right]^{x_i}\left[1 - \frac{1}{1+e^{-(\gamma_i+\mu)}}\right]^{n-x_i}\frac{1}{\sqrt{2\pi\tau^2}}\exp\left[-\frac{\gamma_i^2}{2\tau^2}\right] \qquad (191)$$

We first decide the parameters that do not depend on the member, which are $\mu$ and $\tau$. We eliminate the parameters $\gamma_i$ as

$$Q_1(\mu, \tau) = \int L_1(\gamma_1, \mu, \tau)\,d\gamma_1, \quad Q_2(\mu, \tau) = \int L_2(\gamma_2, \mu, \tau)\,d\gamma_2, \quad \dots, \quad Q_{20}(\mu, \tau) = \int L_{20}(\gamma_{20}, \mu, \tau)\,d\gamma_{20} \qquad (192)$$

We take the logarithm of the product, given by

$$G = \ln Q_1(\mu, \tau) + \ln Q_2(\mu, \tau) + \cdots + \ln Q_{20}(\mu, \tau) \qquad (193)$$

We can obtain the values of $\mu$ and $\tau$ that maximize $G$ using a solver in software such as Microsoft Excel; they are denoted as $\mu_0$ and $\tau_0$. In this case, we obtain

$$\mu_0 = 0.86, \qquad \tau_0 = 3.08 \qquad (194)$$

We can then perform the MHMC method and obtain the parameter values $\gamma_i$ as shown in Table 4, where we evaluated 5000 cycles of data and selected the last 4000. The expected score values are evaluated as

$$N(k) = 20\,\frac{1}{20}\sum_{i=1}^{20}{}_{10}C_k\,\theta_i^k(1-\theta_i)^{10-k} \qquad (195)$$

which is shown in Figure 12. The expected values reproduce the data well.

Table 4. Evaluated parameter values

Data ID:  1      2      3      4      5      6      7      8      9      10
γ:       -3.22  -4.87   3.80  -1.22   3.07   1.99   3.64  -0.28  -1.30   2.42
θ:        0.09   0.02   0.99   0.41   0.98   0.95   0.99   0.64   0.39   0.96
Score:    1      0      10     4      10     10     10     6      4      10

Data ID:  11     12     13     14     15     16     17     18     19     20
γ:       -2.76   1.32  -3.93  -0.97   4.92  -0.10  -2.80   2.57  -2.06   0.59
θ:        0.13   0.90   0.04   0.47   1.00   0.68   0.13   0.97   0.23   0.81
Score:    1      9      0      5      10     7      1      9      2      8
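The maximization of Eqs. (190)-(194) can be sketched by integrating out each $\gamma_i$ on a grid; the grids below are our own choices, so the result should only be close to the values reported in Eq. (194).

```python
# Marginal likelihood G(mu, tau) for the hierarchical model, maximized on a grid.
import numpy as np
from math import comb

scores = [1, 0, 10, 4, 10, 10, 10, 6, 4, 10, 1, 9, 0, 5, 10, 7, 1, 9, 2, 8]
g = np.linspace(-15, 15, 601)                     # integration grid for gamma
dg = g[1] - g[0]

def G(mu, tau):
    prior = np.exp(-g ** 2 / (2 * tau ** 2)) / np.sqrt(2 * np.pi * tau ** 2)
    th = 1 / (1 + np.exp(-(g + mu)))              # Eq. (185)
    total = 0.0
    for x in scores:                              # Eqs. (191)-(193)
        Li = comb(10, x) * th ** x * (1 - th) ** (10 - x) * prior
        total += np.log(Li.sum() * dg)            # Q_i by simple quadrature
    return total

mus, taus = np.linspace(0, 2, 41), np.linspace(1, 6, 51)
best = max(((m, t) for m in mus for t in taus), key=lambda p: G(*p))
print(best)   # close to (0.86, 3.08) of Eq. (194)
```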
SUMMARY

To summarize the results in this chapter:

We obtain data $x$, and the population parameter is denoted as $\theta$. We modify the Bayes' theorem as

$$\pi(\theta|D) = \frac{f(D|\theta)\pi(\theta)}{P(D)} = K f(D|\theta)\pi(\theta)$$

where $K$ is a normalization factor. We call $\pi(\theta)$ the prior probability distribution, which depends on a macroscopic parameter or is constant. $f(D|\theta)$ is the likelihood probability distribution, with which we have data $D$ under the macroscopic value of $\theta$. $\pi(\theta|D)$ is the posterior probability distribution. The convenient combinations of $\pi(\theta)$, $f(D|\theta)$, and $\pi(\theta|D)$, which are called conjugate functions, are as follows.

Beta/Binomial (Beta)/Beta Distribution

We assume that we obtain $x$ target events among $n$ trials; the corresponding likelihood function is given by

$$f(D|\theta) = \frac{\theta^{\alpha_L - 1}(1-\theta)^{\beta_L - 1}}{B(\alpha_L, \beta_L)}$$

where

$$\alpha_L = x + 1, \qquad \beta_L = n - x + 1$$

The prior probability distribution for this case is given by

$$\pi(\theta) = Be(\alpha_0, \beta_0) = \frac{\theta^{\alpha_0 - 1}(1-\theta)^{\beta_0 - 1}}{B(\alpha_0, \beta_0)}$$

The average and variance of the prior distribution are related to the parameters as

$$\alpha_0 = \mu_0\left[\frac{\mu_0(1-\mu_0)}{\sigma_0^2} - 1\right], \qquad \beta_0 = (1-\mu_0)\left[\frac{\mu_0(1-\mu_0)}{\sigma_0^2} - 1\right]$$

The posterior function is given by

$$\pi(\theta|D) = \frac{\theta^{\alpha^* - 1}(1-\theta)^{\beta^* - 1}}{B(\alpha^*, \beta^*)}$$

where

$$\alpha^* = \alpha_0 + \alpha_L - 1, \qquad \beta^* = \beta_0 + \beta_L - 1$$

The average and variance of $\pi(\theta|D)$, denoted as $\mu$ and $\sigma^2$, are given by

$$\mu = \frac{\alpha^*}{\alpha^* + \beta^*}, \qquad \sigma^2 = \frac{\alpha^*\beta^*}{(\alpha^* + \beta^*)^2(\alpha^* + \beta^* + 1)}$$

Dirichlet/Multinomial (Dirichlet)/Dirichlet Distribution

In general, we need to consider multiple events of more than two kinds. Let us consider $m$ kinds of events; the corresponding probability for $P(X_1 = x_1, X_2 = x_2, \dots, X_m = x_m)$ is given by

$$f(x_1, x_2, \dots, x_m) = \frac{n!}{x_1!x_2!\cdots x_m!}\,p_1^{x_1}p_2^{x_2}\cdots p_m^{x_m}$$

where

$$p_1 + p_2 + \cdots + p_m = 1, \qquad x_1 + x_2 + \cdots + x_m = n$$

The corresponding likelihood function is given by

$$f(\theta_1, \theta_2, \dots, \theta_m|D) = \frac{\Gamma(\gamma_1 + \cdots + \gamma_m)}{\Gamma(\gamma_1)\cdots\Gamma(\gamma_m)}\,\theta_1^{\gamma_1 - 1}\theta_2^{\gamma_2 - 1}\cdots\theta_m^{\gamma_m - 1}, \qquad \gamma_i = x_i + 1$$

The prior probability distribution for this case is given by

$$\pi(\theta_1, \theta_2, \dots, \theta_m) = \frac{\Gamma(\gamma_{10} + \cdots + \gamma_{m0})}{\Gamma(\gamma_{10})\cdots\Gamma(\gamma_{m0})}\,\theta_1^{\gamma_{10} - 1}\theta_2^{\gamma_{20} - 1}\cdots\theta_m^{\gamma_{m0} - 1}$$

Finally, we obtain the posterior distribution as

$$\pi(\theta_1, \theta_2, \dots, \theta_m|D) = \frac{\Gamma(\gamma_1^* + \cdots + \gamma_m^*)}{\Gamma(\gamma_1^*)\cdots\Gamma(\gamma_m^*)}\,\theta_1^{\gamma_1^* - 1}\theta_2^{\gamma_2^* - 1}\cdots\theta_m^{\gamma_m^* - 1}, \qquad \gamma_i^* = \gamma_i + \gamma_{i0} - 1$$

Gamma/Poisson (Gamma)/Gamma Distribution

We assume that we obtain $x$ target events among $n$ trials, and the average number we obtain is given by

$$\theta = np$$

where $p$ is the probability that we obtain the target event in each trial and is assumed to be small. The corresponding likelihood function is given by

$$f(D|\theta) = \frac{\theta^{n_L - 1}e^{-\theta}}{\Gamma(n_L)}, \qquad n_L = x + 1$$

The prior distribution $\pi(\theta)$ is given by

$$\pi(\theta) = \frac{\lambda_0^{n_0}\theta^{n_0 - 1}e^{-\lambda_0\theta}}{\Gamma(n_0)}$$

The average and variance of the prior distribution are related to the parameters as

$$\lambda_0 = \frac{\mu_0}{\sigma_0^2}, \qquad n_0 = \frac{\mu_0^2}{\sigma_0^2}$$

The posterior distribution is given by

$$\pi(\theta|D) = \frac{\lambda^{*n^*}\theta^{n^* - 1}e^{-\lambda^*\theta}}{\Gamma(n^*)}, \qquad n^* = n_0 + n_L - 1, \quad \lambda^* = \lambda_0 + 1$$

Normal/Normal/Normal Distribution

If we obtain $n$ data from a set as $x_1, x_2, \dots, x_n$, the corresponding likelihood function is given by

$$f(D|\theta) = \frac{1}{\sqrt{2\pi\sigma_L^2}}\exp\left[-\frac{(\theta - \mu_L)^2}{2\sigma_L^2}\right], \qquad \sigma_L^2 = \frac{\sigma^2}{n}, \quad \mu_L = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$$

The prior distribution is given by

$$\pi(\theta) = \frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left[-\frac{(\theta - \mu_0)^2}{2\sigma_0^2}\right]$$

The corresponding posterior distribution is given by

$$\pi(\theta|D) = \frac{1}{\sqrt{2\pi\sigma^{*2}}}\exp\left[-\frac{(\theta - \mu^*)^2}{2\sigma^{*2}}\right], \qquad \frac{1}{\sigma^{*2}} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma_L^2}, \quad \mu^* = \sigma^{*2}\left(\frac{\mu_0}{\sigma_0^2} + \frac{\mu_L}{\sigma_L^2}\right)$$

Inverse Gamma/Normal (Inverse Gamma)/Inverse Gamma Distribution

We extract a data set $x_1, x_2, \dots, x_n$, and the corresponding likelihood distribution is given by

$$f(D|\theta) = \frac{\lambda_L^{n_L}}{\Gamma(n_L)}\frac{\exp(-\lambda_L/\theta)}{\theta^{n_L + 1}}, \qquad n_L = \frac{n}{2} - 1, \quad \lambda_L = \frac{n(\bar{x} - \mu)^2}{2} + \frac{1}{2}\left(\sum_{i=1}^n x_i^2 - n\bar{x}^2\right)$$

The prior distribution is given by

$$\pi(\theta) = \frac{\lambda_0^{n_0}}{\Gamma(n_0)}\frac{\exp(-\lambda_0/\theta)}{\theta^{n_0 + 1}}$$

The average and variance of the prior distribution are related to the parameters as

$$n_0 = 2 + \frac{\mu_0^2}{\sigma_0^2}, \qquad \lambda_0 = \mu_0\left(1 + \frac{\mu_0^2}{\sigma_0^2}\right)$$

The posterior distribution is given by

$$\pi(\theta|D) = \frac{\lambda^{*n^*}}{\Gamma(n^*)}\frac{\exp(-\lambda^*/\theta)}{\theta^{n^* + 1}}, \qquad n^* = n_0 + n_L + 1, \quad \lambda^* = \lambda_0 + \lambda_L$$

We can easily extend these procedures to obtain many macro parameters of a population set. We need $l$ posterior distribution functions with $l$ parameters, denoted as

$$\pi(\theta_1|D, \theta_2, \theta_3, \dots, \theta_l), \quad \pi(\theta_2|D, \theta_1, \theta_3, \dots, \theta_l), \quad \dots, \quad \pi(\theta_l|D, \theta_1, \theta_2, \dots, \theta_{l-1})$$

In the Gibbs sampling method, the parameter values are updated by generating the corresponding random numbers as

$$\theta_i^{(k+1)} = \mathrm{randInv}\left[\pi(\theta_i|D, \theta_1^{(k+1)}, \theta_2^{(k+1)}, \dots, \theta_{i-1}^{(k+1)}, \theta_{i+1}^{(k)}, \dots, \theta_l^{(k)})\right]$$

In the MH method, the parameter values are updated as follows. We generate a tentative value as

$$\theta_i^t = \theta_i^{(k)} + \mathrm{randInv}\left[N(0, \delta_i)\right]$$

We then evaluate the ratio

$$p_i = \frac{\pi(\theta_i^t|D, \theta_1^{(k)}, \theta_2^{(k)}, \dots, \theta_{i-1}^{(k)}, \theta_{i+1}^{(k)}, \dots, \theta_l^{(k)})}{\pi(\theta_i^{(k)}|D, \theta_1^{(k)}, \theta_2^{(k)}, \dots, \theta_{i-1}^{(k)}, \theta_{i+1}^{(k)}, \dots, \theta_l^{(k)})}$$

We also generate a uniform random number $q^{(k+1)}$ as

$$q^{(k+1)} = \mathrm{rand}(\,)$$

We then update the parameter as

$$\theta_i^{(k+1)} = \begin{cases} \theta_i^t & \text{for } p_i \ge 1 \text{ or } p_i \ge q^{(k+1)} \\ \theta_i^{(k)} & \text{for } p_i < q^{(k+1)} \end{cases}$$

We can extend this theory to use various parameters in a likelihood function controlled by the prior function, which is called a hierarchical Bayes' theory.
Chapter 8
ANALYSIS OF VARIANCES

ABSTRACT

We sometimes want to improve a certain objective value by trying various treatments. For example, we apply various kinds of fertilizer to improve a harvest. We would probably judge the effectiveness of a treatment by comparing the corresponding averages of the objective values, i.e., the amounts of harvest. However, there is scattering in the data values within each treatment: even when one treatment has a better average, some data in that best treatment are smaller than some in the other treatments. We show how to decide the effectiveness of a treatment considering the data scattering.

Keywords: analysis of variance, correlation ratio, interaction, studentized range distribution
1. INTRODUCTION

In agriculture, we want to harvest much grain, so we try various treatments and want to clarify their effectiveness. One example factor is the kind of fertilizer. We denote the factor as $A$ and the kinds of fertilizer as $A_j$. We want to know the effectiveness of the various kinds of fertilizer, and we obtain harvest data for each fertilizer. Although the average values show clear differences among the fertilizers, the data are usually scattered within each fertilizer group. Analysis of variance was developed to decide the validity of the effectiveness. Further, we need to consider more factors for the harvest, such as temperature and humidity; therefore, we need to include various factors in the analysis of variance. We start with one factor and then extend to two factors.
2. CORRELATION RATIO (ONE-WAY ANALYSIS)

We study the dependence of a grain harvest on three kinds of fertilizer, $A_1$, $A_2$, and $A_3$. In general, we call the kinds of fertilizer levels. We use 1 kg of seed for each, harvest grain, and check the weight in kg. The data are shown in Table 1.

Table 1. Dependence of harvest on the kind of fertilizer

          A1      A2      A3
          5.44    7.33    9.44
          7.66    6.32    7.33
          7.55    6.76    7.66
          4.33    5.33    6.43
          5.33    5.89    7.45
          4.55    6.33    6.86
          6.33    5.44    9.78
          5.98    8.54    6.34
Average   5.90    6.49    7.66

The data numbers for the levels are given by

$$n_{A_1} = 8, \quad n_{A_2} = 8, \quad n_{A_3} = 8 \qquad (1)$$

$$n = n_{A_1} + n_{A_2} + n_{A_3} = 24 \qquad (2)$$

The corresponding averages are given by

$$\mu_{A_1} = \frac{1}{n_{A_1}}\sum_{i=1}^{n_{A_1}} x_{iA_1} = 5.90 \qquad (3)$$

$$\mu_{A_2} = \frac{1}{n_{A_2}}\sum_{i=1}^{n_{A_2}} x_{iA_2} = 6.49 \qquad (4)$$

$$\mu_{A_3} = \frac{1}{n_{A_3}}\sum_{i=1}^{n_{A_3}} x_{iA_3} = 7.66 \qquad (5)$$
Analysis of Variances
279
The total average is given by nA3 nA2 nA1 xiA1 xiA2 xiA3 6.68 i 1 i 1 i 1
1 nA1 nA2 nA3
(6)
The average harvested weight was 5.90 kg for fertilizer $A_1$, 6.49 kg for fertilizer $A_2$, and 7.66 kg for fertilizer $A_3$. It seems that fertilizer $A_3$ is the most effective and fertilizer $A_1$ the least effective. However, there is a datum for fertilizer $A_1$ that is larger than one for fertilizer $A_3$. How can we judge the effectiveness of the levels statistically? We compare the variance associated with the levels against the variance associated with the data within each level. If the former is larger than the latter, we judge that the level change is effective, and vice versa. Let us express the above discussion quantitatively. We evaluate the variance associated with the effectiveness of the levels, denoted as $S_{ex}^2$ and given by

$$S_{ex}^2 = \frac{n_{A_1}\left(\mu_{A_1}-\mu\right)^2+n_{A_2}\left(\mu_{A_2}-\mu\right)^2+n_{A_3}\left(\mu_{A_3}-\mu\right)^2}{n_{A_1}+n_{A_2}+n_{A_3}} = 0.53 \qquad (7)$$

We then consider the variance associated with the group in each level, denoted as $S_{in}^2$ and given by

$$S_{in}^2 = \frac{\sum_{i=1}^{n_{A_1}}\left(x_{iA_1}-\mu_{A_1}\right)^2+\sum_{i=1}^{n_{A_2}}\left(x_{iA_2}-\mu_{A_2}\right)^2+\sum_{i=1}^{n_{A_3}}\left(x_{iA_3}-\mu_{A_3}\right)^2}{n_{A_1}+n_{A_2}+n_{A_3}} = 1.27 \qquad (8)$$

The correlation ratio $\eta^2$ is given by

$$\eta^2 = \frac{S_{ex}^2}{S_{in}^2+S_{ex}^2} = 0.30 \qquad (9)$$
This indicates the significance of the dependence: if it is large, the effectiveness of changing the fertilizer is large. It is also obvious that the correlation ratio is less than 1. However, we do not use this correlation ratio for the final judgment of effectiveness; the test procedure is shown in the next section.
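As a check of Eqs. (7)-(9), the following short Python sketch reproduces $S_{ex}^2$, $S_{in}^2$, and $\eta^2$ from the Table 1 data (numpy assumed available; the variable names are ours, not from the text):

```python
import numpy as np

A1 = np.array([5.44, 7.66, 7.55, 4.33, 5.33, 4.55, 6.33, 5.98])
A2 = np.array([7.33, 6.32, 6.76, 5.33, 5.89, 6.33, 5.44, 8.54])
A3 = np.array([9.44, 7.33, 7.66, 6.43, 7.45, 6.86, 9.78, 6.34])
groups = [A1, A2, A3]
n = sum(len(g) for g in groups)
mu = np.concatenate(groups).mean()

S_ex = sum(len(g) * (g.mean() - mu) ** 2 for g in groups) / n   # between-level variance
S_in = sum(((g - g.mean()) ** 2).sum() for g in groups) / n     # within-level variance
eta2 = S_ex / (S_ex + S_in)                                     # correlation ratio
print(S_ex, S_in, eta2)   # about 0.53, 1.27, 0.30
```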
3. TESTING OF THE DEPENDENCE
We consider the freedom of $S_{ex}^2$, which we denote as $\nu_{ex}$. The variables in this variance are the three averages $\mu_{A_1}, \mu_{A_2}, \mu_{A_3}$. However, they must satisfy Eq. (6). Therefore, the freedom is the number of fertilizer kinds minus 1, and $\nu_{ex}$ is given by

$$\nu_{ex} = p-1 = 3-1 = 2 \qquad (10)$$

where $p$ is the number of fertilizer kinds, which is 3 in this case. Therefore, the unbiased variance $s_{ex}^2$ is given by

$$s_{ex}^2 = \frac{n}{\nu_{ex}}S_{ex}^2 = 6.45$$

We next consider the freedom of $S_{in}^2$, which we denote as $\nu_{in}$. Each group has $n_{A_1}, n_{A_2}, n_{A_3}$ data, but each group has its own sample average, so the freedom is decreased by 1 for each group. Therefore, the total freedom is given by

$$\nu_{in} = \left(n_{A_1}-1\right)+\left(n_{A_2}-1\right)+\left(n_{A_3}-1\right) = n-p = 24-3 = 21 \qquad (11)$$

The corresponding unbiased variance $s_{in}^2$ is given by

$$s_{in}^2 = \frac{n}{\nu_{in}}S_{in}^2 = 1.44 \qquad (12)$$

Finally, the ratio of the unbiased variances is denoted by $F$ and is given by

$$F = \frac{s_{ex}^2}{s_{in}^2} = 4.46 \qquad (13)$$

This follows the $F$ distribution with freedoms $\left(\nu_{ex}, \nu_{in}\right)$. The $P$ point for the $F$ distribution with the predictive probability of 95% is denoted as $F_C\left(\nu_{ex}, \nu_{in}, 0.95\right)$, and is

$$F_C\left(\nu_{ex}, \nu_{in}, 0.95\right) = 3.47 \qquad (14)$$

which is shown in Figure 1. Therefore, $F > F_C$, and hence the dependence is valid.
Figure 1. F distribution testing.
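Assuming scipy is available, the same test can be run directly: `scipy.stats.f_oneway` computes exactly the ratio of unbiased variances of Eq. (13), and `scipy.stats.f.ppf` returns the critical value of Eq. (14). The numbers match the text up to rounding.

```python
import numpy as np
from scipy import stats

A1 = np.array([5.44, 7.66, 7.55, 4.33, 5.33, 4.55, 6.33, 5.98])
A2 = np.array([7.33, 6.32, 6.76, 5.33, 5.89, 6.33, 5.44, 8.54])
A3 = np.array([9.44, 7.33, 7.66, 6.43, 7.45, 6.86, 9.78, 6.34])

F, pval = stats.f_oneway(A1, A2, A3)   # F about 4.46
Fc = stats.f.ppf(0.95, 2, 21)          # critical value about 3.47
print(F, Fc, F > Fc)                   # True: the dependence is valid
```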
4. THE RELATIONSHIP BETWEEN VARIOUS VARIANCES
We treated two kinds of variances in the previous section. We can also define the total variance with respect to the total average, and we now investigate the relationship between the variances. We assume that the level number is three, with levels $A_1$, $A_2$, and $A_3$. The total variance $S_{tot}^2$ is constituted by summing up the deviation of each data value with respect to the total average, and is given by

$$S_{tot}^2 = \frac{\sum_{i=1}^{n_{A_1}}\left(x_{iA_1}-\mu\right)^2+\sum_{i=1}^{n_{A_2}}\left(x_{iA_2}-\mu\right)^2+\sum_{i=1}^{n_{A_3}}\left(x_{iA_3}-\mu\right)^2}{n} \qquad (15)$$

Let us consider the first term of the numerator of Eq. (15), which is modified as
$$\begin{aligned}
\sum_{i=1}^{n_{A_1}}\left(x_{iA_1}-\mu\right)^2 &= \sum_{i=1}^{n_{A_1}}\left[\left(x_{iA_1}-\mu_{A_1}\right)+\left(\mu_{A_1}-\mu\right)\right]^2 \\
&= \sum_{i=1}^{n_{A_1}}\left(x_{iA_1}-\mu_{A_1}\right)^2+2\left(\mu_{A_1}-\mu\right)\sum_{i=1}^{n_{A_1}}\left(x_{iA_1}-\mu_{A_1}\right)+n_{A_1}\left(\mu_{A_1}-\mu\right)^2 \\
&= \sum_{i=1}^{n_{A_1}}\left(x_{iA_1}-\mu_{A_1}\right)^2+n_{A_1}\left(\mu_{A_1}-\mu\right)^2
\end{aligned} \qquad (16)$$

Similar analysis is performed for the other terms, and we obtain

$$S_{tot}^2 = \frac{\sum_{i=1}^{n_{A_1}}\left(x_{iA_1}-\mu_{A_1}\right)^2+\sum_{i=1}^{n_{A_2}}\left(x_{iA_2}-\mu_{A_2}\right)^2+\sum_{i=1}^{n_{A_3}}\left(x_{iA_3}-\mu_{A_3}\right)^2}{n}+\frac{n_{A_1}\left(\mu_{A_1}-\mu\right)^2+n_{A_2}\left(\mu_{A_2}-\mu\right)^2+n_{A_3}\left(\mu_{A_3}-\mu\right)^2}{n} = S_{in}^2+S_{ex}^2 \qquad (17)$$

Therefore, the total variance is the sum of the variance associated with the levels and the variance within each level. The correlation ratio is hence expressed by

$$\eta^2 = \frac{S_{ex}^2}{S_{in}^2+S_{ex}^2} = \frac{S_{ex}^2}{S_{tot}^2} \qquad (18)$$
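The decomposition of Eq. (17) can be verified numerically for the Table 1 data with a few lines of numpy:

```python
import numpy as np

groups = [np.array([5.44, 7.66, 7.55, 4.33, 5.33, 4.55, 6.33, 5.98]),
          np.array([7.33, 6.32, 6.76, 5.33, 5.89, 6.33, 5.44, 8.54]),
          np.array([9.44, 7.33, 7.66, 6.43, 7.45, 6.86, 9.78, 6.34])]
x = np.concatenate(groups)
n, mu = x.size, x.mean()
S_tot = ((x - mu) ** 2).sum() / n
S_ex = sum(g.size * (g.mean() - mu) ** 2 for g in groups) / n
S_in = sum(((g - g.mean()) ** 2).sum() for g in groups) / n
print(np.isclose(S_tot, S_ex + S_in))   # True: Eq. (17) holds
```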
5. MULTIPLE COMPARISONS
In the analysis of variance, we test whether all the level averages are the same or not. When we judge that they are not all the same, we want to know which ones differ. One way is to select one pair and test the difference as shown in the previous section, setting a prediction probability for each pair. However, the overall conclusion is then not controlled by the prediction probability we set, since the pairwise tests are not independent. We therefore want to test all pairs with a certain overall prediction probability. The Tukey-Kramer multiple comparison procedure is frequently used for this, and we introduce it here. We compare the averages of levels $i$ and $j$, take the absolute value of the difference $\left|\mu_{A_i}-\mu_{A_j}\right|$, and form a normalized value given by

$$z_{ij} = \frac{\left|\mu_{A_i}-\mu_{A_j}\right|}{\sqrt{\dfrac{s_{in}^2}{2}\left(\dfrac{1}{n_{A_i}}+\dfrac{1}{n_{A_j}}\right)}} \qquad (19)$$
We compare the normalized value $z_{ij}$ to the studentized range distribution critical value $q\left(p, n-p, P\right)$, where $p$ is the level number and $n$ is the total sample number. The tables for $q\left(p, n-p, P\right)$ with $P = 0.95$ and $P = 0.99$ are shown in Table 2. The other simple way to evaluate the difference is the one with

$$z_{i} = \frac{\left|\mu_{A_i}-\mu\right|}{\sqrt{\dfrac{s_{in}^2}{2}\left(\dfrac{1}{n_{A_i}}+\dfrac{1}{n}\right)}} \qquad (20)$$
We may be able to compare this normalized value to $q\left(p, n-p, P\right)$. We need to interpolate the value of $q\left(p, k, P\right)$ from the table; especially, the interpolation associated with the denominator freedom is important. For simplicity, we set

$$k = n-p \qquad (21)$$

The row number before the final one is 120 in the table. If $k$ is less than 120, we can easily interpolate the value of $q\left(p, k, P\right)$ as

$$q\left(p, k, P\right) = q\left(p, i, P\right)+\frac{k-i}{j-i}\left[q\left(p, j, P\right)-q\left(p, i, P\right)\right] \qquad (22)$$

where $i, j$ are the row numbers on both sides of $k$, that is,

$$i \leq k \leq j \qquad (23)$$

The difficulty exists in that the final row number is infinity and the number before the final one is 120 in general. Therefore, we need to study how to interpolate when $k > 120$.
We introduce a variable as

$$\theta = \frac{\ln\left(1+\dfrac{k-120}{120}\right)}{1+\ln\left(1+\dfrac{k-120}{120}\right)} \qquad (24)$$
Table 2. Studentized range distribution table for prediction probabilities of 0.95 and 0.99
(The table lists the critical values $q\left(p, k, P\right)$ for numerator levels $p = 2$ to $17$ and denominator freedoms $k = 1$ to $20, 24, 30, 40, 60, 120$, and $\infty$, for one-sided $P = 0.95$ and $P = 0.99$. For example, $q\left(3, 120, 0.95\right) = 3.36$ and $q\left(3, \infty, 0.95\right) = 3.31$.)
$\theta$ is related to the value of $q\left(p, k, P\right)$ as

$$\theta = \frac{q\left(p, k, P\right)-q\left(p, 120, P\right)}{q\left(p, \infty, P\right)-q\left(p, 120, P\right)} \qquad (25)$$

We can evaluate $\lambda$ from the first relationship as

$$\lambda = \frac{k-120}{120} \qquad (26)$$

We can then obtain

$$\theta = \frac{\ln\left(1+\lambda\right)}{1+\ln\left(1+\lambda\right)} \qquad (27)$$

Solving this equation with respect to $\lambda$, we obtain

$$\lambda = \exp\left(\frac{\theta}{1-\theta}\right)-1 \qquad (28)$$

Comparing Eq. (28) with Eq. (25), we obtain an interpolated value as

$$q\left(p, k, P\right) = q\left(p, 120, P\right)+\frac{\ln\left(1+\dfrac{k-120}{120}\right)}{1+\ln\left(1+\dfrac{k-120}{120}\right)}\left[q\left(p, \infty, P\right)-q\left(p, 120, P\right)\right] \qquad (29)$$
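A sketch of the resulting helper is given below. Since the derivation above is one possible reading of the original, the mapping for $k > 120$ should be treated as an assumption; the example values for $p = 3$, $P = 0.95$ ($3.36$ at $k = 120$ and $3.31$ at $k = \infty$) are read from Table 2.

```python
import numpy as np

def q_extrapolate(k, q120, qinf):
    """Blend the k = 120 and k = infinity table rows with a logarithmic weight."""
    lam = (k - 120.0) / 120.0                       # Eq. (26)
    theta = np.log1p(lam) / (1.0 + np.log1p(lam))   # Eq. (27): 0 at k = 120, -> 1 as k grows
    return q120 + theta * (qinf - q120)             # Eq. (29)

print(q_extrapolate(120, 3.36, 3.31))   # 3.36 at the table edge
print(q_extrapolate(240, 3.36, 3.31))   # between the two rows
print(q_extrapolate(1e9, 3.36, 3.31))   # approaches 3.31
```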
6. ANALYSIS OF VARIANCE (TWO WAY ANALYSIS WITH NON-REPEATED DATA)
We consider two factors $A$ and $B$. For example, factor $A$ corresponds to the kinds of fertilizer, and factor $B$ corresponds to the temperature. We assume that factor $A$ has four levels $A_1, A_2, A_3, A_4$, the number of which is denoted as $n_A$, and that factor $B$ has three levels $B_1, B_2, B_3$, denoted as $n_B$, where $n_A = 4$ and $n_B = 3$. We assume that there is only one datum for each level combination, that is, non-repeated data as shown in Table 3. The total data number $n$ is given by

$$n = n_A n_B \qquad (30)$$

In this case, each data value $x_{ij}$ can be expressed by the deviations from the average as

$$x_{ij} = \mu+\Delta\mu_{A_j}+\Delta\mu_{B_i}+e_{ij} \qquad (31)$$

This can be modified as

$$x_{ij}-\mu = \Delta\mu_{A_j}+\Delta\mu_{B_i}+e_{ij} \qquad (32)$$

The corresponding freedom relationship is given by

$$\nu_{tot} = \nu_A+\nu_B+\nu_e \qquad (33)$$

where

$$\nu_{tot} = n-1 = 12-1 = 11 \qquad (34)$$

$$\nu_A = n_A-1 = 4-1 = 3 \qquad (35)$$

$$\nu_B = n_B-1 = 3-1 = 2 \qquad (36)$$

We can then obtain the freedom associated with the error as

$$\nu_e = \nu_{tot}-\nu_A-\nu_B = \left(n-1\right)-\left(n_A-1\right)-\left(n_B-1\right) = n-n_A-n_B+1 = 12-4-3+1 = 6 \qquad (37)$$

The discussion is almost the same as that for the one-way analysis.
We evaluate the total average $\mu$, the averages associated with factor $A$ denoted as $\mu_{A_j}$ ($j = 1, 2, 3, 4$), and those associated with factor $B$ denoted as $\mu_{B_i}$ ($i = 1, 2, 3$), which are shown in Table 3.

Table 3. Two-factor cross data table with no repeat

              A1      A2      A3      A4   |   μBi    ΔμBi
B1           7.12    8.33    9.12   11.21  |   8.95   -1.20
B2           8.63    8.15   10.98   15.43  |  10.80    0.65
B3           7.99    8.55   11.79   14.45  |  10.70    0.55
μAj          7.91    8.34   10.63   13.70  |  Total average μ = 10.15
ΔμAj        -2.23   -1.80    0.48    3.55  |
We consider the effect of factor $A$, which has 4 levels. We can evaluate the associated variance as

$$S_A^2 = \frac{n_B\sum_{j=1}^{n_A}\Delta\mu_{A_j}^2}{n} = \frac{1}{n_A}\sum_{j=1}^{n_A}\Delta\mu_{A_j}^2 = \frac{1}{4}\left[\left(-2.23\right)^2+\left(-1.80\right)^2+0.48^2+3.55^2\right] = 5.26 \qquad (38)$$

The freedom is the level number minus one, $4-1 = 3$. Therefore, the unbiased variance is given by

$$s_A^2 = \frac{n}{\nu_A}S_A^2 = \frac{12}{3}\times 5.26 = 21.08 \qquad (39)$$

We next evaluate the effect of factor $B$, which has three levels. We evaluate the deviation of each level average with respect to the total average, given by

$$\Delta\mu_{B_i} = \mu_{B_i}-\mu \qquad (40)$$

The corresponding variance can be evaluated as

$$S_B^2 = \frac{n_A\sum_{i=1}^{n_B}\Delta\mu_{B_i}^2}{n} = \frac{1}{n_B}\sum_{i=1}^{n_B}\Delta\mu_{B_i}^2 = \frac{1}{3}\left[\left(-1.20\right)^2+0.65^2+0.55^2\right] = 0.72 \qquad (41)$$

The unbiased variance $s_B^2$ is then given by

$$s_B^2 = \frac{n}{\nu_B}S_B^2 = \frac{12}{2}\times 0.72 = 4.34 \qquad (42)$$
The error is given by

$$e_{ij} = \left(x_{ij}-\mu\right)-\Delta\mu_{A_j}-\Delta\mu_{B_i} \qquad (43)$$

which is shown in Table 4.

Table 4. Error data

              A1      A2      A3      A4
B1           0.41    1.19   -0.31   -1.29
B2           0.06   -0.85   -0.30    1.08
B3          -0.47   -0.34    0.61    0.20

The corresponding variance $S_e^2$ is given by

$$S_e^2 = \frac{1}{n}\sum_{i=1}^{3}\sum_{j=1}^{4}e_{ij}^2 = 0.50 \qquad (44)$$

The freedom $\nu_e$ is given by

$$\nu_e = \nu_{tot}-\nu_A-\nu_B = \left(n-1\right)-\left(n_A-1\right)-\left(n_B-1\right) = n-n_A-n_B+1 = 12-4-3+1 = 6 \qquad (45)$$

Therefore, the unbiased variance is given by

$$s_e^2 = \frac{n}{\nu_e}S_e^2 = \frac{12}{6}\times 0.50 = 1.01 \qquad (46)$$

The $F$ value associated with factor $A$ is given by

$$F_A = \frac{s_A^2}{s_e^2} = \frac{21.08}{1.01} = 20.87 \qquad (47)$$

The critical value $F_{Ac}$ is given by

$$F_{Ac}\left(3, 6, 0.95\right) = 4.76 \qquad (48)$$

Therefore, $F_A$ is larger than $F_{Ac}$, and hence there is a dependence on factor $A$. The $F$ value associated with factor $B$ is given by

$$F_B = \frac{s_B^2}{s_e^2} = \frac{4.34}{1.01} = 4.29 \qquad (49)$$

The critical value $F_{Bc}$ is given by

$$F_{Bc}\left(2, 6, 0.95\right) = 5.14 \qquad (50)$$

Therefore, $F_B$ is smaller than $F_{Bc}$, and hence there is no clear dependence on factor $B$.
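The whole computation of Eqs. (38)-(50) fits in a short script; the array holds Table 3 with rows B1-B3 and columns A1-A4 (the variable names are ours):

```python
import numpy as np
from scipy import stats

x = np.array([[7.12, 8.33,  9.12, 11.21],
              [8.63, 8.15, 10.98, 15.43],
              [7.99, 8.55, 11.79, 14.45]])
nB, nA = x.shape
n = x.size
mu = x.mean()
dA = x.mean(axis=0) - mu                      # about [-2.23, -1.80, 0.48, 3.55]
dB = x.mean(axis=1) - mu                      # about [-1.20, 0.65, 0.55]
e = x - mu - dA[None, :] - dB[:, None]        # residuals of Table 4

s2_A = nB * (dA ** 2).sum() / (nA - 1)        # unbiased, about 21.08
s2_B = nA * (dB ** 2).sum() / (nB - 1)        # about 4.34
s2_e = (e ** 2).sum() / ((nA - 1) * (nB - 1)) # about 1.01
FA, FB = s2_A / s2_e, s2_B / s2_e             # about 20.9 and 4.3
print(FA > stats.f.ppf(0.95, nA - 1, (nA - 1) * (nB - 1)))  # True: A matters
print(FB > stats.f.ppf(0.95, nB - 1, (nA - 1) * (nB - 1)))  # False: B not established
```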
7. ANALYSIS OF VARIANCE (TWO WAY ANALYSIS WITH REPEATED DATA)
We consider two factors $A$ and $B$; factor $A$ has four levels $A_1, A_2, A_3, A_4$ denoted as $n_A$, and factor $B$ has three levels $B_1, B_2, B_3$ denoted as $n_B$, where $n_A = 4$ and $n_B = 3$. In the previous section, we assumed one set of data, that is, only one datum for each level combination $A_j B_i$. We now assume two sets of data, that is, repeated data as shown in Table 5. The two data for $A_j B_i$ are denoted as $x_{ij\_s}$ with $s = 1, 2$. The total data number $n$ is given by

$$n = n_A n_B n_s = 3\times 4\times 2 = 24 \qquad (51)$$

The total average is evaluated as

$$\mu = \frac{1}{n}\sum_{s=1}^{n_s}\sum_{j=1}^{n_A}\sum_{i=1}^{n_B}x_{ij\_s} = 10.07 \qquad (52)$$

We can then evaluate each data deviation $\Delta x_{ij\_s}$ from the total average as

$$\Delta x_{ij\_s} = x_{ij\_s}-\mu \qquad (53)$$

which is shown in Table 6.
Table 5. Two-factor cross data table with repeated data 1 and 2

Data 1
              A1      A2      A3      A4
B1           7.55    8.98    8.55   11.02
B2           8.12    6.96   11.89   11.89
B3           8.97   10.95   12.78   16.55

Data 2
              A1      A2      A3      A4
B1           6.11    8.78    8.44    9.89
B2           9.01    9.65    9.86   13.22
B3           8.79    9.21   12.01   12.47

Total mean μ = 10.07

Table 6. Deviation of each data value with respect to the total average

Data 1 deviation
              A1      A2      A3      A4
B1          -2.52   -1.09   -1.52    0.95
B2          -1.95   -3.11    1.82    1.82
B3          -1.10    0.88    2.71    6.48

Data 2 deviation
              A1      A2      A3      A4
B1          -3.96   -1.29   -1.63   -0.18
B2          -1.06   -0.42   -0.21    3.15
B3          -1.28   -0.86    1.94    2.40
We form a table of the averages of data 1 and 2, given by

$$\bar{x}_{ij} = \frac{1}{2}\left(x_{ij\_1}+x_{ij\_2}\right) \qquad (54)$$

which is shown in Table 7. We regard this as two sets of the same data.

Table 7. Average of data 1 and 2

              A1      A2      A3      A4   |   μBi    ΔμBi
B1           6.83    8.88    8.50   10.46  |   8.67   -1.40
B2           8.57    8.31   10.88   12.56  |  10.08    0.01
B3           8.88   10.08   12.40   14.51  |  11.47    1.40
μAj          8.09    9.09   10.59   12.51  |
ΔμAj        -1.98   -0.98    0.52    2.44  |

The averages are evaluated as

$$\mu_{A_j} = \frac{1}{n_B}\sum_{i=1}^{n_B}\bar{x}_{ij} \qquad (55)$$

$$\mu_{B_i} = \frac{1}{n_A}\sum_{j=1}^{n_A}\bar{x}_{ij} \qquad (56)$$
and the average deviations can be evaluated as

$$\Delta\mu_{A_j} = \mu_{A_j}-\mu \qquad (57)$$

$$\Delta\mu_{B_i} = \mu_{B_i}-\mu \qquad (58)$$

Let us consider the effectiveness of factor $A$. The related variance $S_{Aex}^2$ is given by

$$S_{Aex}^2 = \frac{n_s n_B\sum_{j=1}^{n_A}\Delta\mu_{A_j}^2}{n} = \frac{1}{n_A}\sum_{j=1}^{n_A}\Delta\mu_{A_j}^2 = \frac{\left(-1.98\right)^2+\left(-0.98\right)^2+0.52^2+2.44^2}{4} = 2.77 \qquad (59)$$

The related freedom $\nu_{Aex}$ is the level number 4 minus 1:

$$\nu_{Aex} = n_A-1 = 4-1 = 3 \qquad (60)$$

Therefore, the corresponding unbiased variance $s_{Aex}^2$ is given by

$$s_{Aex}^2 = \frac{n}{\nu_{Aex}}S_{Aex}^2 = \frac{24}{3}\times 2.77 = 22.17 \qquad (61)$$

Let us consider the effectiveness of factor $B$. The related variance $S_{Bex}^2$ is given by

$$S_{Bex}^2 = \frac{n_s n_A\sum_{i=1}^{n_B}\Delta\mu_{B_i}^2}{n} = \frac{1}{n_B}\sum_{i=1}^{n_B}\Delta\mu_{B_i}^2 = \frac{\left(-1.40\right)^2+0.01^2+1.40^2}{3} = 1.31 \qquad (62)$$

The related freedom $\nu_{Bex}$ is the level number 3 minus 1:

$$\nu_{Bex} = n_B-1 = 3-1 = 2 \qquad (63)$$

Therefore, the corresponding unbiased variance $s_{Bex}^2$ is given by

$$s_{Bex}^2 = \frac{n}{\nu_{Bex}}S_{Bex}^2 = \frac{24}{2}\times 1.31 = 15.69 \qquad (64)$$
Table 8. Deviation of the data with respect to the average for each level

Data 1
              A1      A2      A3      A4
B1           0.72    0.10    0.05    0.57
B2          -0.45   -1.35    1.02   -0.66
B3           0.09    0.87    0.39    2.04

Data 2
              A1      A2      A3      A4
B1          -0.72   -0.10   -0.06   -0.57
B2           0.45    1.35   -1.02    0.67
B3          -0.09   -0.87   -0.39   -2.04
We obtain two sets of data, and the two data for the same level combination $A_j B_i$ are different in general. This difference is called the pure error and is given by

$$e_{pure\_ij\_s} = x_{ij\_s}-\bar{x}_{ij} \qquad (65)$$

It should be noted that

$$e_{pure\_ij\_2} = -e_{pure\_ij\_1} \qquad (66)$$

which is shown in Table 8. We can evaluate the variance associated with this difference, $S_{e\_pure}^2$, as

$$S_{e\_pure}^2 = \frac{2}{n}\sum_{i=1}^{n_B}\sum_{j=1}^{n_A}e_{pure\_ij\_1}^2 = \frac{2}{24}\left[0.72^2+0.10^2+0.05^2+0.57^2+\left(-0.45\right)^2+\left(-1.35\right)^2+1.02^2+\left(-0.66\right)^2+0.09^2+0.87^2+0.39^2+2.04^2\right] = 0.78 \qquad (67)$$

Let us consider the corresponding freedom $\nu_{e\_pure}$. The data number is given by

$$n = n_A n_B\times 2 \qquad (68)$$

We use averages of $n_A n_B$ kinds. Therefore, we have a freedom of

$$\nu_{e\_pure} = n_A n_B\times 2-n_A n_B = n_A n_B\left(2-1\right) = 4\times 3 = 12 \qquad (69)$$

Therefore, the corresponding unbiased variance $s_{e\_pure}^2$ is given by

$$s_{e\_pure}^2 = \frac{n}{\nu_{e\_pure}}S_{e\_pure}^2 = \frac{24}{12}\times 0.78 = 1.57 \qquad (70)$$

The deviation of each data value from the total average and the factor effects, $e_{ij\_s}$, is given by

$$e_{ij\_s} = \left(x_{ij\_s}-\mu\right)-\Delta\mu_{A_j}-\Delta\mu_{B_i} \qquad (71)$$

What is the difference between this deviation and the pure error? If the effects of factors $A$ and $B$ are independent, this error is equal to the pure error. Therefore, the difference is related to the interaction between the two factors. It is called the interaction and is given by

$$e_{interact\_ij\_s} = e_{ij\_s}-e_{pure\_ij\_s} = \left(x_{ij\_s}-\mu\right)-\Delta\mu_{A_j}-\Delta\mu_{B_i}-\left(x_{ij\_s}-\bar{x}_{ij}\right) = \bar{x}_{ij}-\mu-\Delta\mu_{A_j}-\Delta\mu_{B_i} \qquad (72)$$

Note that this does not depend on $s$. It is evaluated as shown in Table 9. The corresponding variance $S_{interact}^2$ is given by

$$S_{interact}^2 = \frac{2\left[0.14^2+1.20^2+\cdots+0.41^2+0.61^2\right]}{24} = 0.36 \qquad (73)$$

Let us consider the related freedom $\nu_{interact}$. The freedoms satisfy

$$\nu_{tot} = \nu_{Aex}+\nu_{Bex}+\nu_{interact}+\nu_{e\_pure} \qquad (74)$$

Therefore, we obtain

$$\nu_{interact} = \nu_{tot}-\nu_{Aex}-\nu_{Bex}-\nu_{e\_pure} = \left(n-1\right)-\left(n_A-1\right)-\left(n_B-1\right)-n_A n_B\left(n_s-1\right) = 23-3-2-12 = 6 \qquad (75)$$

The corresponding unbiased variance $s_{interact}^2$ is given by

$$s_{interact}^2 = \frac{n}{\nu_{interact}}S_{interact}^2 = \frac{24}{6}\times 0.36 = 1.45 \qquad (76)$$
Table 9. Interactions (by Eq. (72), the values are common to data 1 and data 2)

              A1      A2      A3      A4
B1           0.14    1.20   -0.69   -0.65
B2           0.47   -0.79    0.28    0.04
B3          -0.61   -0.41    0.41    0.61
The effectiveness of each factor can be evaluated with the corresponding $F$ values. The effectiveness of factor $A$ is evaluated as

$$F_A = \frac{s_{Aex}^2}{s_{e\_pure}^2} = 14.14 \qquad (77)$$

The critical $F$ value $F_{Ac}$ is given by

$$F_{Ac} = F\left(\nu_{Aex}, \nu_{e\_pure}, P\right) = F\left(3, 12, 0.95\right) = 3.49 \qquad (78)$$

Therefore, factor $A$ is effective. The effectiveness of factor $B$ is evaluated as

$$F_B = \frac{s_{Bex}^2}{s_{e\_pure}^2} = 10.01 \qquad (79)$$

The critical $F$ value $F_{Bc}$ is given by

$$F_{Bc} = F\left(\nu_{Bex}, \nu_{e\_pure}, P\right) = F\left(2, 12, 0.95\right) = 3.89 \qquad (80)$$

Therefore, factor $B$ is effective. The effectiveness of the interaction is evaluated as

$$F_{interact} = \frac{s_{interact}^2}{s_{e\_pure}^2} = 0.92 \qquad (81)$$

The critical $F$ value $F_{interactc}$ is given by

$$F_{interactc} = F\left(\nu_{interact}, \nu_{e\_pure}, P\right) = F\left(6, 12, 0.95\right) = 3.00 \qquad (82)$$

Therefore, the interaction is not effective.
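A compact sketch reproducing Eqs. (59)-(82) from the Table 5 data (numpy and scipy assumed; variable names are ours):

```python
import numpy as np
from scipy import stats

x1 = np.array([[7.55,  8.98,  8.55, 11.02],
               [8.12,  6.96, 11.89, 11.89],
               [8.97, 10.95, 12.78, 16.55]])
x2 = np.array([[6.11,  8.78,  8.44,  9.89],
               [9.01,  9.65,  9.86, 13.22],
               [8.79,  9.21, 12.01, 12.47]])
x = np.stack([x1, x2])                     # shape (ns, nB, nA)
ns, nB, nA = x.shape
mu = x.mean()
cell = x.mean(axis=0)                      # Table 7 averages
dA = cell.mean(axis=0) - mu
dB = cell.mean(axis=1) - mu

s2_A = ns * nB * (dA ** 2).sum() / (nA - 1)                # about 22.2
s2_B = ns * nA * (dB ** 2).sum() / (nB - 1)                # about 15.7
s2_pure = ((x - cell) ** 2).sum() / (nA * nB * (ns - 1))   # about 1.57
inter = cell - mu - dA[None, :] - dB[:, None]              # Table 9 interactions
s2_int = ns * (inter ** 2).sum() / ((nA - 1) * (nB - 1))   # about 1.45

nu_e = nA * nB * (ns - 1)
print(s2_A / s2_pure, stats.f.ppf(0.95, nA - 1, nu_e))               # ~14.1 vs 3.49
print(s2_B / s2_pure, stats.f.ppf(0.95, nB - 1, nu_e))               # ~10.0 vs 3.89
print(s2_int / s2_pure, stats.f.ppf(0.95, (nA - 1) * (nB - 1), nu_e))  # ~0.9 vs 3.00
```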
SUMMARY
This chapter is summarized in the following.

One Way Analysis
The effectiveness of the level is evaluated with $S_{ex}^2$ given by

$$S_{ex}^2 = \frac{n_{A_1}\left(\mu_{A_1}-\mu\right)^2+n_{A_2}\left(\mu_{A_2}-\mu\right)^2+n_{A_3}\left(\mu_{A_3}-\mu\right)^2}{n_{A_1}+n_{A_2}+n_{A_3}}$$

The scattering of the data is expressed with $S_{in}^2$, given by

$$S_{in}^2 = \frac{\sum_{i=1}^{n_{A_1}}\left(x_{iA_1}-\mu_{A_1}\right)^2+\sum_{i=1}^{n_{A_2}}\left(x_{iA_2}-\mu_{A_2}\right)^2+\sum_{i=1}^{n_{A_3}}\left(x_{iA_3}-\mu_{A_3}\right)^2}{n_{A_1}+n_{A_2}+n_{A_3}}$$

The correlation ratio $\eta^2$ is given by

$$\eta^2 = \frac{S_{ex}^2}{S_{in}^2+S_{ex}^2}$$

This is between 0 and 1, and the effectiveness of the factor can be regarded as more significant for larger $\eta^2$. We form the unbiased variances as

$$s_{ex}^2 = \frac{n}{\nu_{ex}}S_{ex}^2,\qquad \nu_{ex} = p-1$$

where $p$ is the level number, and

$$s_{in}^2 = \frac{n}{\nu_{in}}S_{in}^2,\qquad \nu_{in} = n-p$$

Finally, the ratio of the unbiased variances is denoted by $F$ and is given by

$$F = \frac{s_{ex}^2}{s_{in}^2}$$

This follows the $F$ distribution with freedoms $\left(\nu_{ex}, \nu_{in}\right)$. The $P$ point for the $F$ distribution is denoted as $F_C\left(\nu_{ex}, \nu_{in}, P\right)$. If $F > F_C$, we regard the factor as valid, and vice versa. The effectiveness between two levels can be evaluated with

$$z_{ij} = \frac{\left|\mu_{A_i}-\mu_{A_j}\right|}{\sqrt{\dfrac{s_{in}^2}{2}\left(\dfrac{1}{n_{A_i}}+\dfrac{1}{n_{A_j}}\right)}}$$

If this value is larger than the studentized range distribution table value $q\left(p, n-p, P\right)$, we judge that the difference is effective. The other simple way to evaluate the difference is the one with

$$z_{i} = \frac{\left|\mu_{A_i}-\mu\right|}{\sqrt{\dfrac{s_{in}^2}{2}\left(\dfrac{1}{n_{A_i}}+\dfrac{1}{n}\right)}}$$
We may be able to compare the absolute value of this with $z_P$ for a normal distribution.

Two Way Analysis without Repeated Data
We consider two factors $A$: $A_1, A_2, A_3, A_4$ and $B$: $B_1, B_2, B_3$, where $n_A = 4$ and $n_B = 3$. The total data number $n$ is given by

$$n = n_A n_B$$

In this case, each data value $x_{ij}$ can be expressed by the deviations from the average as

$$x_{ij} = \mu+\Delta\mu_{A_j}+\Delta\mu_{B_i}+e_{ij}$$

The various variances are given by

$$S_{A}^2 = \frac{1}{n_A}\sum_{j=1}^{n_A}\Delta\mu_{A_j}^2$$

$$S_{B}^2 = \frac{1}{n_B}\sum_{i=1}^{n_B}\Delta\mu_{B_i}^2$$

$$S_e^2 = \frac{1}{n}\sum_{i=1}^{n_B}\sum_{j=1}^{n_A}e_{ij}^2$$

The various freedoms are given by

$$\nu_{tot} = n-1,\qquad \nu_A = n_A-1,\qquad \nu_B = n_B-1$$

The freedom associated with the error is given by

$$\nu_e = \nu_{tot}-\nu_A-\nu_B = n-n_A-n_B+1$$

Therefore, the unbiased variances are given by

$$s_A^2 = \frac{n}{\nu_A}S_A^2,\qquad s_B^2 = \frac{n}{\nu_B}S_B^2,\qquad s_e^2 = \frac{n}{\nu_e}S_e^2$$

The $F$ value associated with factor $A$ is given by

$$F_A = \frac{s_A^2}{s_e^2}$$

This is compared with the $F$ critical value $F_{Ac}\left(\nu_A, \nu_e, P\right)$. Similarly,

$$F_B = \frac{s_B^2}{s_e^2}$$

is compared with the $F$ critical value $F_{Bc}\left(\nu_B, \nu_e, P\right)$.
Two Way Analysis with Repeated Data
We consider two factors $A$: $A_1, A_2, \ldots, A_{n_A}$ and $B$: $B_1, B_2, \ldots, B_{n_B}$, and we have $n_s$ sets. The total data number $n$ is given by

$$n = n_A n_B n_s$$

The total average is given by

$$\mu = \frac{1}{n}\sum_{s=1}^{n_s}\sum_{j=1}^{n_A}\sum_{i=1}^{n_B}x_{ij\_s}$$

Each data deviation $\Delta x_{ij\_s}$ from the total average is given by

$$\Delta x_{ij\_s} = x_{ij\_s}-\mu$$

The average data value for each level combination is given by

$$\bar{x}_{ij} = \frac{1}{n_s}\sum_{s=1}^{n_s}x_{ij\_s}$$

The averages are evaluated as

$$\mu_{A_j} = \frac{1}{n_B}\sum_{i=1}^{n_B}\bar{x}_{ij},\qquad \mu_{B_i} = \frac{1}{n_A}\sum_{j=1}^{n_A}\bar{x}_{ij}$$

and the average deviations as

$$\Delta\mu_{A_j} = \mu_{A_j}-\mu,\qquad \Delta\mu_{B_i} = \mu_{B_i}-\mu$$

The variances and freedoms associated with the factors are

$$S_{Aex}^2 = \frac{1}{n_A}\sum_{j=1}^{n_A}\Delta\mu_{A_j}^2,\qquad \nu_{Aex} = n_A-1,\qquad s_{Aex}^2 = \frac{n}{\nu_{Aex}}S_{Aex}^2$$

$$S_{Bex}^2 = \frac{1}{n_B}\sum_{i=1}^{n_B}\Delta\mu_{B_i}^2,\qquad \nu_{Bex} = n_B-1,\qquad s_{Bex}^2 = \frac{n}{\nu_{Bex}}S_{Bex}^2$$

The pure error is given by

$$e_{pure\_ij\_s} = x_{ij\_s}-\bar{x}_{ij},\qquad S_{e\_pure}^2 = \frac{1}{n_A n_B n_s}\sum_{i=1}^{n_B}\sum_{j=1}^{n_A}\sum_{s=1}^{n_s}e_{pure\_ij\_s}^2$$

The deviation of each data value from the total average and the factor effects is given by

$$e_{ij\_s} = \left(x_{ij\_s}-\mu\right)-\Delta\mu_{A_j}-\Delta\mu_{B_i}$$

The difference associated with the interaction is given by

$$e_{interact\_ij\_s} = e_{ij\_s}-e_{pure\_ij\_s} = \bar{x}_{ij}-\mu-\Delta\mu_{A_j}-\Delta\mu_{B_i}$$

$$S_{interact}^2 = \frac{1}{n_A n_B n_s}\sum_{j=1}^{n_A}\sum_{i=1}^{n_B}\sum_{s=1}^{n_s}e_{interact\_ij\_s}^2 = \frac{1}{n_A n_B}\sum_{j=1}^{n_A}\sum_{i=1}^{n_B}\left(\bar{x}_{ij}-\mu-\Delta\mu_{A_j}-\Delta\mu_{B_i}\right)^2$$

$$\nu_{interact} = \nu_{tot}-\nu_{Aex}-\nu_{Bex}-\nu_{e\_pure} = \left(n-1\right)-\left(n_A-1\right)-\left(n_B-1\right)-n_A n_B\left(n_s-1\right)$$

$$s_{interact}^2 = \frac{n}{\nu_{interact}}S_{interact}^2$$

The effectiveness of factor $A$ is evaluated with

$$F_A = \frac{s_{Aex}^2}{s_{e\_pure}^2}$$

compared with the critical $F$ value $F_{Ac} = F\left(\nu_{Aex}, \nu_{e\_pure}, P\right)$; if $F_A > F_{Ac}$, factor $A$ is effective. The effectiveness of factor $B$ is evaluated with

$$F_B = \frac{s_{Bex}^2}{s_{e\_pure}^2}$$

compared with $F_{Bc} = F\left(\nu_{Bex}, \nu_{e\_pure}, P\right)$. The effectiveness of the interaction is evaluated with

$$F_{interact} = \frac{s_{interact}^2}{s_{e\_pure}^2}$$

compared with $F_{interactc} = F\left(\nu_{interact}, \nu_{e\_pure}, P\right)$.
Chapter 9
ANALYSIS OF VARIANCES USING AN ORTHOGONAL TABLE

ABSTRACT
We show a variance analysis using an orthogonal table, which enables us to treat more than two factors. Although the level number for each factor is limited to two, many factors can be assigned to the columns of the orthogonal table; we can also evaluate the interactions of targeted combinations, and easily obtain the effectiveness of each factor.
Keywords: analysis of variance, orthogonal table, interaction
1. INTRODUCTION
We studied variance analysis with two factors in the previous chapter. However, there are cases where we must treat more than two factors. For example, when we treat 4 factors with two levels each, we need at least $2^4 = 16$ data. A factor number $a$ with level number $m$ requires a minimum data number of $m^a$. The number of runs we should try to obtain a result increases significantly with increasing $a$ and $m$. Therefore, we need a procedure to treat such many factors, which is called experimental design using an orthogonal table. We focus on two levels with many factors in this chapter.
2. TWO FACTOR ANALYSIS
We start with two factors, where each level number is two. The factors are denoted as $A$ and $B$, and the levels for each factor are denoted as $A_1, A_2$ and $B_1, B_2$. The corresponding data are shown in Table 1. The data for given factor levels are denoted as $y_{ij}$, where $i, j = 1, 2$. This table is converted to a data list for given combinations of factors, shown in Table 2. The combination is also expressed with two columns as shown in Table 3. Finally, we convert the notation 1 to $-1$ and 2 to $1$, and obtain Table 4. We regard each column as a vector, and the dot product of the two vectors is given by

$$\left(-1\right)\times\left(-1\right)+\left(-1\right)\times 1+1\times\left(-1\right)+1\times 1 = 0 \qquad (1)$$

Therefore, the two vectors are orthogonal to each other, which is the reason why the table is called orthogonal.

Table 1. Factors, levels, and related data

         B1     B2
A1      y11    y12
A2      y21    y22

Table 2. Combination of factors, and related data

Combination   Data
A1B1          y11
A1B2          y12
A2B1          y21
A2B2          y22

Table 3. Two-column expression for factors, and related data

A    B    Data
1    1    y11
1    2    y12
2    1    y21
2    2    y22

Table 4. Two-column expression for factors, and related data

 A    B    Data
-1   -1    y11
-1    1    y12
 1   -1    y21
 1    1    y22
We assume that the data $y_{ij}$ are composed of many contributions: that of factor $A_i$ denoted as $\alpha_i$, that of factor $B_j$ denoted as $\beta_j$, the interaction between $\alpha_i$ and $\beta_j$ denoted as $\gamma_{ij}$, an independent constant number denoted as $\mu$, and an error denoted as $e_{ij}$. $y_{ij}$ is hence expressed by

$$y_{ij} = \mu+\alpha_i+\beta_j+\gamma_{ij}+e_{ij} \qquad (2)$$

The number of $y_{ij}$ is denoted as $n$; in this case it is given by

$$n = 2^2 = 4 \qquad (3)$$

We impose the restrictions on the variables as

$$\alpha_1+\alpha_2 = 0 \qquad (4)$$
$$\beta_1+\beta_2 = 0 \qquad (5)$$
$$\gamma_{11}+\gamma_{12} = 0 \qquad (6)$$
$$\gamma_{21}+\gamma_{22} = 0 \qquad (7)$$
$$\gamma_{11}+\gamma_{21} = 0 \qquad (8)$$
$$\gamma_{12}+\gamma_{22} = 0 \qquad (9)$$

The above treatment does not lose generality. Therefore, we decrease the number of variables as

$$\alpha \equiv \alpha_2 = -\alpha_1 \qquad (10)$$
$$\beta \equiv \beta_2 = -\beta_1 \qquad (11)$$
$$\gamma \equiv \gamma_{22} = \gamma_{11} = -\gamma_{12} = -\gamma_{21} \qquad (12)$$

Each data value is then given by

$$y_{11} = \mu-\alpha-\beta+\gamma+e_{11} \qquad (13)$$
$$y_{12} = \mu-\alpha+\beta-\gamma+e_{12} \qquad (14)$$
$$y_{21} = \mu+\alpha-\beta-\gamma+e_{21} \qquad (15)$$
$$y_{22} = \mu+\alpha+\beta+\gamma+e_{22} \qquad (16)$$
We should guess the values of $\mu$, $\alpha$, $\beta$, and $\gamma$ from the obtained data $y_{ij}$. The predicted values can be determined so that the error is minimized. The sum of squared errors $Q_e^2$ is evaluated as

$$Q_e^2 = e_{11}^2+e_{12}^2+e_{21}^2+e_{22}^2 = \left(y_{11}-\mu+\alpha+\beta-\gamma\right)^2+\left(y_{12}-\mu+\alpha-\beta+\gamma\right)^2+\left(y_{21}-\mu-\alpha+\beta+\gamma\right)^2+\left(y_{22}-\mu-\alpha-\beta-\gamma\right)^2 \qquad (17)$$

Differentiating Eq. (17) with respect to $\mu$ and setting it to 0, we obtain

$$\frac{\partial Q_e^2}{\partial\mu} = -2\left(y_{11}+y_{12}+y_{21}+y_{22}-4\hat{\mu}\right) = 0 \qquad (18)$$

We then obtain

$$\hat{\mu} = \frac{1}{n}\left(y_{11}+y_{12}+y_{21}+y_{22}\right) \qquad (19)$$

where we use a hat to express a predicted value; we utilize this notation in the following. Differentiating Eq. (17) with respect to $\alpha$ and setting it to 0, we obtain

$$\frac{\partial Q_e^2}{\partial\alpha} = 2\left(y_{11}+y_{12}-y_{21}-y_{22}+4\hat{\alpha}\right) = 0 \qquad (20)$$

We then obtain

$$\hat{\alpha} = \frac{1}{n}\left(-y_{11}-y_{12}+y_{21}+y_{22}\right) \qquad (21)$$

Differentiating Eq. (17) with respect to $\beta$ and setting it to 0, we obtain

$$\frac{\partial Q_e^2}{\partial\beta} = 2\left(y_{11}-y_{12}+y_{21}-y_{22}+4\hat{\beta}\right) = 0 \qquad (22)$$

We then obtain

$$\hat{\beta} = \frac{1}{n}\left(-y_{11}+y_{12}-y_{21}+y_{22}\right) \qquad (23)$$

Differentiating Eq. (17) with respect to $\gamma$ and setting it to 0, we obtain

$$\frac{\partial Q_e^2}{\partial\gamma} = -2\left(y_{11}-y_{12}-y_{21}+y_{22}-4\hat{\gamma}\right) = 0 \qquad (24)$$

We then obtain

$$\hat{\gamma} = \frac{1}{n}\left(y_{11}-y_{12}-y_{21}+y_{22}\right) \qquad (25)$$
We define the variables below:

$$Q_{A-} = y_{11}+y_{12} \qquad (26)$$
$$Q_{A+} = y_{21}+y_{22} \qquad (27)$$
$$Q_{B+} = y_{12}+y_{22} \qquad (28)$$
$$Q_{B-} = y_{11}+y_{21} \qquad (29)$$
$$Q_{\left(AB\right)+} = y_{11}+y_{22} \qquad (30)$$
$$Q_{\left(AB\right)-} = y_{12}+y_{21} \qquad (31)$$

The total average is given by

$$\hat{\mu} = \frac{1}{n}\left(y_{11}+y_{12}+y_{21}+y_{22}\right) \qquad (32)$$

This is also expressed by

$$\hat{\mu} = \frac{1}{n}\left(Q_{\lambda+}+Q_{\lambda-}\right) \qquad (33)$$

where $\lambda$ is a dummy variable and $\lambda = A, B, \left(AB\right)$. The averages associated with the two levels of each factor are expressed by

$$\mu_{\lambda-} = \frac{Q_{\lambda-}}{n/2} \qquad (34)$$
$$\mu_{\lambda+} = \frac{Q_{\lambda+}}{n/2} \qquad (35)$$

Let us consider the variance associated with factor $A$, denoted by $S_A^2$, which is evaluated as

$$S_A^2 = \frac{\dfrac{n}{2}\left(\mu_{A-}-\mu\right)^2+\dfrac{n}{2}\left(\mu_{A+}-\mu\right)^2}{n} \qquad (36)$$

This can be further reduced to

$$S_A^2 = \frac{1}{2}\left[\frac{2Q_{A-}}{n}-\frac{1}{n}\left(Q_{A+}+Q_{A-}\right)\right]^2+\frac{1}{2}\left[\frac{2Q_{A+}}{n}-\frac{1}{n}\left(Q_{A+}+Q_{A-}\right)\right]^2 = \frac{1}{n^2}\left(Q_{A+}-Q_{A-}\right)^2 \qquad (37)$$

The other variances are evaluated in a similar way as

$$S_B^2 = \frac{1}{n^2}\left(Q_{B+}-Q_{B-}\right)^2 \qquad (38)$$

$$S_{AB}^2 = \frac{1}{n^2}\left[Q_{\left(AB\right)+}-Q_{\left(AB\right)-}\right]^2 \qquad (40)$$

We can modify the tables related to the above equations.
First, we can add a column associated with $AB$ as in Table 5. The final table can be formed using these data as shown in Table 6.

Table 5. Two-column expression for factors adding interaction, and related data

 A    B    AB   Data
-1   -1     1   y11
-1    1    -1   y12
 1   -1    -1   y21
 1    1     1   y22

Table 6. Column expression for factors adding interaction, and related data

Factor      A                B                (AB)
Level       A-      A+       B-      B+       (AB)-    (AB)+
Data        y11     y21      y11     y12      y12      y11
            y12     y22      y21     y22      y21      y22
Sum         QA-     QA+      QB-     QB+      Q(AB)-   Q(AB)+
Variance    S(2)A            S(2)B            S(2)(AB)

The freedoms are all 1, that is,

$$\nu_A = \nu_B = \nu_{AB} = 1 \qquad (41)$$

and the unbiased variances equal $n$ times the variances. The unbiased variances are denoted with a small $s$ and are given by

$$s_A^2 = \frac{n}{\nu_A}S_A^2,\qquad s_B^2 = \frac{n}{\nu_B}S_B^2,\qquad s_{AB}^2 = \frac{n}{\nu_{AB}}S_{AB}^2 \qquad (42)$$

We cannot separate out a variance associated with the error. However, if the interaction can be ignored, that is, $\gamma = 0$, the related variance can be regarded as the one for the error. Therefore,

$$s_e^2 = s_{AB}^2 \qquad (43)$$

We can then test the effectiveness of the factors using the $F$ distribution as

$$F_A = \frac{s_A^2}{s_e^2} \qquad (44)$$

$$F_B = \frac{s_B^2}{s_e^2} \qquad (45)$$

The critical value for the $F$ distribution $F_{crit}$ is given by

$$F_{crit} = F\left(1, 1, 0.95\right) \qquad (46)$$
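A minimal sketch of the estimators of Eqs. (19), (21), (23), and (25), written against the sign columns of Table 4; the data values are made-up illustration numbers, not from the text:

```python
import numpy as np

y = np.array([4.1, 5.0, 5.6, 7.3])   # (y11, y12, y21, y22), illustration only
A  = np.array([-1, -1,  1,  1])      # factor A sign column
B  = np.array([-1,  1, -1,  1])      # factor B sign column
AB = A * B                           # interaction column of Table 5

mu_hat    = y.mean()                 # Eq. (19)
alpha_hat = (A * y).sum() / 4        # Eq. (21)
beta_hat  = (B * y).sum() / 4        # Eq. (23)
gamma_hat = (AB * y).sum() / 4       # Eq. (25)
print(mu_hat, alpha_hat, beta_hat, gamma_hat)
```

Note that each estimator is just the signed column sum divided by $n$, which is why the $Q_{\lambda\pm}$ sums of Eqs. (26)-(31) carry all the information needed for the variances.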
3. THREE FACTOR ANALYSIS WITHOUT REPETITION
We extend the analysis in the previous section to three factors $A$, $B$, and $C$. We assume that the data $y_{ijk}$ are expressed by the contribution of factor $A_i$ denoted as $\alpha_i$, of factor $B_j$ denoted as $\beta_j$, of factor $C_k$ denoted as $\gamma_k$, the interactions between the factors denoted as $\left(\alpha\beta\right)_{ij}$, $\left(\alpha\gamma\right)_{ik}$, $\left(\beta\gamma\right)_{jk}$, $\left(\alpha\beta\gamma\right)_{ijk}$, the independent constant number denoted as $\mu$, and the error denoted as $e_{ijk}$. The number of $y_{ijk}$ is denoted as $n$, and

$$n = 2^3 = 8 \qquad (47)$$

and the data are expressed as

$$y_{ijk} = \mu+\alpha_i+\beta_j+\gamma_k+\left(\alpha\beta\right)_{ij}+\left(\alpha\gamma\right)_{ik}+\left(\beta\gamma\right)_{jk}+\left(\alpha\beta\gamma\right)_{ijk}+e_{ijk} \qquad (48)$$

We impose restrictions on the variables as

$$\alpha_1+\alpha_2 = 0 \qquad (49)$$
$$\beta_1+\beta_2 = 0 \qquad (50)$$
$$\gamma_1+\gamma_2 = 0 \qquad (51)$$
$$\left(\alpha\beta\right)_{11}+\left(\alpha\beta\right)_{12} = 0 \qquad (52)$$
$$\left(\alpha\beta\right)_{21}+\left(\alpha\beta\right)_{22} = 0 \qquad (53)$$
$$\left(\alpha\beta\right)_{11}+\left(\alpha\beta\right)_{21} = 0 \qquad (54)$$
$$\left(\alpha\beta\right)_{12}+\left(\alpha\beta\right)_{22} = 0 \qquad (55)$$
$$\left(\alpha\gamma\right)_{11}+\left(\alpha\gamma\right)_{12} = 0 \qquad (56)$$
$$\left(\alpha\gamma\right)_{21}+\left(\alpha\gamma\right)_{22} = 0 \qquad (57)$$
$$\left(\alpha\gamma\right)_{11}+\left(\alpha\gamma\right)_{21} = 0 \qquad (58)$$
$$\left(\alpha\gamma\right)_{12}+\left(\alpha\gamma\right)_{22} = 0 \qquad (59)$$
$$\left(\beta\gamma\right)_{11}+\left(\beta\gamma\right)_{12} = 0 \qquad (60)$$
$$\left(\beta\gamma\right)_{21}+\left(\beta\gamma\right)_{22} = 0 \qquad (61)$$
$$\left(\beta\gamma\right)_{11}+\left(\beta\gamma\right)_{21} = 0 \qquad (62)$$
$$\left(\beta\gamma\right)_{12}+\left(\beta\gamma\right)_{22} = 0 \qquad (63)$$

These restrictions do not lose generality. We classify the subscripts as

$$s_1: i+j \text{ even},\qquad s_2: i+j \text{ odd} \qquad (64)$$

and impose

$$\left(\alpha\beta\gamma\right)_{s_1 1}+\left(\alpha\beta\gamma\right)_{s_1 2} = 0 \qquad (65)$$
$$\left(\alpha\beta\gamma\right)_{s_2 1}+\left(\alpha\beta\gamma\right)_{s_2 2} = 0 \qquad (66)$$
$$\left(\alpha\beta\gamma\right)_{s_1 1}+\left(\alpha\beta\gamma\right)_{s_2 1} = 0 \qquad (67)$$
$$\left(\alpha\beta\gamma\right)_{s_1 2}+\left(\alpha\beta\gamma\right)_{s_2 2} = 0 \qquad (68)$$

Eqs. (65)-(68) can be summarized as

$$\left(\alpha\beta\gamma\right)_{u_1}+\left(\alpha\beta\gamma\right)_{u_2} = 0 \qquad (69)$$

where

$$u_1: i+j+k \text{ even},\qquad u_2: i+j+k \text{ odd} \qquad (70)$$

Therefore, we decrease the number of variables as

$$\alpha \equiv \alpha_2 = -\alpha_1 \qquad (71)$$
$$\beta \equiv \beta_2 = -\beta_1 \qquad (72)$$
$$\gamma \equiv \gamma_2 = -\gamma_1 \qquad (73)$$
$$\left(\alpha\beta\right) \equiv \left(\alpha\beta\right)_{22} = \left(\alpha\beta\right)_{11} = -\left(\alpha\beta\right)_{12} = -\left(\alpha\beta\right)_{21} \qquad (74)$$
$$\left(\alpha\gamma\right) \equiv \left(\alpha\gamma\right)_{22} = \left(\alpha\gamma\right)_{11} = -\left(\alpha\gamma\right)_{12} = -\left(\alpha\gamma\right)_{21} \qquad (75)$$
$$\left(\beta\gamma\right) \equiv \left(\beta\gamma\right)_{22} = \left(\beta\gamma\right)_{11} = -\left(\beta\gamma\right)_{12} = -\left(\beta\gamma\right)_{21} \qquad (76)$$
$$\left(\alpha\beta\gamma\right) \equiv \left(\alpha\beta\gamma\right)_{u_2} = -\left(\alpha\beta\gamma\right)_{u_1} \qquad (77)$$
Each data value $y_{ijk}$ is then expressed by

$$y_{111} = \mu-\alpha-\beta-\gamma+\left(\alpha\beta\right)+\left(\alpha\gamma\right)+\left(\beta\gamma\right)-\left(\alpha\beta\gamma\right)+e_{111} \qquad (78)$$
$$y_{112} = \mu-\alpha-\beta+\gamma+\left(\alpha\beta\right)-\left(\alpha\gamma\right)-\left(\beta\gamma\right)+\left(\alpha\beta\gamma\right)+e_{112} \qquad (79)$$
$$y_{121} = \mu-\alpha+\beta-\gamma-\left(\alpha\beta\right)+\left(\alpha\gamma\right)-\left(\beta\gamma\right)+\left(\alpha\beta\gamma\right)+e_{121} \qquad (80)$$
$$y_{122} = \mu-\alpha+\beta+\gamma-\left(\alpha\beta\right)-\left(\alpha\gamma\right)+\left(\beta\gamma\right)-\left(\alpha\beta\gamma\right)+e_{122} \qquad (81)$$
$$y_{211} = \mu+\alpha-\beta-\gamma-\left(\alpha\beta\right)-\left(\alpha\gamma\right)+\left(\beta\gamma\right)+\left(\alpha\beta\gamma\right)+e_{211} \qquad (82)$$
$$y_{212} = \mu+\alpha-\beta+\gamma-\left(\alpha\beta\right)+\left(\alpha\gamma\right)-\left(\beta\gamma\right)-\left(\alpha\beta\gamma\right)+e_{212} \qquad (83)$$
$$y_{221} = \mu+\alpha+\beta-\gamma+\left(\alpha\beta\right)-\left(\alpha\gamma\right)-\left(\beta\gamma\right)-\left(\alpha\beta\gamma\right)+e_{221} \qquad (84)$$
$$y_{222} = \mu+\alpha+\beta+\gamma+\left(\alpha\beta\right)+\left(\alpha\gamma\right)+\left(\beta\gamma\right)+\left(\alpha\beta\gamma\right)+e_{222} \qquad (85)$$
Based on the above analysis, we can construct an orthogonal table, which is denoted as $L_8\left(2^7\right)$. This can accommodate cases with up to 6 factors. The 8 in the notation comes from $2^3$, where 3 comes from the number of characters $A, B, C$; the 2 comes from the level number; and the 7 comes from the column number. The columns are assigned to the characters as

$$A \rightarrow 2^0 = 1: \text{first column},\qquad B \rightarrow 2^1 = 2: \text{second column},\qquad C \rightarrow 2^2 = 4: \text{fourth column}$$

These are related to the direct effects of each factor. The other columns are related to $AB, AC, BC, ABC$ as below:

$$AB \rightarrow 1+2 = 3,\qquad AC \rightarrow 1+4 = 5,\qquad BC \rightarrow 2+4 = 6,\qquad ABC \rightarrow 1+2+4 = 7$$

We can generate the $y_{ijk}$ in order as shown in the table. The levels of columns 1 (A), 2 (B), and 4 (C) are directly related to the subscripts of $y_{ijk}$: the level is $-1$ when the subscript is 1 (odd) and $1$ when the subscript is 2 (even). The level of column 3 (AB) is determined from whether the sum $i+j$ is odd or even, and the other column levels are determined in the same way.
Table 7. Orthogonal table $L_8\left(2^7\right)$

Column No.        1    2    3    4    5    6    7
Factor            A    B    AB   C    AC   BC   ABC   Data   Value
Combination 1    -1   -1    1   -1    1    1   -1     y111   2.3
Combination 2    -1   -1    1    1   -1   -1    1     y112   3.4
Combination 3    -1    1   -1   -1    1   -1    1     y121   4.5
Combination 4    -1    1   -1    1   -1    1   -1     y122   5.6
Combination 5     1   -1   -1   -1   -1    1    1     y211   7.5
Combination 6     1   -1   -1    1    1   -1   -1     y212   8.9
Combination 7     1    1    1   -1   -1   -1   -1     y221   9.7
Combination 8     1    1    1    1    1    1    1     y222   8.9

Let us perform the dot product of any combination of columns, for example columns 1 and 2:

$$\left(-1\right)\left(-1\right)+\left(-1\right)\left(-1\right)+\left(-1\right)\left(1\right)+\left(-1\right)\left(1\right)+\left(1\right)\left(-1\right)+\left(1\right)\left(-1\right)+\left(1\right)\left(1\right)+\left(1\right)\left(1\right) = 0 \qquad (86)$$

We can verify this result for any combination of columns, which expresses the orthogonality of the table.
We should guess the values of $\mu$, $\alpha$, $\beta$, $\gamma$, $\left(\alpha\beta\right)$, $\left(\alpha\gamma\right)$, $\left(\beta\gamma\right)$, and $\left(\alpha\beta\gamma\right)$ from the obtained data $y_{ijk}$. The predicted values can be determined so that the error is minimized. The sum of squared errors $Q_e^2$ is evaluated as

$$Q_e^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\sum_{k=1}^{2}e_{ijk}^2 \qquad (87)$$

Differentiating Eq. (87) with respect to $\mu$ and setting it to 0, we obtain

$$\frac{\partial Q_e^2}{\partial\mu} = -2\left(y_{111}+y_{112}+y_{121}+y_{122}+y_{211}+y_{212}+y_{221}+y_{222}-8\hat{\mu}\right) = 0 \qquad (88)$$

We then obtain

$$\hat{\mu} = \frac{1}{n}\left(y_{111}+y_{112}+y_{121}+y_{122}+y_{211}+y_{212}+y_{221}+y_{222}\right) \qquad (89)$$

Differentiating Eq. (87) with respect to $\alpha$ and setting it to 0, we obtain

$$\frac{\partial Q_e^2}{\partial\alpha} = 2\left(y_{111}+y_{112}+y_{121}+y_{122}-y_{211}-y_{212}-y_{221}-y_{222}+8\hat{\alpha}\right) = 0 \qquad (90)$$

We then obtain

$$\hat{\alpha} = \frac{1}{n}\left(-y_{111}-y_{112}-y_{121}-y_{122}+y_{211}+y_{212}+y_{221}+y_{222}\right) \qquad (91)$$

Similarly, we obtain

$$\hat{\beta} = \frac{1}{n}\left(-y_{111}-y_{112}+y_{121}+y_{122}-y_{211}-y_{212}+y_{221}+y_{222}\right) \qquad (92)$$

$$\hat{\gamma} = \frac{1}{n}\left(-y_{111}+y_{112}-y_{121}+y_{122}-y_{211}+y_{212}-y_{221}+y_{222}\right) \qquad (93)$$

$$\widehat{\left(\alpha\beta\right)} = \frac{1}{n}\left(y_{111}+y_{112}-y_{121}-y_{122}-y_{211}-y_{212}+y_{221}+y_{222}\right) \qquad (94)$$

$$\widehat{\left(\alpha\gamma\right)} = \frac{1}{n}\left(y_{111}-y_{112}+y_{121}-y_{122}-y_{211}+y_{212}-y_{221}+y_{222}\right) \qquad (95)$$

$$\widehat{\left(\beta\gamma\right)} = \frac{1}{n}\left(y_{111}-y_{112}-y_{121}+y_{122}+y_{211}-y_{212}-y_{221}+y_{222}\right) \qquad (96)$$

$$\widehat{\left(\alpha\beta\gamma\right)} = \frac{1}{n}\left(-y_{111}+y_{112}+y_{121}-y_{122}+y_{211}-y_{212}-y_{221}+y_{222}\right) \qquad (97)$$
We define the variables below:

$$Q_{A+} = y_{211}+y_{212}+y_{221}+y_{222} \qquad (98)$$
$$Q_{A-} = y_{111}+y_{112}+y_{121}+y_{122} \qquad (99)$$
$$Q_{B+} = y_{121}+y_{122}+y_{221}+y_{222} \qquad (100)$$
$$Q_{B-} = y_{111}+y_{112}+y_{211}+y_{212} \qquad (101)$$
$$Q_{C+} = y_{112}+y_{122}+y_{212}+y_{222} \qquad (102)$$
$$Q_{C-} = y_{111}+y_{121}+y_{211}+y_{221} \qquad (103)$$
$$Q_{\left(AB\right)+} = y_{111}+y_{112}+y_{221}+y_{222} \qquad (104)$$
$$Q_{\left(AB\right)-} = y_{121}+y_{122}+y_{211}+y_{212} \qquad (105)$$
$$Q_{\left(AC\right)+} = y_{111}+y_{121}+y_{212}+y_{222} \qquad (106)$$
$$Q_{\left(AC\right)-} = y_{112}+y_{122}+y_{211}+y_{221} \qquad (107)$$
$$Q_{\left(BC\right)+} = y_{111}+y_{122}+y_{211}+y_{222} \qquad (108)$$
$$Q_{\left(BC\right)-} = y_{112}+y_{121}+y_{212}+y_{221} \qquad (109)$$
$$Q_{\left(ABC\right)+} = y_{112}+y_{121}+y_{211}+y_{222} \qquad (110)$$
$$Q_{\left(ABC\right)-} = y_{111}+y_{122}+y_{212}+y_{221} \qquad (111)$$
The total average is given by

$$\hat{\mu} = \frac{1}{n}\left(y_{111}+y_{112}+y_{121}+y_{122}+y_{211}+y_{212}+y_{221}+y_{222}\right) \qquad (112)$$

This is also expressed by

$$\hat{\mu} = \frac{1}{n}\left(Q_{\lambda+}+Q_{\lambda-}\right) \qquad (113)$$

where $\lambda$ is a dummy variable and $\lambda = A, B, C, \left(AB\right), \left(AC\right), \left(BC\right), \left(ABC\right)$. The averages associated with the two levels of each factor are expressed by

$$\mu_{\lambda-} = \frac{Q_{\lambda-}}{n/2} \qquad (114)$$
$$\mu_{\lambda+} = \frac{Q_{\lambda+}}{n/2} \qquad (115)$$

Let us consider the variance associated with factor $A$, denoted by $S_A^2$, which is evaluated as

$$S_A^2 = \frac{\dfrac{n}{2}\left(\mu_{A-}-\mu\right)^2+\dfrac{n}{2}\left(\mu_{A+}-\mu\right)^2}{n} \qquad (116)$$

This can be further reduced to

$$S_A^2 = \frac{1}{n^2}\left(Q_{A+}-Q_{A-}\right)^2 \qquad (117)$$

The other variances are evaluated in a similar way as

$$S_B^2 = \frac{1}{n^2}\left(Q_{B+}-Q_{B-}\right)^2 \qquad (118)$$
$$S_C^2 = \frac{1}{n^2}\left(Q_{C+}-Q_{C-}\right)^2 \qquad (119)$$
$$S_{AB}^2 = \frac{1}{n^2}\left[Q_{\left(AB\right)+}-Q_{\left(AB\right)-}\right]^2 \qquad (120)$$
$$S_{AC}^2 = \frac{1}{n^2}\left[Q_{\left(AC\right)+}-Q_{\left(AC\right)-}\right]^2 \qquad (121)$$
$$S_{BC}^2 = \frac{1}{n^2}\left[Q_{\left(BC\right)+}-Q_{\left(BC\right)-}\right]^2 \qquad (122)$$
$$S_{ABC}^2 = \frac{1}{n^2}\left[Q_{\left(ABC\right)+}-Q_{\left(ABC\right)-}\right]^2 \qquad (123)$$
The freedom of each variance is 1, and the unbiased variances are given by

$$s_A^2 = \frac{n}{\nu_A}S_A^2 = nS_A^2,\qquad s_B^2 = nS_B^2,\qquad s_C^2 = nS_C^2 \qquad (124)$$
$$s_{AB}^2 = nS_{AB}^2,\qquad s_{AC}^2 = nS_{AC}^2,\qquad s_{BC}^2 = nS_{BC}^2 \qquad (125)$$
$$s_{ABC}^2 = nS_{ABC}^2 \qquad (126)$$

Based on the above process, we can form Table 8, and the evaluated values are shown in Table 9.
Table 8. Orthogonal table $L_8\left(2^7\right)$ worksheet
(For each column 1 (A) to 7 (ABC), the data are split into the rows with level $-1$ and level $+1$; their sums give $Q_{\lambda-}$ and $Q_{\lambda+}$, and the variance $S_\lambda^2$ follows from Eq. (117).)

Table 9. Values for orthogonal table $L_8\left(2^7\right)$

Column No.           1:A     2:B     3:AB    4:C     5:AC    6:BC    7:ABC
Sum Q-              15.8    22.1    26.5    24.0    26.2    26.5    26.5
Sum Q+              35.0    28.7    24.3    26.8    24.6    24.3    24.3
Variance S(2)        5.76    0.68    0.08    0.12    0.04    0.08    0.08
Unbiased variance   46.08    5.445   0.605   0.98    0.32    0.605   0.605
We can utilize this table beyond its original meaning. The table is for three factors with full interactions; therefore, we have no column left for the error. In practical usage, we do not consider all the interactions.
3.1. No Interaction
If we do not consider any interaction, the direct effects are expressed with columns 1, 2, and 4. We regard the other columns as error ones, and the corresponding variance $S_e^2$ is given by

$$S_e^2 = S_{AB}^2+S_{AC}^2+S_{BC}^2+S_{ABC}^2 = 0.27 \qquad (127)$$

The corresponding freedom is the sum of the freedoms of these variances, $\nu_e = 4$ in this case, and the unbiased error variance is given by

$$s_e^2 = \frac{n}{\nu_e}S_e^2 = 0.53 \qquad (128)$$

The validity of each factor effect is evaluated as

$$F_A = \frac{s_A^2}{s_e^2} = \frac{46.08}{0.53} = 86.33 \qquad (129)$$

$$F_B = \frac{s_B^2}{s_e^2} = \frac{5.45}{0.53} = 10.20 \qquad (130)$$

$$F_C = \frac{s_C^2}{s_e^2} = \frac{0.98}{0.53} = 1.84 \qquad (131)$$

The corresponding critical $F$ value is the same for all factors:

$$F_{crit} = F\left(1, 4, 0.95\right) = 7.71 \qquad (132)$$

Therefore, factors A and B are effective, and factor C is not effective.
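The no-interaction analysis of Eqs. (127)-(132) can be reproduced from the Table 9 data as follows; the parity construction of the columns follows the rule stated for Table 7:

```python
import numpy as np
from itertools import product

y = np.array([2.3, 3.4, 4.5, 5.6, 7.5, 8.9, 9.7, 8.9])   # y111 ... y222
i, j, k = np.array(list(product([0, 1], repeat=3))).T    # subscript parities
cols = {'A': i, 'B': j, 'AB': (i + j) % 2, 'C': k,
        'AC': (i + k) % 2, 'BC': (j + k) % 2, 'ABC': (i + j + k) % 2}
n = y.size

# s^2 = n * (Q+ - Q-)^2 / n^2 for every column (sign convention cancels)
s2 = {name: n * ((y[c == 1].sum() - y[c == 0].sum()) / n) ** 2
      for name, c in cols.items()}
err = ['AB', 'AC', 'BC', 'ABC']
s2_e = sum(s2[name] for name in err) / len(err)   # pooled error, freedom 4
for f in ['A', 'B', 'C']:
    print(f, s2[f] / s2_e)   # F_A ~ 86, F_B ~ 10, F_C ~ 1.8
```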
3.2. The Other Factors
Although the table is for three factors, we can utilize it for more factors if we neglect some interactions. Let us consider the case where we do not consider any interaction, but want to treat 4 factors, that is, factors $A$, $B$, $C$, and $D$. We can select any of columns 3, 5, 6, and 7 for factor $D$. We assign column 5 (AC) to factor $D$, and set the experimental conditions based on Table 7. We assume that we obtain the same data values for simplicity, so that

$$s_D^2 = s_{AC}^2 \qquad (133)$$

The variance associated with the error is then given by

$$S_e^2 = S_{AB}^2+S_{BC}^2+S_{ABC}^2 = 0.23 \qquad (134)$$

The corresponding freedom is $\nu_e = 3$ in this case, and the unbiased error variance is given by

$$s_e^2 = \frac{n}{\nu_e}S_e^2 = 0.61 \qquad (135)$$

The validity of each factor effect is evaluated as

$$F_A = \frac{s_A^2}{s_e^2} = \frac{46.08}{0.61} = 76.2 \qquad (136)$$

$$F_B = \frac{s_B^2}{s_e^2} = \frac{5.45}{0.61} = 9.00 \qquad (137)$$

$$F_C = \frac{s_C^2}{s_e^2} = \frac{0.98}{0.61} = 1.62 \qquad (138)$$

$$F_D = \frac{s_D^2}{s_e^2} = \frac{0.32}{0.61} = 0.53 \qquad (139)$$

The corresponding critical $F$ value is the same for all factors:

$$F_{crit} = F\left(1, 3, 0.95\right) = 10.13 \qquad (140)$$

Therefore, factor $A$ is effective, and factors $B$, $C$, and $D$ are not effective. We need at least one column for the error; therefore, we can add at most four factors.
3.3. Interaction
We can consider at most 3 interactions, among factors $A$, $B$, and $C$. If we consider an interaction, we need to use the corresponding column. We consider the interaction $AC$ and one other factor $D$. We use column 5 for the interaction $AC$ and assign column 3 to factor $D$. The other columns, 6 and 7, are the ones for the error:

$$S_e^2 = S_{BC}^2+S_{ABC}^2 = 0.15 \qquad (141)$$

The corresponding freedom is $\nu_e = 2$, and the unbiased variance is

$$s_e^2 = \frac{n}{\nu_e}S_e^2 = 0.61 \qquad (142)$$

$$F_A = \frac{s_A^2}{s_e^2} = \frac{46.08}{0.61} = 76.2 \qquad (143)$$

$$F_B = \frac{s_B^2}{s_e^2} = \frac{5.45}{0.61} = 9.00 \qquad (144)$$

$$F_C = \frac{s_C^2}{s_e^2} = \frac{0.98}{0.61} = 1.62 \qquad (145)$$

$$F_{AC} = \frac{s_{AC}^2}{s_e^2} = \frac{0.32}{0.61} = 0.53 \qquad (146)$$

$$F_D = \frac{s_D^2}{s_e^2} = \frac{0.61}{0.61} = 1.00 \qquad (147)$$

The corresponding critical $F$ value is the same for all factors:

$$F_{crit} = F\left(1, 2, 0.95\right) = 18.51 \qquad (148)$$

Therefore, only factor $A$ is effective.
4. THREE FACTOR ANALYSIS WITH REPETITION
We extend the analysis in the previous section to data with repetition, where we have $r$ data for one given condition. The data are denoted as $y_{ijk\_r}$. The total number of data $n$ is given by

$$n = 2^3 r \qquad (149)$$

We use $r = 3$ here, and hence $n = 24$; however, the analysis procedure is applicable to any number $r$. The orthogonal table is shown in Table 10, where there are three data for each condition.

Table 10. Orthogonal table $L_8\left(2^7\right)$ with repeated data
(The sign columns are identical to those of Table 7; each combination row now carries three data values $y_{ijk\_1}, y_{ijk\_2}, y_{ijk\_3}$.)

Therefore, we have an additional variation associated with the error. The average associated with one condition is given by

$$\mu_{ijk} = \frac{1}{r}\sum_{p=1}^{r}y_{ijk\_p} \qquad (150)$$

The corresponding 8 averages are shown in Table 11.

Table 11. Averages of repeated data
(Each cell average is $\mu_{ijk} = \left(y_{ijk\_1}+y_{ijk\_2}+y_{ijk\_3}\right)/3$ for the eight conditions $ijk = 111, 112, \ldots, 222$.)

We can obtain the variance associated with this, $S_{ein}^2$, as
$$S_{ein}^2 = \frac{1}{n}\sum_{i=1}^{2}\sum_{j=1}^{2}\sum_{k=1}^{2}\sum_{p=1}^{r}\left(y_{ijk\_p}-\mu_{ijk}\right)^2 \qquad (151)$$

The freedom of this variance, $\nu_{ein}$, is given by

$$\nu_{ein} = 2\times 2\times 2\times\left(r-1\right) = 2\times 2\times 2\times\left(3-1\right) = 16 \qquad (152)$$

The unbiased variance is given by

$$s_{ein}^2 = \frac{n}{\nu_{ein}}S_{ein}^2 \qquad (153)$$

After we obtain the data shown in Table 10, we construct the analysis table by simply adding the data to the corresponding columns, as shown in Table 12.
Table 12. Orthogonal table $L_8\left(2^7\right)$ with repeated data
(The layout is the same as Table 8: for each column 1 (A) to 7 (ABC), the three data sets are split into the rows with level $-1$ and level $+1$, their sums give $Q_{\lambda-}$ and $Q_{\lambda+}$, and each variance $S_\lambda^2$ has freedom 1.)

We define the variables below:
$$Q_{A+} = \sum_{r=1}^{3}\left(y_{211\_r}+y_{212\_r}+y_{221\_r}+y_{222\_r}\right) \qquad (154)$$
$$Q_{A-} = \sum_{r=1}^{3}\left(y_{111\_r}+y_{112\_r}+y_{121\_r}+y_{122\_r}\right) \qquad (155)$$
$$Q_{B+} = \sum_{r=1}^{3}\left(y_{121\_r}+y_{122\_r}+y_{221\_r}+y_{222\_r}\right) \qquad (156)$$
$$Q_{B-} = \sum_{r=1}^{3}\left(y_{111\_r}+y_{112\_r}+y_{211\_r}+y_{212\_r}\right) \qquad (157)$$
$$Q_{C+} = \sum_{r=1}^{3}\left(y_{112\_r}+y_{122\_r}+y_{212\_r}+y_{222\_r}\right) \qquad (158)$$
$$Q_{C-} = \sum_{r=1}^{3}\left(y_{111\_r}+y_{121\_r}+y_{211\_r}+y_{221\_r}\right) \qquad (159)$$
$$Q_{\left(AB\right)+} = \sum_{r=1}^{3}\left(y_{111\_r}+y_{112\_r}+y_{221\_r}+y_{222\_r}\right) \qquad (160)$$
$$Q_{\left(AB\right)-} = \sum_{r=1}^{3}\left(y_{121\_r}+y_{122\_r}+y_{211\_r}+y_{212\_r}\right) \qquad (161)$$
$$Q_{\left(AC\right)+} = \sum_{r=1}^{3}\left(y_{111\_r}+y_{121\_r}+y_{212\_r}+y_{222\_r}\right) \qquad (162)$$
$$Q_{\left(AC\right)-} = \sum_{r=1}^{3}\left(y_{112\_r}+y_{122\_r}+y_{211\_r}+y_{221\_r}\right) \qquad (163)$$
$$Q_{\left(BC\right)+} = \sum_{r=1}^{3}\left(y_{111\_r}+y_{122\_r}+y_{211\_r}+y_{222\_r}\right) \qquad (164)$$
$$Q_{\left(BC\right)-} = \sum_{r=1}^{3}\left(y_{112\_r}+y_{121\_r}+y_{212\_r}+y_{221\_r}\right) \qquad (165)$$
$$Q_{\left(ABC\right)+} = \sum_{r=1}^{3}\left(y_{112\_r}+y_{121\_r}+y_{211\_r}+y_{222\_r}\right) \qquad (166)$$
$$Q_{\left(ABC\right)-} = \sum_{r=1}^{3}\left(y_{111\_r}+y_{122\_r}+y_{212\_r}+y_{221\_r}\right) \qquad (167)$$
The total average is given by

$$\mu = \frac{1}{n}\sum_{r=1}^{3}\left(y_{111\_r}+y_{112\_r}+y_{121\_r}+y_{122\_r}+y_{211\_r}+y_{212\_r}+y_{221\_r}+y_{222\_r}\right) \qquad (168)$$

This is also expressed by

$$\mu = \frac{1}{n}\left(Q_{\lambda+}+Q_{\lambda-}\right) \qquad (169)$$

where $\lambda$ is a dummy variable and $\lambda = A, B, C, \left(AB\right), \left(AC\right), \left(BC\right), \left(ABC\right)$. The averages associated with the two levels of each factor are expressed by

$$\mu_{\lambda-} = \frac{Q_{\lambda-}}{n/2} \qquad (170)$$
$$\mu_{\lambda+} = \frac{Q_{\lambda+}}{n/2} \qquad (171)$$

Let us consider the variance associated with factor $A$, denoted by $S_A^2$, which is evaluated as

$$S_A^2 = \frac{\dfrac{n}{2}\left(\mu_{A-}-\mu\right)^2+\dfrac{n}{2}\left(\mu_{A+}-\mu\right)^2}{n} \qquad (172)$$

This can be further reduced to

$$S_A^2 = \frac{1}{n^2}\left(Q_{A+}-Q_{A-}\right)^2 \qquad (173)$$

The other variances are evaluated in a similar way as

$$S_B^2 = \frac{1}{n^2}\left(Q_{B+}-Q_{B-}\right)^2 \qquad (174)$$
$$S_C^2 = \frac{1}{n^2}\left(Q_{C+}-Q_{C-}\right)^2 \qquad (175)$$
$$S_{AB}^2 = \frac{1}{n^2}\left[Q_{\left(AB\right)+}-Q_{\left(AB\right)-}\right]^2 \qquad (176)$$
$$S_{AC}^2 = \frac{1}{n^2}\left[Q_{\left(AC\right)+}-Q_{\left(AC\right)-}\right]^2 \qquad (177)$$
$$S_{BC}^2 = \frac{1}{n^2}\left[Q_{\left(BC\right)+}-Q_{\left(BC\right)-}\right]^2 \qquad (178)$$
$$S_{ABC}^2 = \frac{1}{n^2}\left[Q_{\left(ABC\right)+}-Q_{\left(ABC\right)-}\right]^2 \qquad (179)$$
The freedom of each of these variances is 1, and the unbiased variances are given by

$$s_A^2 = nS_A^2,\qquad s_B^2 = nS_B^2,\qquad s_C^2 = nS_C^2 \qquad (180)$$
$$s_{AB}^2 = nS_{AB}^2,\qquad s_{AC}^2 = nS_{AC}^2,\qquad s_{BC}^2 = nS_{BC}^2 \qquad (181)$$
$$s_{ABC}^2 = nS_{ABC}^2 \qquad (182)$$

If we neglect some interactions, we should regard them as an extrinsic error. For example, if we neglect the interactions $BC$ and $ABC$, the corresponding variance $S_{e'}^2$ is given by

$$S_{e'}^2 = S_{BC}^2+S_{ABC}^2 \qquad (183)$$

The corresponding freedom $\nu_{e'}$ is given by

$$\nu_{e'} = \nu_{BC}+\nu_{ABC} \qquad (184)$$

The total variance associated with the error is given by

$$S_e^2 = S_{ein}^2+S_{e'}^2 \qquad (185)$$

The corresponding freedom $\nu_e$ is given by

$$\nu_e = \nu_{ein}+\nu_{e'} \qquad (186)$$

The unbiased variance is then given by

$$s_e^2 = \frac{n}{\nu_e}S_e^2 \qquad (187)$$

The $F$ values are given by

$$F_A = \frac{s_A^2}{s_e^2} \qquad (188)$$
$$F_B = \frac{s_B^2}{s_e^2} \qquad (189)$$
$$F_C = \frac{s_C^2}{s_e^2} \qquad (190)$$
$$F_{AB} = \frac{s_{AB}^2}{s_e^2} \qquad (191)$$
$$F_{AC} = \frac{s_{AC}^2}{s_e^2} \qquad (192)$$

The corresponding critical $F$ value is the same for all factors:

$$F_{crit} = F\left(1, \nu_e, 0.95\right) \qquad (193)$$
SUMMARY
To summarize the results in this chapter:

In the $L_8\left(2^7\right)$ table, we have 7 columns and 8 rows, and basically we consider three factors $A, B, C$ without repeated data. The columns are assigned to the characters as

$$A \rightarrow 2^0 = 1: \text{first column},\qquad B \rightarrow 2^1 = 2: \text{second column},\qquad C \rightarrow 2^2 = 4: \text{fourth column}$$

These are related to the direct effects of each factor. The other columns are related to the interactions as

$$AB \rightarrow 1+2 = 3,\qquad AC \rightarrow 1+4 = 5,\qquad BC \rightarrow 2+4 = 6,\qquad ABC \rightarrow 1+2+4 = 7$$

We have 8 data denoted as $y_{ijk}$, where $i$ corresponds to $A$, $j$ to $B$, and $k$ to $C$: $y_{111}, y_{112}, y_{121}, \ldots, y_{222}$. For each column, we focus on the corresponding subscripts. For example, for the column assigned to $AB$, we form $i+j$ and assign $-1$ or $1$ to the row depending on whether $i+j$ is odd or even:

$$y_{111}: i+j = 1+1 = 2 \rightarrow 1$$
$$y_{112}: i+j = 1+1 = 2 \rightarrow 1$$
$$y_{121}: i+j = 1+2 = 3 \rightarrow -1$$
$$\vdots$$
$$y_{222}: i+j = 2+2 = 4 \rightarrow 1$$

We denote each column as $\lambda$, the rows with $-1$ as $\lambda-$ and those with $1$ as $\lambda+$, and the sums of the corresponding data as $Q_{\lambda-}$ and $Q_{\lambda+}$. The variance associated with a factor is given by

$$S_\lambda^2 = \frac{1}{n^2}\left(Q_{\lambda+}-Q_{\lambda-}\right)^2$$

The freedom of each factor is 1. Therefore, the corresponding unbiased variance is given by

$$s_\lambda^2 = \frac{n}{\nu_\lambda}S_\lambda^2 = nS_\lambda^2$$

The variance associated with the error corresponds to the columns we do not focus on, denoted as $\lambda'$. Therefore, the variance associated with the error is given by

$$S_e^2 = \sum_{\lambda'}S_{\lambda'}^2$$

The unbiased variance is given by

$$s_e^2 = \frac{nS_e^2}{n_{\lambda'}}$$

where $n_{\lambda'}$ is the number of columns we do not focus on. We evaluate

$$F_\lambda = \frac{s_\lambda^2}{s_e^2}$$

and compare this with $F\left(1, n_{\lambda'}, P\right)$.

We then treat repeated data. The repeated number is denoted as $r$, and the data number becomes $n = 2^3 r$. Each variance associated with a factor has the same expression,

$$S_\lambda^2 = \frac{1}{n^2}\left(Q_{\lambda+}-Q_{\lambda-}\right)^2,\qquad s_\lambda^2 = nS_\lambda^2$$

In this case, we have $r$ data for a given $\left(i, j, k\right)$, expressed as $y_{ijk\_p}$, and we can define averages for each condition as

$$\mu_{ijk} = \frac{1}{r}\sum_{p=1}^{r}y_{ijk\_p}$$

We can then obtain the corresponding variance as

$$S_{ein}^2 = \frac{1}{n}\sum_{i=1}^{2}\sum_{j=1}^{2}\sum_{k=1}^{2}\sum_{p=1}^{r}\left(y_{ijk\_p}-\mu_{ijk}\right)^2$$

The corresponding freedom $\nu_{ein}$ is given by

$$\nu_{ein} = 2^3\left(r-1\right)$$

We regard the total error variance as

$$S_e^2 = S_{ein}^2+\sum_{\lambda'}S_{\lambda'}^2$$

The corresponding freedom is given by

$$\nu_e = \nu_{ein}+n_{\lambda'}$$

Therefore, the corresponding unbiased variance is given by

$$s_e^2 = \frac{nS_e^2}{\nu_e}$$

We evaluate

$$F_\lambda = \frac{s_\lambda^2}{s_e^2}$$

and compare this with $F\left(1, \nu_e, P\right)$.
Chapter 10
PRINCIPAL COMPONENT ANALYSIS

ABSTRACT
We evaluate one subject from various points of view, and want to evaluate each subject as a whole. Principal component analysis enables us to do this. We compose new variables by linearly combining the original variables; in principal component analysis, the variables are composed so that the variance has the maximum value. We can regard the value of the first component as expressing the total score. We show the procedure for composing the variables. The method is also applied to regression where the explanatory variables have strong interaction: the variables are composed by principal component analysis, and we can then perform the regression using the composed variables, which gives us stable results.

Keywords: principal analysis, principal component, eigenvalue, eigenvector, regression
Figure 1. Path diagram for principal analysis.
1. INTRODUCTION
In many cases, we want to evaluate a subject from several different points of view. It is rarely the case that one subject is superior to the other subjects in all points. Therefore, we cannot simply judge which is better considering all points of view. In principal component analysis, we construct composite variables and decide which is better by evaluating the first composite variable. The schematic path diagram is shown in Figure 1. In the figure, we use $p$ kinds of variables, and we theoretically have the same number of principal components. There is another usage of principal component analysis. As we mentioned in Chapter 4 of volume 2 in association with multiple regression, the multiple regression coefficients become complicated, unstable, and inaccurate when the interactions between explanatory variables are significant. In that case, we can construct a composite variable which consists of the explanatory variables with strong interaction. Using the composite variable, we can apply the multiple regression analysis, which makes the results simple, stable, and clear. We treat two variables first and then extend the analysis to multiple ones. We perform matrix operations in this chapter; the basics of matrix operations are described in Chapter 15 of volume 3.
2. PRINCIPAL COMPONENT ANALYSIS WITH TWO VARIABLES
We consider principal component analysis with two variables $x_j$ ($j = 1, 2$). The sample number is $n$, and each datum is denoted as $\left(x_{i1}, x_{i2}\right)$ ($i = 1, 2, \ldots, n$), which is shown in Table 1.

Table 1. Data form for two-variable principal analysis

ID      x1      x2
1       x11     x12
2       x21     x22
⋮       ⋮       ⋮
i       xi1     xi2
⋮       ⋮       ⋮
n       xn1     xn2
The average and unbiased variance of $x_1$ are denoted as $\bar{x}_1$ and $s_1^2$, respectively; those of $x_2$ are denoted as $\bar{x}_2$ and $s_2^2$. These parameters are given by

$$\bar{x}_1 = \frac{1}{n}\sum_{i=1}^{n}x_{i1} \qquad (1)$$

$$s_1^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{i1}-\bar{x}_1\right)^2 \qquad (2)$$

$$\bar{x}_2 = \frac{1}{n}\sum_{i=1}^{n}x_{i2} \qquad (3)$$

$$s_2^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{i2}-\bar{x}_2\right)^2 \qquad (4)$$

The correlation factor between $x_1$ and $x_2$ is denoted as $r_{12}$ and is given by

$$r_{12} = \frac{1}{n-1}\sum_{i=1}^{n}\frac{x_{i1}-\bar{x}_1}{\sqrt{s_1^2}}\,\frac{x_{i2}-\bar{x}_2}{\sqrt{s_2^2}} \qquad (5)$$

First of all, we normalize the variables as

$$u_1 = \frac{x_1-\bar{x}_1}{\sqrt{s_1^2}} \qquad (6)$$

$$u_2 = \frac{x_2-\bar{x}_2}{\sqrt{s_2^2}} \qquad (7)$$

The normalized variables satisfy

$$\bar{u}_1 = \bar{u}_2 = 0 \qquad (8)$$

$$\sum_{i=1}^{n}u_{i1}^2 = n-1 \qquad (9)$$

$$\sum_{i=1}^{n}u_{i2}^2 = n-1 \qquad (10)$$

$$\sum_{i=1}^{n}u_{i1}u_{i2} = \left(n-1\right)r_{12} \qquad (11)$$

We construct a composite variable $z$ as

$$z = a_1u_1+a_2u_2 \qquad (12)$$

We impose that the unbiased variance of $z$ has the maximum value to determine the factors $a_1$ and $a_2$. The unbiased variance $s_z^2$ is evaluated as

$$s_z^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(z_i-\bar{z}\right)^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(a_1u_{i1}+a_2u_{i2}\right)^2 = a_1^2+a_2^2+2a_1a_2r_{12} \qquad (13)$$
(14)
Consequently, the problem is defined as follows. We determine the factors of a1 and a2 so that the unbiased variance has the maximum value under the restriction of Eq. (14). We apply Lagrange’s method of undetermined multipliers to solve this problem. We introduce a Lagrange function given by
f a1 , a2 , a12 a22 2a1a2 r12 a12 a22 1
(15)
Principal Component Analysis
343
Partial differentiating Eq. (15) with respect to a1 and set it to 0, we obtain f a1 , a2 , 2a1 2a2 r12 2a1 0 a1
(16)
Arranging this, we obtain a1 r12 a2 a1
(17)
Partial differentiating Eq. (15) with respect to a1 and set it to 0, we obtain f a1 , a2 , 2a2 2a2 r12 2a2 0 a2
(18)
Arranging this, we obtain a2 r12 a1 a2
(19)
Eqs. (17) and (19) are expressed by a matrix form as 1 r12
r12 a1 a1 1 a2 a2
(20)
This can also be expressed by
Ra a
(21)
where 1 R r12
r12 1
(22)
a a 1 a2
(23)
aT a1 , a2
(24)
Kunihiro Suzuki
344 Let us consider the meaning of 1 r12
. Multiplying Eq. (24) to Eq. (20), we obtain
r12 a1 a1 a1 , a2 1 a2 a2
a1 , a2
(25)
Expanding this, we obtain 1 r12 a1 left side a1 , a2 r12 1 a2 a r a a1 , a2 1 12 2 r12 a1 a2 a12 r12 a1a2 r12 a1a2 a22
(26)
a a 2r12 a1a2 2 1
sz
2 2
2
a right side a1 , a2 1 a2 a12 a22
(27)
Therefore, we obtain
sz 2
(28)
$\lambda$ is equal to the variance of $z$, and a large $\lambda$ means a large variance. $\lambda$ is called an eigenvalue. Since our target in principal component analysis is to construct a variable that clarifies the differences between the members, a large variance is the desired one. Therefore, we denote the maximum $\lambda$ as the first eigenvalue, the next as the second, and so on. The eigenvalues can be evaluated as follows. Modifying Eq. (20), we obtain

$$\begin{pmatrix}1-\lambda & r_{12}\\ r_{12} & 1-\lambda\end{pmatrix}\begin{pmatrix}a_1\\ a_2\end{pmatrix} = \begin{pmatrix}0\\ 0\end{pmatrix} \qquad (29)$$

This has nontrivial roots only when

$$\begin{vmatrix}1-\lambda & r_{12}\\ r_{12} & 1-\lambda\end{vmatrix} = 0 \qquad (30)$$

This is reduced to

$$\left(1-\lambda\right)^2-r_{12}^2 = 0 \qquad (31)$$

Therefore, we have two roots given by

$$\lambda = 1+r_{12} \qquad (32)$$

$$\lambda = 1-r_{12} \qquad (33)$$

We can obtain a set of $\left(a_1, a_2\right)$ for each $\lambda$. Substituting $\lambda = 1+r_{12}$ into Eq. (20), we obtain

$$a_1+r_{12}a_2 = \left(1+r_{12}\right)a_1 \qquad (34)$$

$$r_{12}a_1+a_2 = \left(1+r_{12}\right)a_2 \qquad (35)$$

These two equations are reduced to the same result of

$$a_1 = a_2 \qquad (36)$$

Imposing the restriction of Eq. (14), we obtain

$$2a_1^2 = 1 \qquad (37)$$

Therefore, we obtain

$$a_1 = \frac{1}{\sqrt{2}} \qquad (38)$$

We adopt the positive factor, and the composite variable $z_1$ is given by

$$z_1 = \frac{1}{\sqrt{2}}u_1+\frac{1}{\sqrt{2}}u_2 \qquad (39)$$

Similarly, for $\lambda = 1-r_{12}$ we obtain

$$a_1+r_{12}a_2 = \left(1-r_{12}\right)a_1 \qquad (40)$$

$$r_{12}a_1+a_2 = \left(1-r_{12}\right)a_2 \qquad (41)$$

This is reduced to

$$a_1 = -a_2 \qquad (42)$$

Considering the restriction of Eq. (14), we obtain

$$2a_1^2 = 1 \qquad (43)$$

Therefore, we obtain

$$a_1 = \frac{1}{\sqrt{2}} \qquad (44)$$

We adopt the positive one, and the composite variable $z_2$ is given by

$$z_2 = \frac{1}{\sqrt{2}}u_1-\frac{1}{\sqrt{2}}u_2 \qquad (45)$$

In the case of $r_{12} > 0$, $z_1$ is called the first principal component and $z_2$ the second principal component. In the case of $r_{12} < 0$, $z_2$ is called the first principal component and $z_1$ the second principal component. Assuming $r_{12} > 0$, we summarize the results. The first principal component eigenvalue and eigenvector are given by

$$\lambda_1 = 1+r_{12} \qquad (46)$$

$$\boldsymbol{a}^{\left(1\right)} = \begin{pmatrix}\dfrac{1}{\sqrt{2}}\\ \dfrac{1}{\sqrt{2}}\end{pmatrix} \qquad (47)$$

The second principal component eigenvalue and eigenvector are given by

$$\lambda_2 = 1-r_{12} \qquad (48)$$

$$\boldsymbol{a}^{\left(2\right)} = \begin{pmatrix}\dfrac{1}{\sqrt{2}}\\ -\dfrac{1}{\sqrt{2}}\end{pmatrix} \qquad (49)$$

The normalized composite values are given as below. The first principal component is given by

$$z_i^{\left(1\right)} = \frac{1}{\sqrt{2}}u_{i1}+\frac{1}{\sqrt{2}}u_{i2} \qquad (50)$$

The second principal component is given by

$$z_i^{\left(2\right)} = \frac{1}{\sqrt{2}}u_{i1}-\frac{1}{\sqrt{2}}u_{i2} \qquad (51)$$

The contribution of component $i$ to the total is evaluated as

$$\frac{\lambda_i}{\lambda_1+\lambda_2} = \frac{\lambda_i}{\left(1+r_{12}\right)+\left(1-r_{12}\right)} = \frac{\lambda_i}{2} \qquad (52)$$

The total sum of the eigenvalues is equal to the number of variables in general.
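A numerical sketch of the two-variable case is given below; the data are synthetic with a true correlation of 0.6, so the eigenvalues of the correlation matrix come out near $1 \pm r_{12}$ and the component variances match them:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=200)

u = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)   # normalized variables, Eqs. (6)-(7)
R = np.corrcoef(u, rowvar=False)                   # correlation matrix, Eq. (22)
lam, vec = np.linalg.eigh(R)                       # eigenvalues in ascending order
order = np.argsort(lam)[::-1]
lam, vec = lam[order], vec[:, order]
print(lam)                      # close to 1 + r12 and 1 - r12, Eqs. (32)-(33)
z = u @ vec                     # principal component scores
print(z.var(axis=0, ddof=1))    # the score variances equal the eigenvalues, Eq. (28)
```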
3. PRINCIPAL COMPONENT ANALYSIS WITH MULTI VARIABLES
We extend the analysis of two variables in the previous section to $m$ variables $x_1, x_2, \ldots, x_m$, where $m$ is larger than two. The sample data number is assumed to be $n$, so the data are denoted as $\left(x_{i1}, x_{i2}, \ldots, x_{im}\right)$ ($i = 1, 2, \ldots, n$). Normalizing these variables, we obtain

$$u_1 = \frac{x_1-\bar{x}_1}{\sqrt{s_1^2}},\quad u_2 = \frac{x_2-\bar{x}_2}{\sqrt{s_2^2}},\quad\ldots,\quad u_m = \frac{x_m-\bar{x}_m}{\sqrt{s_m^2}} \qquad (53)$$

The averages and variances of these variables are 0 and 1 for all. We construct a composite variable given by

$$z = a_1u_1+a_2u_2+\cdots+a_mu_m \qquad (54)$$

The corresponding unbiased variance is given by

$$s_z^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(a_1u_{i1}+a_2u_{i2}+\cdots+a_mu_{im}\right)^2 = \sum_{j=1}^{m}a_j^2+2\sum_{j<k}a_ja_kr_{jk} \qquad (55)$$

The constraint is expressed by

$$a_1^2+a_2^2+\cdots+a_m^2 = 1 \qquad (56)$$

The Lagrange function is given by

$$f = \sum_{j=1}^{m}a_j^2+2\sum_{j<k}a_ja_kr_{jk}-\lambda\left(\sum_{j=1}^{m}a_j^2-1\right) \qquad (57)$$

Partially differentiating Eq. (57) with respect to $a_1$ and setting it to 0, we obtain

$$\frac{\partial f}{\partial a_1} = 2a_1+2a_2r_{12}+2a_3r_{13}+\cdots+2a_mr_{1m}-2\lambda a_1 = 0 \qquad (58)$$

Partially differentiating Eq. (57) with respect to $a_2$ and setting it to 0, we obtain

$$\frac{\partial f}{\partial a_2} = 2a_1r_{12}+2a_2+2a_3r_{23}+2a_4r_{24}+\cdots+2a_mr_{2m}-2\lambda a_2 = 0 \qquad (59)$$

Repeating the same process up to $a_m$, we finally obtain the results in a matrix form, given by

$$\begin{pmatrix}1 & r_{12} & \cdots & r_{1m}\\ r_{21} & 1 & \cdots & r_{2m}\\ \vdots & \vdots & \ddots & \vdots\\ r_{m1} & r_{m2} & \cdots & 1\end{pmatrix}\begin{pmatrix}a_1\\ a_2\\ \vdots\\ a_m\end{pmatrix} = \lambda\begin{pmatrix}a_1\\ a_2\\ \vdots\\ a_m\end{pmatrix} \qquad (60)$$

We obtain $m$ kinds of eigenvalues from

$$\begin{vmatrix}1-\lambda & r_{12} & \cdots & r_{1m}\\ r_{21} & 1-\lambda & \cdots & r_{2m}\\ \vdots & \vdots & \ddots & \vdots\\ r_{m1} & r_{m2} & \cdots & 1-\lambda\end{vmatrix} = 0 \qquad (61)$$

We show briefly how we can derive the eigenvalues and eigenvectors. We assume that

$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m \qquad (62)$$

$\lambda_i$ corresponds to the $i$-th principal component. The corresponding eigenvector can be evaluated from

$$\begin{pmatrix}1 & r_{12} & \cdots & r_{1m}\\ r_{21} & 1 & \cdots & r_{2m}\\ \vdots & \vdots & \ddots & \vdots\\ r_{m1} & r_{m2} & \cdots & 1\end{pmatrix}\begin{pmatrix}a_1\\ a_2\\ \vdots\\ a_m\end{pmatrix} = \lambda_i\begin{pmatrix}a_1\\ a_2\\ \vdots\\ a_m\end{pmatrix} \qquad (63)$$

We cannot obtain the values of the elements of the eigenvector, but only the ratio of each element. We consider the ratio with respect to $a_1$, and set the other factors as

$$a_j = \kappa_ja_1 \qquad (64)$$

We can evaluate $\kappa_j$ from Eq. (63), where $a_1$ is not determined. From the constraint of Eq. (56), we can decide $a_1$ from

$$a_1^2+\kappa_2^2a_1^2+\cdots+\kappa_m^2a_1^2 = 1 \qquad (65)$$

We select the positive one and obtain

$$a_1 = \frac{1}{\sqrt{1+\kappa_2^2+\cdots+\kappa_m^2}} \qquad (66)$$

Therefore, we obtain

$$z = \frac{1}{\sqrt{1+\kappa_2^2+\cdots+\kappa_m^2}}\left(u_1+\kappa_2u_2+\cdots+\kappa_mu_m\right) \qquad (67)$$

The contribution ratio is given by

$$\frac{\lambda_i}{m} \qquad (68)$$

In principal component analysis, we usually consider up to the second component, since we can then utilize a two-dimensional plot. The validity of using only two principal components can be evaluated by the sum of their contributions, given by

$$\frac{\lambda_1+\lambda_2}{m} \qquad (69)$$

The critical value for the accuracy is not clear. However, we can roughly check whether the value is larger than 0.8 or not for judging the validity of the analysis. The data are plotted on the two-dimensional plane with the first and second principal component axes.
Principal Component Analysis
351
Each component variable for j 1, 2, , m is plotted as
a , a a , a 1 1
1
2
1 2
2 2
(70)
a , a 1 m
2 m
The each member is plotted as
z , z 1
i
2
(71)
i
where
zi a1 ui1 a2 ui 2
(72)
zi a1 ui1 a2 ui 2
(73)
1
1
2
2
1
2
We treat the first and second components identically up to here. However, it should be weighted depending on the eigenvalue. We should remind that the eigenvalue has the meaning of the variance, and hence it is plausible to weight the square root of the eigenvalue which corresponds to the standard deviation. That is, we multiply first component, and
2
1
to the
to the second component.
d j We can then define the distance between two members i and denoted as ij , and it is given by dij
1 zi1 1 z j1
2
2 zi 2 2 z j2
2
(74)
4. EXAMPLE FOR PRINCIPAL COMPONENT ANALYSIS (SCORE EVALUATION) We apply principal component analysis to the score evaluation. We obtain the score data as shown in Table 2.
Kunihiro Suzuki
352
Table 2. Score of subjects
Member ID 1 2 3 4 5 6 7 8 9 10 Mean Stdev.
Japanese x1 86 71 42 62 96 39 50 78 51 89 66.4 20.5
English x2 79 75 43 58 97 33 53 66 44 92 64.0 21.6
Mathematics x3 67 78 39 98 61 45 64 52 76 93 67.3 19.4
Science x4 68 84 44 95 63 50 72 47 72 91 68.6 18.0
Table 3. Normalized scores and corresponding first and second principal components Member ID 1 2 3 4 5 6 7 8 9 10
Japanese u1 0.954 0.224 -1.188 -0.214 1.441 -1.334 -0.798 0.565 -0.750 1.100
English Mathematics u2 u3 0.696 -0.015 0.510 0.552 -0.974 -1.461 -0.278 1.585 1.531 -0.325 -1.438 -1.151 -0.510 -0.170 0.093 -0.790 -0.928 0.449 1.299 1.327
Science u4 -0.033 0.857 -1.368 1.469 -0.312 -1.035 0.189 -1.202 0.189 1.246
First component 0.796 1.073 -2.493 1.283 1.165 -2.479 -0.643 -0.671 -0.518 2.488
Second component 0.857 -0.348 0.321 -1.765 1.802 -0.297 -0.678 1.342 -1.148 -0.086
The normalized values are shown in Table 3. The correlation matrix is then evaluated as 0.967 0.376 0.311 1 0.967 1 0.415 0.398 R 0.376 0.415 1 0.972 1 0.311 0.398 0.972
(75)
The corresponding eigenvalues can be evaluated with a standard matrix operation shown in Chapter 15 of volume 3, and are given by
Principal Component Analysis
353
1 2.721, 2 1.222, 3 0.052, 4 0.005
(76)
The eigenvectors are given by z1 0.487u1 0.511u2 0.508u3 0.493u4
(77)
z2 0.527u1 0.474u2 0.481u3 0.516u4
(78)
z3 0.499u1 0.539u2 0.504u3 0.455u4
(79)
z4 0.485u1 0.474u2 0.506u3 0.533u4
(80)
The contribution of each principal component is given by 1 4
0.680,
2 4
0.306,
3 4
0.013,
4 4
0.001
(81)
The contribution summing up to the second component is 98% and is more than 80%, and we can focus on the two components neglecting the other ones. The first and second components are shown in Table 3. The first and second component for each subject is given by Japanese: 0.487, 0.527 English: 0.511, 0.474
(82)
Mathematics: 0.508, 0.481 Science: 0.493, 0.516
We treat the first and second component identically above. The weighted ones depending on the eigenvalue can also be evaluated. That is, we multiply 1 first component, and 2 by
1.222
2.721
to the
to the second component. The resultant data are given
Japanese: 0.803, 0.583 English: 0.843, 0.524 Mathematics: 0.838, 0.532 Science: 0.813, 0.570
(83)
Kunihiro Suzuki
354
Let us inspect the first component z1 , where the each factor is almost the same. Therefore, it means simple summation. Let us inspect the second component z2 . u1 and u2 are the almost the same, and u3 and u4 are the almost the same with the negative sign. The group u1 and u2 corresponds to
Japanese and English, and the group u3 and u4 corresponds to mathematics and science. Therefore, this expresses the quantitative course / qualitative course. The naming is not straight forward. We should do it by inspecting the value of factors. We further weight each score as shown in Table 4, and the corresponding twodimensional plot is shown in Figure 2. Table 4. Principal components and weighted principal components
Member ID
First component
1 2 3 4 5 6 7 8 9 10
Second component
0.796 1.073 -2.493 1.283 1.165 -2.479 -0.643 -0.671 -0.518 2.488
0.857 -0.348 0.321 -1.765 1.802 -0.297 -0.678 1.342 -1.148 -0.086
Weighted first component 1.313 1.770 -4.113 2.116 1.922 -4.090 -1.060 -1.107 -0.854 4.104
Weighted second component 0.948 -0.385 0.355 -1.951 1.992 -0.328 -0.750 1.483 -1.270 -0.095
Weighted second component and factor loading
5
-5
4
3 No.5
2 No.8 Jap. Eng.
1
No.1
No.3
No.10
0 -4 No.6
-3
-2
-1
0 No.7 -1 No.9
1 2 Math. No.2 Sci.
-2
3
4
5
No.4
-3 -4
-5
Weighted first principal component and factor loading
Figure 2. Weighted principal components and factor loadings.
We first simply inspect the horizontal direction which corresponds to the first principal component. No. 10 has high score, and the second group is No. 4 and No. 5, and the last group is No. 3 and No. 6.
Principal Component Analysis
355
We then focus on the vertical direction which corresponds to the second principal component. If it is positive it is qualitative characteristics and negative it is quantitative characteristics. Let us compare the second group No. 4 and 5 based on the second principal component, where both have almost the same first principal component. No. 4 has a quantitative character, and No. 5 has a quality character. We can also see that No. 10 is an all-rounder. The distance in the plane of Figure 2 corresponds to the resemblance of each member. The distances are summarized in Table 5. No. 1 resembles to No. 5, and No.2 to No.1, and so on. Table 5. Distance between each member
No.1 No.2 No.3 No.4 No.5 No.6 No.7 No.8 No.9 No.10
No.1 No.2 No.3 No.4 No.5 No.6 No.7 No.8 No.9 No.10 1.5 5.8 3.2 1.3 5.9 3.1 2.6 3.3 3.1 1.5 6.3 1.7 2.5 6.2 3.0 3.6 2.9 2.5 5.8 6.3 7.0 6.6 0.7 3.4 3.4 3.8 8.7 3.2 1.7 7.0 4.2 6.8 3.6 5.0 3.2 2.9 1.3 2.5 6.6 4.2 6.8 4.3 3.2 4.5 3.2 5.9 6.2 0.7 6.8 6.8 3.2 3.7 3.6 8.6 3.1 3.0 3.4 3.6 4.3 3.2 2.4 0.6 5.5 2.6 3.6 3.4 5.0 3.2 3.7 2.4 2.9 5.7 3.3 2.9 3.8 3.2 4.5 3.6 0.6 2.9 5.4 3.1 2.5 8.7 2.9 3.2 8.6 5.5 5.7 5.4
5. MULTIPLE REGRESSION USING PRINCIPAL COMPONENT ANALYSIS In the multiple regression analysis, the analysis becomes unstable and inaccurate when the interaction between explanatory variables is significant as shown in Chapter 4 of volume 2. In that case, we can utilize the principal analysis to overcome the problem. Let us consider the case where the interaction between two explanatory variables x1 and
x2 is significant. We construct a composite variable given by z a1u1 a2u2
(84)
u1 and u2 are normalized variables. We then obtain the first principal component
where as
zi 1
1 2
ui1
1 2
ui 2
(85)
Kunihiro Suzuki
356
We should evaluate the contribution of the component given by 1
(86)
2
This should be more than 0.8. We assume that the contribution is more than 0.8 and continue the analysis below. The average of the first principal component is zero. The variance z12
z12 can be evaluated below.
1 n 2 zi1 n 1 i 1 2
1 n 1 1 ui1 ui 2 n 1 i 1 2 2 n 1 1 ui12 ui 22 2ui1ui 2 2 n 1 i 1 1 r12
(87)
The correlation factor between objective variables and the first principal component is given by 1 n vi zi n 1 i 1 2 1
ryz1
z1
1 n 1 1 vi ui1 ui 2 1 r12 n 1 i 1 2 2 1
1 2 1 r12
r
y1
(88)
ry 2
When r12 is zero, it becomes average of each correlation factor, that is, the two variables becomes one variable. The regression is given by v ryz1 z 1
1 2 1 r12 ry1 ry 2 2 1 r12
r
y1
1 1 ry 2 u1 u2 2 2
u1 u2
(89)
Principal Component Analysis
357
Therefore, the regression can be expressed by the average correlation factor, which is decreased by the interaction between the two explanatory variables. We can easily extend this procedure to multi variables. We can obtain the first component variable that consists of m variables where the interaction is significant. That is, m
z ai ui 1
(90)
i 1
We then analyze the multiple regression analysis using this lumped variable as well as the other variables.
SUMMARY To summarize the results in this chapter— We normalize the data as
uij j 1,2, , p; i 1,2, , n We then obtain the matrix relationship given by 1 r12 r21 1 rm1 rm 2
r1m a1 a1 r2 m a2 a 2 1 am am
where rij is the correlation factor between factor i and j . We can obtain eigenvalues and eigenvectors given by
1 ; a1 k11 , k21 ,
, k p 1
, k p
, k p
2 ; a 2 k1 2 , k2 2 ,
2
・・・ p ; a p k1 p , k2 p ,
p
Kunihiro Suzuki
358 We then express the data as
zi k1 ui1 k2 ui 2
k p uip
zi k1 ui1 k2 ui 2
k p uip
1
1
2
1
2
2
1
2
・・・
zi k1 ui1 k2 ui 2 p
p
p
k p uip p
This is only the different expression of the same data. However, the difference exits where we focus on only the first two components. The effectiveness of the two component selection can be evaluated as 1 2 2 1 1 2 p p
z , z 1
i
2
i
Each item can be plotted in the two dimensional plane as Item 2 : k , k
Item 1: k1 , k1
1
2
1 2
2
2
Item p : k p , k p 1
2
The score of each member is plotted in the two-dimensional plane as
1 zi1 , 2 zi 2
We can judge the goodness of the member by evaluating the first term of 2
z characteristics from the second term of 1 i . The closeness of each member can be evaluated from the distance given by
1 zi1
, and
Principal Component Analysis d zij
1 zi1 1 z j1
2
2 zi 2 2 z j2
359
2
The each item is plotted in the two-dimensional plane as
1 k j1 , 2 k j 2
The closeness of each item can be evaluated from the distance given by dkij
1 ki1 1 k j1
2
2 ki 2 2 k j2
2
We can also evaluate the closeness of the member and the item from the distance given by d zkij
1 zi1 1 k j1
2
2 zi 2 2 k j 2
2
When the multiple regression suffers high interaction between explanatory variables, we perform the principal component analysis and for a composite variable given by m
z ai ui 1
i 1
We then perform one variable regression for each pair.
Chapter 11
FACTOR ANALYSIS ABSTRACT We obtain various data for many factors. We assume the output data comes from some few fundamental factors. Factor analysis gives us the procedure to decide the factors, where the factor number is less than the parameter numbers in general. Whole data are then expressed by the fewer factors, which are regarded as causes of the original data.
Keywords: factor analysis, factor point, factor loading, independent load, Valimax method, rotation matrix
1. INTRODUCTION We obtain various kinds of data. We assume that there are some common causes for all data, and each data can be reproduced by the cause factors. The procedure to evaluate the common cause is called a factor analysis. There are some confusion between principal component analysis and factor analysis, and we utilize the principal component analysis in the factor analysis. However, both analyses are quite different as explained below. In the principal component analysis, we extract the same kind number of the variable as that of the original variables, but focus on only the first principal component and treat it as though it is the target variable of the original variables, which is shown in Figure 1 (a). In the factor analysis, we want to extract variables which are regarded as the cause for the all original variables. We set the number of the variable for the cause from the beginning, and determine the variables. There is no order between the extracted variables. The schematic path diagram is shown in Figure 1 (b).
362
Kunihiro Suzuki
Consequently, in principal component analysis, and factor analysis, the focused variable number is fewer than the original variable number, and the analysis is quite similar. However, the role of the extracted variables is quite different. The ones for the principal analysis are objective one for the obtained data, and the ones for the factor analysis are the explanatory variables.
Figure 1. Schematic path diagram. (a) Principal component analysis, (b) Factor analysis.
In this chapter, we start with one factor, and then treat two factors, and finally treat arbitrarily factor numbers. We repeat almost the same step three times, where some steps
Factor Analysis
363
are added associated with the number. Performing the steps, we may be able to obtain a clear image of the factor analysis. We perform matrix operation in this chapter. The basic matrix operation is described in Chapter 15 of volume 3.
2. ONE FACTOR ANALYSIS We focus on one variable, that is, we assume that only one variable contributes to the whole original data.
2.1. Basic Equation for One Variable Factor Analysis We assume that one member has three kinds of scores of x, y , z . We also assume that the three variables are all normalized.
a , a , a We denote f as the factor point, and factor loadings as x y z , and independent
loads as
e , e , e . Then the original variables are expressed by
x y z
x
y
z
ax ex f a y ey a e z z
(1)
We denote each data using a subscript i , and they are expressed by
xi ax eix yi fi a y eiy a e z z iz i
(2)
a , a , a The factor loadings x y z are independent of the member, but depend on x, y , z . The factor point f depends on the member, but independent of x, y , z . The independent e ,e ,e
variables x y z depend on both x, y , z and the member. The terms on right side of Eq. (2) are all unknown variables. We need some restrictions on them to decide the terms.
Kunihiro Suzuki
364
We assume that a factor point f is normalized, that is E f 0
(3)
V f 1
(4)
The average of the independent load is zero, that is,
E e 0
(5)
where x, y , z . The covariance between f , e are zero, that is,
Cov f , e 0 Cov e , e 0
(6)
for
(7)
We define the vectors below.
f1 f f 2 fn
(8)
e1x e ex 2 x enx
(9)
e1 y e2 y ey eny
(10)
Factor Analysis e1z e ez 2 z enz
x1 X y1 z 1
365
(11)
xn yn zn
x2 y2 z2
(12)
ax A ay a z
(13)
F f1
f2
e1x e1 y e 1z
e2 x e2 y e2 z
fn
(14)
enx eny enz
(15)
Eq. (1) can then be expressed by X AF
(16)
T Multiplying X to the Eq. (16) from the right side, and expanding it, we obtain
x1 T XX y1 z 1
x xn 1 x yn 2 zn xn
x2 y2 z2
nS xx nS xy nS xz
nS xy nS yy nS yz
1 n rxy r xz
rxy 1 ryz
y1 y2 yn
z1 z2 zn
nS xz nS yz nS zz rxz ryz 1
(17)
Kunihiro Suzuki
366 T
where X is the transverse matrix of X . Similarly, we multiply transverse matrix to Eq. (16) from the right side, we obtain
AF AF AF F T AT T T
AFF T AT AF T F T AT T A FF T AT A F T F T AT T T
(18)
We evaluate the terms in Eq. (18) as below. n
FF T f i 2 i 1
n
(19)
F T f1
e1x e2 x fn enx
f2
fi eix fi eiy i 1 i 1 0 0 0 n
n
eny
n
fe i 1
e1z e2 z enz
e1 y e2 y
i iz
(20)
Substituting Eqs. (17), (19), and (20) into Eq. (16), we obtain 1 1 T XX rxy n r xz
rxz 1 ryz AAT T n 1
rxy 1 ryz
(21)
Evaluating the terms in the right side of Eq.(21), we obtain
ax AA a y ax a z T
ay
az (22)
Factor Analysis e1x 1 T 1 e1 y n n e1z
e2 x e2 y e2 z
e1x enx e2 x eny enz enx
e1 y e2 y eny
367
e1z e2 z enz
n n n 2 eix eiy eix eiz eix i 1 i 1 i 1 n n 1 n eiy eix eiy 2 eiy eiz n i 1 i 1 i 1 n n n 2 eiz eix eiz eiy eiz i 1 i 1 i 1 n 1 2 0 0 n eix i 1 1 n 2 0 eiy 0 n i 1 1 n 2 0 0 eiz n i 1 Se 2 0 0 x 2 0 Sey 0 2 0 0 S ez
(23)
Substituting Eqs. (22) and (23) into Eq. (21), we obtain 1 rxy r xz
rxy 1 ryz
rxz ax ryz a y ax 1 az
ay
Se 2 x az 0 0
0 Sey 2
0
0 0 2 Sez
(24)
This is the basic equation for one factor analysis. Let us compare the case for principal component analysis, where the covariance matrix is expanded as follows. 1 rxy r xz
rxy 1 ryz
rxz ryz 1v1v1T 2 v2 v2T 3v3v3T 1
The right side terms are all determined clearly.
(25)
Kunihiro Suzuki
368 If we can assume
ax a a y 1 v1v1T a z
(26)
Se 2 x 0 0
(27)
0 Sey 2
0
0 0 2 v2 v2T 3v3v3T 2 Sez
The first principal analysis and factor analysis become identical. It is valid when the first principal component dominates all. However, it is not true in general. As we mentioned, we can determine whole terms of Eq. (25) in the principal component analysis. On the other hand, we cannot solve the terms in Eq. (24) shown below in general. 2 ax Sex ay , 0 a z 0
0 Sey 2
0
0 0 2 Sez
(28)
We solve Eq. (24) with the aid of the principal component analysis, which is shown in the next section.
2.2. Factor Load We can expand the covariance matrix as below as shown in Chapter 10 for the principal component analysis. 1 rxy r xz
rxy 1 ryz
rxz ryz 1v1v1T 2 v2 v2T 3v3v3T 1
1 v1
1 v1
T
2 v2
2 v2
T
3 v3
3 v3
T
(29)
Factor Analysis We regards the first term
1 v1
369
T
1 v1 this as factor load in the factor analysis, that
is,
1 rxy r xz
rxy 1 ryz
rxz ryz 1
1_1 v1_1
1_1 v1_1
T
Se 2 x 0 0
0 Sey 2
0
0 0 2 Sez
(30)
Therefore, we approximate the factor load as
ax a y 1_1 v1_1 a z
(31)
This is not the final form factor loadings, but is the first approximated one. The first approximated independent load can be evaluated as Se 2_1 0 0 1 x 2 0 Sey _1 0 rxy 2 rxz 0 0 Sez _1
rxy 1 ryz
rxz ryz 1
1_1 v1_1
1_1 v1_1
T
(32)
The non-diagonal elements on the right side of Eq. (32) are not 0 in general. Therefore, Eq. (32) is invalid from the rigorous point of view. However, we ignore the non-diagonal elements and regard the diagonal elements on both sides are the same. We then form the new covariance matrix given by 1 Se 2_1 rxy rxz x 2 R1 rxy 1 Sey _1 ryz 2 r ryz 1 Sez _1 xz
We evaluate the eigenvalue and eigenvectors of the matrix, and evaluate below.
(33)
Kunihiro Suzuki
370 Se 2_ 2 x 0 0 1 rxy r xz
0 S 0 2 0 Sez _ 2 rxy rxz 1 ryz 1_ 2 v1_ 2 ryz 1 0
2 ey _ 2
1_ 2 v1_ 2
T
(34)
We evaluate the difference between the approximated independent factors given by Se 2_ 2 Se 2_1 x x 2 2 Sey _ 2 Sey _1 S 2 S 2 ez _1 ez _ 2
(35)
If the value below is less than a certain value
2 1 2 2 2 2 Se 2_ 2 Se 2_1 Se 2_ 2 Se 2_1 S S e _ 2 e _1 x x y y z z 3
(36)
we regard the component is the load factor given by
a 1_ m v1_ m
(37)
If the value is more than a certain value, we construct the updated covariance matrix given by 1 Se 2_ 2 rxy rxz x 2 R2 rxy 1 Sey _ 2 ryz 2 r r 1 S xz yz e _ 2 z
(38)
We evaluate the eigenvalue and eigenvector of it and the updated one denote as
1_ 3 , v1_ 3 . We can then evaluate the updated independent matrix given by
Factor Analysis Se 2_ 3 x 0 0 1 rxy r xz
0 2 Sey _ 3 0 2 0 Sez _ 3 rxy rxz 1 ryz 1_ 3 v1_ 3 ryz 1
371
0
1_ 3 v1_ 3
T
(39)
We further evaluate the updated difference of the independent element matrix below. Se 2_ 3 Se 2_ 2 x x 2 2 Sey _ 3 Sey _ 2 S 2 S 2 ez _ 2 ez _ 3
(40)
and evaluate below.
2 2 2 1 2 2 2 2 2 2 Sex _ 3 Sex _ 2 Sey _ 3 Sey _ 2 Sez _ 3 Sez _ 2 3
(41)
We repeat this process until is less than the certain critical value. When we obtain this value after the m times process, we obtain the load factor as
a 1_ m v1_ m
(42)
The corresponding diagonal elements are given by d x2 1 Sex _ m , d y2 1 Sey _ m , d z2 1 Sez _ m 2
2
2
(43)
which we utilize later.
2.3. Factor Point and Independent Load We review the process up to here. The basic equation for one variable factor analysis is given by
Kunihiro Suzuki
372
xi ax eix yi fi a y eiy a e z z iz i
(44)
We decide a factor load, and the variances of ex , ey , ez . We further need to evaluate fi and ex , ey , ez themselves. We assume that x, y , z are commonly expressed with f . This means that the predictive
fˆ value of f is denoted as i and it can be expressed with the linear sum of x, y , z given by fˆi hfx xi hfy yi hfz zi
(45)
The deviation from the true value fi is expressed by n
Q f f i fˆi i 1
2
fi h fx xi h fy yi h fz zi n
2
i 1
(46)
We decide the factors h fx , h fy , h fz so that Q f has the minimum value. Partial differentiate Eq. (46) with respect to h fx , and set it to 0, we obtain
Q f h fx
h fx
f h n
i 1
i
x h fy yi h fz zi
2
fx i
2 fi h fx xi h fy yi h fz zi xi 0 n
i 1
(47)
We then obtain
h n
i 1
fx xi xi h fy yi xi h fz zi xi f i xi n
i 1
The left hand side of Eq. (48) is further modified as
(48)
Factor Analysis
h
x x h fy yi xi h fz zi xi h fx xi xi h fy yi xi h fz zi xi
n
i 1
373
fx i i
n
n
n
i 1
i 1
i 1
n h fx rxx h fy rxy h fz rxz
(49)
The right side of Eq. (48) is further modified as n
n
f x f a i 1
i i
i
i 1
x
f i bx g i exi
nax
(50)
Therefore, we obtain
h fx rxx h fy rxy h fz rxz ax
Similarly, we obtain with respect to Q f h fy
h fx
f h n
i 1
i
(51)
h fy
x h fy yi h fz zi
as 2
fx i
2 fi h fx xi h fy yi h fz zi yi 0 n
i 1
(52)
We then obtain
h fx rxy h fy ryy h fz ryz ay
Similarly, we obtain with respect to
h fx rzx h fy ryz h fz rzz az
(53)
h fz
as
(54)
Summarizing these, we obtain rxx rxy r zx
rxy ryy ryz
rzx h fx ax ryz h fy a y rzz h fz az
(55)
Kunihiro Suzuki
374 We then obtain h fx rxx h fy rxy h r fz zx
rxy ryy ryz
rzx ryz rzz
1
ax ay a z
(56)
Using the h fx , h fy , h fz , the factor point is given by
fˆi hfx xi hfy yi hfz zi
(57)
Independent load can be evaluated as exi xi ax e y f i ay yi i a e z z zi i
(58)
3. TWO-FACTOR ANALYSIS 3.1. Basic Equation We treat a two-factor analysis with three normalized variables x, y , z in this section. The one factor point is denoted as f , and the other as g . The factor load associated with f is denoted as ax , ay , az , and the factor load associated with g is denoted as
b , b , b . The independent load is denoted as e , e , e . The variable is then expressed x
y
z
x
y
z
as x ax bx ex y f a g y by ey z a b e z z z
(59)
Expressing each data as i , we have xi ax bx eix yi fi a y gi by eiy a b e z z z iz i
(60)
The factor loadings
Factor Analysis
375
a , a , a and b , b , b
are independent of the member, but
x
y
z
x
y
z
depend on x, y , z . The factor point f and g depends on the member, but independent on x, y , z . The independent variables
ex , ey , ez
depend on both x, y , z and member.
The terms on right side of Eq. (60) are all unknown variables. We need some restrictions on them to decide the terms We assume that factor points f and g are normalized, that is E f 0
(61)
V f 1
(62)
Eg 0
(63)
V g 1
(64)
The average of the independent load is zero, that is, E e 0
(65)
where x, y , z . The covariance between f , g , e are zero, that is, Cov f , g 0
(66)
Cov f , e 0
(67)
Cov g , e 0
(68)
Cov e , e 0
for
We define the vectors below.
(69)
Kunihiro Suzuki
376 f1 f f 2 fn
(70)
g1 g g 2 gn
(71)
e1x e ex 2 x enx
(72)
e1 y e2 y ey eny
(73)
e1z e ez 2 z enz
(74)
x1 X y1 z 1
x2 y2 z2
ax bx A a y by a b z z
xn yn zn
(75)
(76)
Factor Analysis f F 1 g1
f2 g2
e1x e1 y e 1z
e2 x e2 y e2 z
fn gn
377
(77)
enx eny enz
(78)
Eq. (59) can then be expressed by
X AF
(79)
The final form is exactly the same as the one for one factor analysis. T Multiplying X to the Eq. (90) from the right side, and expanding it, we obtain
x1 T XX y1 z 1
x xn 1 x yn 2 zn xn
x2 y2 z2
nS xx nS xy nS xz
nS xy nS yy nS yz
1 n rxy r xz
rxy 1 ryz
y1 y2 yn
z1 z2 zn
nS xz nS yz nS zz rxz ryz 1
(80)
T
where X is the transverse matrix of X . Similarly, we multiply transverse matrix to Eq. (16) from the right side, we obtain
AF AF AF F T AT T T
AFF T AT AF T F T AT T A FF T AT A F T F T AT T T
We evaluate the terms in Eq. (81) as below.
(81)
Kunihiro Suzuki
378
f FF T 1 g1
f2 g2
n 2 fi i 1 n gi fi i 1 1 0 n 0 1
f F T 1 g1
f1 fn f2 gn fn n fi gi i 1 n 2 gi i 1
g1 g2 gn
(82) e1x f n e2 x gn enx
f2 g2
fi eix fi eiy i 1 i 1 n n gi eix gi eiy i 1 i 1 0 0 0 0 0 0 n
n
e1 y e2 y eny
e1z e2 z enz
i 1 n gi eiz i 1 n
fe
i iz
(83)
Therefore, we obtain 1 1 T XX rxy n r xz
rxy 1 ryz
rxz 1 ryz AAT T n 1
(84)
T We further expand the term AAT and 1 n as
ax bx ax a y az AAT a y by a b bx by bz z z ax ax bx bx ax a y bx by a y ax by bx a y a y by by a a b b a a b b z x z y z y z x
ax az bx bz a y az by bz az az bz bz
ax ax ax a y ax az bx bx bx by bx bz a y ax a y a y a y az by bx by by by bz a a a a a a b b b b b b z y z z z y z z z x z x ax bx a y ax a y az by bx by bz a b z z
(85)
Factor Analysis e1x 1 T 1 e1 y n n e1z
e2 x e2 y e2 z
e1x enx e2 x eny enz enx
e1 y e2 y eny
379
e1z e2 z enz
n n n 2 eix eiy eix eiz eix i 1 i 1 i 1 n n n 1 2 eiy eix eiy eiy eiz n i 1 i 1 i 1 n n n 2 eiz eix eiz eiy eiz i 1 i 1 i 1 n 1 2 0 0 n eix i 1 1 n 2 0 eiy 0 n i 1 1 n 2 0 0 eiz n i 1 Se 2 0 0 x 2 0 Sey 0 2 0 0 S ez
(86)
Finally, we obtain 1 rxy r xz
rxy 1 ryz
rxz ax ryz a y ax 1 az
ay
bx az by bx b z
by
Se 2 x bz 0 0
0 Sey 2
0
0 0 2 Sez
(87)
This is the basic equation for two factor analysis. Let us compare the case for principal component analysis, where the covariance matrix is expanded as follows. 1 rxy r xz
rxy 1 ryz
rxz ryz 1v1v1T 2 v2 v2T 3v3v3T 1
(88)
The right side terms are all determined clearly. We utilize this equation for principal component analysis as shown in the one factor analysis.
Kunihiro Suzuki
380
One subject is added for two factor analysis. We will finally obtain X , A, F , in the end, which satisfy
X AF
(89)
However, in the mathematical stand point of view, Eq. (89) is identical to
X ATT 1F
(90)
We assume that T is the rotation matrix, Eq. (90) can be modified as X AT T F
(91)
where A ' AT ' F T F
(92)
Therefore, if A snd F are the root of Eq. (89), A' , F ' are also the root of it. We will discuss this uncertainty later.
3.2. Factor Load for Two-Factor Variables We can expand the covariance matrix as below as shown in the one factor analysis given by 1 rxy r xz
rxy 1 ryz
rxz ryz 1v1v1T 2 v2 v2T 3v3v3T 1
1 v1
We regards the first term factor analysis, that is,
1 v1
T
1 v1
2 v2
1 v1
2 v2
T
T
2 v2
3 v3
2 v2
3 v3
T
(93)
T
as a factor load in the
Factor Analysis 1 rxy r xz
rxz ryz 1
rxy 1 ryz
1_1 v1_1
1_1 v1_1
T
381
2 _1 v2 _1
2 _1 v2 _1
T
Se 2 x 0 0
0 Sey 2
0
0 0 2 Sez
(94)
Therefore, we approximate the factor load as
ax bx a y 1_1 v1_1 , by 2 _1 v2 _1 a b z z
(95)
This is not the final form factor loadings, but is the first approximated one. The first approximated independent load can be evaluated as Se 2_1 x 0 0
0 2
Sey _1 0
0 1 0 rxy r 2 xz Sez _1
rxy 1 ryz
rxz ryz 1
1_1 v1_1
1_1 v1_1
T
2 _1 v2 _1
T 2 _1 v2 _1
(96)
The non-diagonal elements on the right side of Eq. (96) are not 0 in general. Therefore, Eq. (96)is invalid from the rigorous point of view. However, we ignore the non-diagonal elements and regard the diagonal element on both sides are the same. We then form the new covariance matrix given by 1 Se 2_1 rxy rxz x 2 R1 rxy 1 Sey _1 ryz 2 r r 1 S xz yz ez _1
(97)
We evaluate the eigenvalue and eigenvectors of the matrix, and evaluate below. Se 2_ 2 x 0 0 1 rxy r xz
0 2 Sey _ 2 0 2 0 Sez _ 2 rxy rxz 1 ryz 1_ 2 v1_ 2 ryz 1 0
1_ 2 v1_ 2
T
(98)
Kunihiro Suzuki
382
We evaluate the difference between the approximated independent factors given by Se 2_ 2 Se 2_1 x x 2 2 Sey _ 2 Sey _1 S 2 S 2 ez _1 ez _ 2
(99)
If the value below is less than a certain value
2 2 1 2 2 2 Se 2_ 2 Se 2_1 Se 2_ 2 Se 2_1 Sex _ 2 Sex _1 y y z z 3
(100)
We regard the component is the load factor given by
a 1_ 2 v1_ 2 , b 2 _ 2 v2 _ 2
(101)
If the value is more than a certain value, we construct the updated covariance matrix given by 1 Se 2_ 2 rxy rxz x 2 R2 rxy 1 Sey _ 2 ryz 2 r r 1 S xz yz e _ 2 z
(102)
We evaluate the eigenvalue and eigenvector of it and the updated one denote as
1_ 3 , v1_ 3 , 2 _ 3 , v2 _ 3 . We can then evaluate the updated independent matrix given by Se 2_ 3 x 0 0 1 rxy r xz
0 2 Sey _ 3 0 2 0 Sez _ 3 rxy rxz 1 ryz 1_ 3 v1_ 3 ryz 1 0
1_ 3 v1_ 3
T
2 _ 3 v2 _ 3
T 2 _ 3 v2 _ 3
(103)
Factor Analysis
383
We further evaluate the updated difference of the independent element matrix below. Se 2_ 3 Se 2_ 2 x x 2 2 Sey _ 3 Sey _ 2 S 2 S 2 ez _ 2 ez _ 3
(104)
and evaluate below.
2 2 2 1 2 2 2 2 2 2 Sex _ 3 Sex _ 2 Sey _ 3 Sey _ 2 Sez _ 3 Sez _ 2 3
(105)
We repeat this process until it is less than the certain value. When we obtain this value after the m times process, we obtain the load factor as
a 1_ m v1_ m , b 2 _ m v2 _ m
(106)
The corresponding diagonal elements are given by d x2 1 Sex _ m , d y2 1 Sey _ m , d z2 1 Sez _ m 2
2
2
(107)
which we utilize later.
3.3. Uncertainty of Load Factor and Rotation There is uncertainty in load factors as we mentioned before. We solve the uncertainty in this section. We assume that we have factor loads a x , a y , a z and bx , by , bz using the showed procedure, and obtain factor points fi , gi i 1, 2,
, n . The procedure to obtain the
factor points will be shown later. We then obtain
xi ax yi a y z a i z
bx eix fi by eiy g bz i eiz
(108)
Kunihiro Suzuki
384
We rotate f , g with an angle of
, and the factor point
fˆ , gˆ , and both are related to i
f i , gi
is mapped to
i
f i cos gi sin
sin f i cos gi
(109)
f i cos gi sin
sin f i cos gi
(110)
Therefore, we obtain ax bx ax bx fi cos sin fi a y by g a y by sin cos g i a b i a b z z z z ax cos bx sin ax sin bx cos f a y cos by sin a y sin by cos i a cos b sin a sin b cos gi z z z z ax bx f a y by i g i az bz
(111)
where
ax ax cos bx sin a y a y cos by sin az az cos bz sin
(112)
bx ax sin bx cos by a y sin by cos bz az sin bz cos
(113)
The above operation is the product of load factor matrix denoted as T and we obtain
A
and rotation matrix
Factor Analysis
385
AT A
(114)
We can select any rotation as the stand point of root. How, we can select the angle? We select the angle so that factor points have strong relationship to the factor load. We introduce a Valimax method, where the elements in A has the maximum and minimum values. We introduce a variable 2
2 1 1 1 2 2 V a 2 a b 3 x, y , z 3 x , y , z 3 x, y , z
2
2 1 b 32 x, y , z
2
2
(115)
and evaluate the angle that givens the maximum V , which is called as a crude Valimax method. We can evaluate the angle that maximize V numerically. However, we can obtain its analytical one as shown below. We consider the first and third terms on the right side of Eq. (115), and modify them below. 2
2 b a x, y, z x, y, z
2
2
2
2 a sin b cos 2 a cos b sin x, y,z x, y,z
a
4
cos 4 4a 3b cos3 sin 6a 2b 2 cos 2 sin 2 4a b 3 cos sin 3 b 4 sin 4
4
sin 4 4a 3b sin 3 cos 6a 2b 2 sin 2 cos 2 4a b 3 sin cos 3 b 4 cos 4
4
b 4 cos 4
x, y,z
a
x, y,z
a
2
x, y,z
a b a b cos sin
4
3
3
3
x, y,z
12
x, y,z
a 2b 2 cos 2 sin 2
a b a b cos sin
4
3
3
3
x, y,z
a
x, y,z
4
b 4 sin 4
(116)
Kunihiro Suzuki
386
We then consider the second term of the right side of Eq. (115), and modify it as below.
2 a x , y , z
2
2 a cos b sin x , y , z
2
a 2 cos 2 2a b cos sin b 2 sin 2 x , y , z
a
2
cos 2 2a b cos sin b 2 sin 2
2
cos 2 2a b cos sin b 2 sin 2
x, y, z
a x, y, z
2
a 2 a 2 cos 4 x, y, z x, y, z 2 a b a 2 cos3 sin x, y, z x, y, z b 2 a 2 cos 2 sin 2 x, y, z x, y, z 2 a 2 a b cos3 sin x, y, z x, y, z 4 a b a b cos 2 sin 2 x, y, z x, y, z 2 b 2 a b cos sin 3 x, y, z x, y, z a 2 b 2 cos 2 sin 2 x, y, z x, y, z 2 a b b 2 cos sin 3 x, y, z x, y, z b 2 b 2 sin 4 x, y, z x, y, z Rearranging them with respect to the same term associated with
(117)
, we obtain
Factor Analysis 2 a x , y , z
387
2
2
a 2 cos 4 x, y , z 4 a 2 a b cos3 sin x , y , z x , y , z 2 2 a 2 b 2 4 a b cos 2 sin 2 x , y , z x , y , z x, y,z 4 b 2 a b cos sin 3 x, y, z x, y , z 2
b 2 sin 4 x, y, z
b x , y , z
2
(118)
2
2 a sin b cos x , y , z
2
a 2 sin 2 2a b sin cos b 2 cos 2 x , y , z
a
2
sin 2 2a b sin cos b 2 cos 2
a
2
sin 2 2a b sin cos b 2 cos 2
x, y, z
x, y, z
a 2 a 2 sin 4 x, y, z x, y, z 2 a b a 2 sin 3 cos x, y, z x, y, z b 2 a 2 sin 2 cos 2 x, y,z x, y, z 2 a 2 a b sin 3 cos x, y, z x, y, z 4 a b a b sin 2 cos 2 x, y, z x, y, z 2 b 2 a b sin cos 3 x, y,z x, y, z a 2 b 2 sin 2 cos 2 x, y, z x, y, z 2 a b b 2 sin cos 3 x, y, z x, y, z b 2 b 2 cos 4 x, y,z x, y, z
2
(119)
Kunihiro Suzuki
388
We then consider the fourth term of the right side of Eq. (115), and modify it as below. Rearranging them with respect to the same term associated with 2 b x , y , z
, we obtain
2
2
a 2 sin 4 x, y ,z 4 a b a 2 sin 3 cos x, y,z x, y ,z 2 a 2 b 2 4 a b a b sin 2 cos 2 x, y ,z x, y ,z x, y ,z x, y ,z 4 b 2 a b sin cos3 x, y ,z x, y ,z 2
b 2 cos 4 x, y ,z
(120)
Summarizing the second and fourth terms of the right side of Eq. (115), we obtain 2
2
2 2 a b x , y , z x , y , z 2 2 a 2 b 2 cos 4 x , y , z x , y , z 4 a 2 b 2 a b cos3 sin x, y, z x , y , z x , y , z 2 2 2 2 4 a b 2 a b cos sin 2 x , y , z x , y , z x , y , z 4 a 2 b 2 a b cos sin 3 x, y, z x , y , z x, y, z 2 2 a 2 b 2 sin 4 x , y , z x , y , z
(121)
Factor Analysis
389
Summarizing all, we obtain
3V 2
1 a 2 b 3 x, y , z x, y, z
2
2 2 1 a b 3 x , y , z x, y, z 2 2 1 4 4 2 2 a b a b cos 4 3 x , y , z x , y , z x , y , z 1 4 a 3b a b 3 a 2 b 2 a b cos3 sin 3 x , y , z x , y , z x , y , z 2 2 1 2 2 2 2 4 3 a b a b 2 a b cos sin 2 3 x , y , z x , y , z x , y , z x , y , z 1 4 a 3b a b 3 a 2 b 2 a b cos sin 3 3 x , y , z x, y , z x , y , z 2 2 1 4 4 2 2 a b a b sin 4 3 x , y , z x , y , z x , y , z K1 cos 4 K 2 cos3 sin K 3 cos 2 sin 2 K 2 cos sin 3 K1 sin 4 (122) 2
2
2
where
K1
a x, y, z
1 b a 2 b 2 3 x , y , z x, y, z 2
4
4
2
(123)
1 K 2 4 a 3b a b 3 a 2 b 2 a b 3 x , y , z x , y , z x , y , z
(124)
2 1 2 2 2 2 K3 4 3 a b a b 2 a b 3 x , y , z x , y , z x , y , z x , y , z
(125)
We obtain the condition where V has the local maximum as below.
Kunihiro Suzuki
390
3V 4 K1 cos3 sin
K 2 3cos 2 sin 2 cos 4 K 3 2 cos sin 3 2 cos3 sin K 2 sin 4 3cos 2 sin 2 4 K1 sin 3 cos
K 2 cos 4 4 K1 2 K 3 cos3 sin 6 K 2 cos 2 sin 2 4 K1 2 K 3 cos sin 3 K 2 sin 4 K 2 cos 4 4 K 4 cos3 sin 6 K 2 cos 2 sin 2 4 K 4 cos sin 3 K 2 sin 4 0 (126) where
1 K 4 K1 K3 2
(127)
Dividing Eq. (126) by cos 4 , we obtain
1 4
K4 K tan 6 tan 2 4 4 tan 3 tan 4 0 K2 K2
(128)
Arranging it, we obtain
4 tan tan 3 K2 tan 4 K 4 1 6 tan 2 tan 4
(129)
where we utilize a lemma given by
tan 4
4 tan tan 3 1 6 tan 2 tan 4
(130)
Factor Analysis
391
We can then obtain
K 4 tan 1 2 K4
(131)
We impose that the angle limitation is
2
4
2
(132)
We assume that 0 is the one root of it. 0 can become the other root as shown in Figure 2.
Figure 2. The two angles that provide the same value of tangent.
Therefore, we obtain two roots for that. One corresponds to the local maximum and the other local minimum. We then select the local maximum. Further differentiate Eq. (126) with respect to
, we obtain
Kunihiro Suzuki
392
2 3V 2 K 2 4 cos3 sin 4 K 4 3cos 2 sin 2 cos 4 6 K 2 2 cos sin 3 2 cos3 sin 4 K 4 sin 4 3cos 2 sin 2 K 2 4 cos sin 3 K 2 4 cos3 sin 12 cos sin 3 12 cos3 sin 4 cos sin 3 4 K 4 3cos 2 sin 2 cos 4 sin 4 3cos 2 sin 2 16 K 2 cos sin 3 cos3 sin 4 K 4 cos 4 sin 4 6 cos 2 sin 2 16 K 2 cos sin sin 2 cos 2 2 2 4 K 4 cos 2 sin 2 2 cos sin 4 K 4 cos 4 K 2 sin 4 0
(133)
Therefore, the condition for the local maximum is given by
K 4 cos 4 K 2 sin 4 0
(134)
We have the relationship of
K2 sin 4 tan 4 K4 cos 4
(135)
We then obtain
cos 4
K4 sin 4 K2
Substituting Eq. (136) into Eq. (134), we obtain
(136)
Factor Analysis
K22 K42 sin 4 0 K2
Multiplying
K2 2 / K2 2 K4 2 0
393
(137)
to Eq. (137), we obtain
K 2 sin 4 0
(138)
This is the condition for the local maximum. If the evaluated 4 do not hold Eq. (138), we change the angle to
4 4
(139)
We select the sign of so that the angle satisfy
4
4
(140)
We consider the various cases and establish the evaluation process. When
8
K 2 sin 4 0
is held, this ensures the below.
8
(141)
We adopt the angle as it is. When
K 2 sin 4 0
is not held, we then consider the second condition.
When sin 4 0 is held, 4 exist in the first quadrant. When we adopt the negative sign, we obtain
4
2
Therefore, corresponding angle region is
(142)
Kunihiro Suzuki
394
4
8
(143)
This is in the acceptable region. When we adopt positive singe, we obtain
4
3 2
(144)
Therefore, the corresponding angle region is
3 4 8
(145)
This is the outside of the acceptable region. We then consider the case of sin 4 0 . This means that 4 exist in the fourth quadrant region. When we adopt negative sign, we obtain
3 4 2
(146)
Therefore, corresponding angle region is
3 8 4
(147)
This is the outside of the acceptable region. When we adopt positive sign, we obtain
4 2
(148)
Therefore, corresponding angle region is
8
4
(149)
Factor Analysis
395
This is in the acceptable region. Table 1 summarizes the above procedure. Table 1. Conditions for determine angle Condition 1
Condition 2
4
K 2 sin 4 0
None
None
K 2 sin 4 0
conversion
Range of
sin 4 0
4 4
sin 4 0
4 4
8
8
4
8
8
4
A crude Valimax method, which we explain up to here, is inadequate when the factor 2 d2 1 Se2 is large. When d is large, the absolute value of factor loads are also apt to
become large. Consequently, the rotation angle is dominantly determined by the factor load with large instead of whole data. Since we want to evaluate the angle related to the whole data, the result with the crude Valimax method may be unexpected one. Therefore, we
V by changing the elements to a d , b d . 2
2
modify
Therefore, we evaluate 2
2
2
2
2 a 2 a 1 1 S 2 3 x , y , z d 3 x , y , z d
2 b 2 b 1 1 2 3 x , y , z d 3 x , y , z d
(150)
and decide the which gives the maximum S . This is called as a criteria Valimax method. The corresponding factors are given by
tan 4
where
G2 G4
(151)
Kunihiro Suzuki
396
2 2 2 2 a 4 b 4 1 a b G1 x , y , z d d 3 x , y , z d x , y , z d
(152)
a 3 b a b 3 1 a 2 b 2 a b G2 4 x, y , z d d d d 3 x, y , z d d x, y , z d d (153) 2 2 2 2 2 a b 1 a b a b G3 4 3 2 d d 3 d d d d x , y , z x , y , z x, y , z x , y , z
(154)
1 G4 G1 G3 2
(155)
G2 sin 4 0
(156)
3.4. Factor Point and Independent Load with Two-Factors We review the process up to here. The basic equation for two-variable factor analysis is given by
xi ax bx exi yi fi a y gi by eyi a b e z z z zi i
(157)
We decide the factor load, and variances of ex , ey , ez . We further need to evaluate fi , gi and ex , ey , ez themselves. We assume that x, y , z are commonly expressed with f and g . This means that the predictive value of
fˆi ˆ and can be expressed with the linear sum of gi
fˆi hfx xi hfy yi hfz zi
x, y , z
given by
(158)
Factor Analysis
gˆ i hgx xi hgy yi hgz zi
397
(159)
We focus on fi first. This is the same analysis for one-factor analysis and we obtain h fx rxx h fy rxy h r fz zx
rxy ryy ryz
rzx ryz rzz
1
ax ay a z
(160)
Using the h fx , h fy , h fz , the factor point is given by
fˆi hfx xi hfy yi hfz zi
(161)
We can perform the same analysis for g , and obtain hgx rxx hgy rxy h r gz zx
rxy ryy ryz
rzx ryz rzz
1
bx by b z
(162)
Using the hgx , hgy , hgz , the factor point is given by
gˆ i hgx xi hgy yi hgz zi
(163)
Summarizing above, we can express as h fx h fy h fz
hgx rxx hgy rxy hgz rzx
rxy ryy ryz
rzx ryz rzz
1
ax ay a z
bx by bz
(164)
Independent factor is expressed by exi xi ax bx eyi yi fi a y gi by b e z a z zi i z
(165)
Kunihiro Suzuki
398
This can be further expressed for any member i as
ex1 ex 2 exn
ey1 ey 2 eyn
ez1 x1 ez 2 x2 ezn xn
y1 y2 yn
z1 f1 z2 f 2 zn f n
g1 g 2 ax bx gn
ay by
az bz (166)
4. GENERAL FACTOR NUMBER We expand the analysis to any number of factors in this section. We treat
q
p
variables,
factors, and n data here. The restriction is
q p
(167)
The analysis is basically the same as the ones for two-factor analysis. We repeat the process again here without skipping to understand the derivation process clearly.
4.1. Description of the Variables We treat various kinds of variables, and hence clarify the definition in the start point. The kind number of variables is p , and each variable has data number of n . The probability variables are denoted as
X1 , X 2 , , X p
(168)
We express all of them symbolically as X for 1,2,
,p
Each data is expressed by
(169)
Factor Analysis
399
X 1 x11 , x21 , , xi1 , , xn1 X 2 x12 , x22 , , xi 2 , , xn 2 X p x1 p , x2 p , , xip , , xnp
(170)
We express all of them symbolically as
X i
(171)
We sometimes express 2 variable symbolically as
X , X
(172)
There are q kinds of factor point denoted by
f1 , f 2 , , f q
(173)
We express this symbolically as
fk for k 1, 2,
,q
(174)
When we express that for each data, we then express the factor points as
f1 : f11 , f 21 ,
, f n1
f 2 : f12 , f 22 ,
, fn2
f q : f1q , f 2 q ,
, f nq
(175)
We express this symbolically as
fik When we express two factor point symbolically as
(176)
Kunihiro Suzuki
400
f k , fl
(177)
We have q kinds of factor loads expressed by
a1 , a2 , , aq
(178)
Each factor load has p kinds of element and is expressed by a11 a21 a1 a p1 a12 a22 a2 ap2 a1q a2 q aq a pq
(179)
We express the factor load symbolically as
ak
(180)
The corresponding elements are also expressed symbolically as
ak : a k
(181)
Independent factor has p kinds and each has n data, which is denoted as
e1 , e2 , , e p It can be expressed symbolically as
(182)
Factor Analysis e for 1,2,
,p
401 (183)
We can express each independent factor related to data as e1 : e11 , e21 ,
, en1
e2 : e12 , e22 ,
, en 2
e p : e1 p , e2 p ,
, enp
(184)
It can be expressed symbolically as ei
(185)
When we express two independent factors symbolically as e , e
(186)
We can obtain eigenvalues of p kinds given by
1 , 2 , , p
(187)
We utilize the factor number of them given by
1 , 2 , , q
(188)
4.2. Basic Equation We assume normalized p variable of x1 , x2 , , x p and q kinds of factor of
f1 , f 2 , , f q . The factor load associated with the factor point f1 given by
a
11
by
a21
a p1 . In general, the factor load associated with the factor point f k is given
Kunihiro Suzuki
402 a1k a2 k ak a pk
(189)
Therefore, the data are expressed by
x1 a11 a12 x2 f a21 f a22 1 2 xp a p1 ap2
a1q e1 a2 q e2 fq a pq e p
(190)
Focusing on data i , which correspond to the all data for member i , we express it as xi1 a11 a12 xi 2 f a21 f a22 i1 i2 xip a p1 ap2
a1q ei1 a2 q ei 2 fiq a pq eip
for i 1, 2,
,n
(191)
Note that a k is independent of each member i , factor point is independent of x , but depend on i . ei depends both on i and x . To solve Eq. (191), we impose some restrictions below. We assume that the factor point f k is normalized, that is, E fk 0
(192)
V fk 1
(193)
The average of independent factor is zero, that is,
E ex 0 where 1, 2, , p . The covariance between f k and e is zero, that is,
(194)
Factor Analysis
403
Cov f k , fl kl
(195)
Cov f k , e 0
(196)
Cov e , e
(197)
We define the vectors below. f11 f f1 21 f n1
(198)
f12 f f 2 22 fn2
(199)
・・・ f1q f2q fq f nq
(200)
e11 e e1 21 en1
(201)
e12 e e2 22 en 2
(202)
Kunihiro Suzuki
404 e1 p e2 p ep enp
x11 x12 X x1 p a11 a21 A a p1
f11 f12 F f1q
e11 e12 e1 p
(203)
x21 x22 x2 p a12 a22 ap2
f 21 f 22 f2q
e21 e22 e2 p
xn1 xn 2 xnp a1q a2 q a pq
(204)
(205)
f n1 fn2 f nq
(206)
en1 en 2 enp
(207)
Then the data are expressed with a matrix form as
X AF T Multiplying X to the Eq. (90) from the right side, and expanding it, we obtain
(208)
Factor Analysis
x11 x21 x12 x22 T XX x1 p x2 p nS11 2 nS12 2 2 2 nS21 nS22 nS 2 nS 2 p2 p1 r11 r21 n rp1
xn1 x11 xn 2 x21 xnp xn1 2 nS1 p 2 nS2 p 2 nS pp
x12 x22 xn 2
r1 p r2 p rpp
r12 r22 rp 2
405
x1 p x2 p xnp
(209)
T
where X is the transverse matrix of X . Since the variables are normalized, the covariance is equal to the correlation factor and hence we obtain
r S 2
(210)
r 1 for 1,2, , p
(211)
Similarly, we multiply the transverse matrix to Eq. (16) from the right side, we obtain
AF AF AF F T AT T T
AFF T AT AF T F T AT T A FF T AT A F T F T AT T T
We evaluate the terms in Eq. (81) as below.
(212)
Kunihiro Suzuki
406 f11 f 21 f12 f 22 FF T f1q f 2 q n 2 fi1 i 1 n f f i 1 i 2 i1 n fiq fi1 i 1 1 0 0 1 n 0 0
f11 f 21 f12 f 22 T F f1q f 2 q n fi1ei1 i 1 n f e i 1 i 2 i1 n fiq ei1 i 1 0 0 0 0 0 0
f n1 f11 f n 2 f 21 f nq f n1 n
f i 1
n
i 1
2 i2
n
i 1
fn2 f i 1 n fi 2 fiq i 1 n 2 fiq i 1 n
f
i1 i 2
f f
f1q f2q f nq
f12 f 22
f
iq i 2
f
i1 iq
0 0 1
(213) f n1 e11 e12 f n 2 e21 e22 f nq en1 en 2
n
f i 1
i 1
e
i2 i2
n
f i 1
e i 1 n fi 2 eiq i 1 n f iq eiq i1 n
e
i1 i 2
n
f
e1q e2 q enq
e
iq i 2
f
i1 iq
0 0 0
(214)
Therefore, we obtain 1 0 0 1 1 XX T n 0 0
0 0 1 AAT T n 1
(215)
Factor Analysis
407
T We further expand the term AAT and 1 n as
a1q a11 a21 a p1 a11 a12 a a a a a a 21 22 2 q 12 22 p2 AAT a pq a1q a2 q a pq a p1 a p 2 a112 a12 2 a1q 2 a11a21 a12 a22 a1q a2 q a a a a a a a212 a22 2 a2 q 2 22 12 2 q 1q 21 11 a a a a a a a p1a21 a p 2 a22 a pq a2 q p 2 12 pq 1q p1 11 2 a11 a11a21 a11a p1 2 a a a a 21 21a p1 21 11 2 a a a p1 p1 11 a p1a21 a12 2 a12 a22 a12 a p 2 2 a a a a 22 22 a p 2 22 12 2 a a ap2 p 2 12 a p 2 a22 a1q 2 a a 2 q 1q a a pq 1q a11
a12
a11a p1 a12 a p 2 a21a p1 a22 a p 2 a p12 a p 2 2
a1q a pq a2 q a pq 2 a pq
a1q a pq a2 q a pq 2 a pq a2 q a pq a11 a21 a21 a p1 a p1 a12 a22 a22 ap2 ap2 a1q a2 q aa2 q 2
a1q
a1k
a2 q
q
k 1
a2 k
a1q a2 q a pq a pq a1k a2 k a pk a pk
(216)
Kunihiro Suzuki
408 e11 1 T 1 e12 n n e1 p
e21 e22 e2 p
1 n 2 n ei1 i 1 1 n e e n i 1 i 2 i1 n 1 e e n ip i1 i 1 Se 2 1 0 0ix 1
0 Se2 2
0
en1 e11 e12 e1 p en 2 e21 e22 e2 p enp enp en1 en 2 1 n 1 n ei1ei 2 ei1eip n i 1 n i 1 1 n 2 1 n ei 2 ei 2 eip n i 1 n i 1 n n 1 1 2 eip ei 2 eip n i 1 n i 1 0 0 2 Sep
(217)
Finally, we obtain 1 r21 rp1
r12 1 rp 2
r1 p r2 p q a1k k 1 1
a2 k
2 a1k Se1 a2 k 0 a pk a pk 0ix1
0 Se2 2
0
0 0 2 Se p
(218)
This is the basic equation for the factor analysis.
4.3. Factor Load We can expand the covariance matrix as below as shown in the one factor analysis given by 1 r12 r21 1 rp1 rp 2
r1 p r2 p v v T 2 v2 v2T 111 1
p v p v pT
(219)
Factor Analysis We regards the first term
1 v1
1 v1
T
409
2 v2
2 v2
T
q vq
q vq
T
as
factor load in the factor analysis, that is, 1 r21 rp1
r1 p r2 p 1
r12 1 rp 2
1 v1
Se 2 1 0 0
1 v1
T
2 v2
2 v2
T
q vq
q vq
0 0 2 Se p
0 Se2 2
0
T
(220)
This is not the final form factor loadings, but is the first approximated one. The first approximated independent load can be evaluated as
a1 1 v1 a2 2 v2 aq q vq
(221)
We regard these as first approximated ones, that is, we denote them as
1_1 , v1_1 , 2 _1 , v2 _1 , , q _1 , vq _1 Se 2_1 1 0 0 1 r21 rp1
0 Se2 _1 2
0 r12 1 rp 2
1_1 v1_1
(222)
0 0 2 Se p _1 r1 p r2 p 1
1_1 v1_1
T
2 _1 v2 _1
2 _1 v2 _1
T
q _1 vq _1
q _1 vq _1
T
(223)
Kunihiro Suzuki
410
The non-diagonal elements on the right side of Eq. (96) are not 0 in general. Therefore, Eq. (96) is invalid from the rigorous point of view. However, we ignore the non-diagonal elements and regard the diagonal element on both sides are the same. We then form the new covariance matrix given by 1 Se 2_1 r12 1 2 r 1 Se2 _1 R1 21 rp1 rp 2
r2 p 2 1 Se p _1 r1 p
(224)
We evaluate the eigenvalue and eigenvectors of the matrix, and evaluate below.
1_ 2 , v1_ 2 , 2 _ 2 , v2 _ 2 , , q _ 2 , vq _ 2
(225)
The diagonal elements can then be evaluated as Se 2_ 2 0 1 2 0 Se2 _ 2 0 0 1 r12 r21 1 rp1 rp 2
1_ 2 v1_ 2
0 0 2 Se p _ 2 r1 p r2 p 1
1_ 2 v1_ 2
T
2 _ 2 v2 _ 2
2 _ 2 v2 _ 2
T
q _ 2 vq _ 2
q _ 2 vq _ 2
T
(226) We evaluate the
below.
1 2 2 2 2 2 2 S S S S e _ 2 e _1 e _ 2 e 1 1 2 2 _1 q
2 2 Seq _ 2 Seq _1
2
If it is smaller than the setting value of c , we regard the factor load as
(227)
Factor Analysis
411
a1 1_ 2 v1_ 2 a2 2 _ 2 v2 _ 2 aq q _ 2 vq _ 2
(228)
If c , we construct updated covariance matrix given by 1 Se 2_ 2 r12 1 2 r21 1 Se2 _ 2 R2 rp1 rp 2
r2 p 2 1 Se p _ 2 r1 p
(229)
and evaluate the eigenvalues and eigenvectors given by
1_ 3 , v1_ 3 , 2 _ 3 , v2 _ 3 , , q _ 3 , vq _ 3
(230)
We evaluate the diagonal elements as Se 2_ 3 0 1 2 0 Se2 _ 3 0 0 1 r12 r21 1 rp1 rp 2
1_ 3 v1_ 3
0 0 2 Sep _ 3 r1 p r2 p 1
1_ 3 v1_ 3
T
2 _ 3 v2 _ 3
2 _ 3 v2 _ 3
T
q _ 3 vq _ 3
q _ 3 vq _ 3
T
(231) We then evaluate
given by
1 2 2 2 2 2 2 S S S S e _ 3 e _ 2 e _ 3 e 1 2 _2 2 q 1
2 2 Seq _ 3 Seq _ 2
2
(232)
Kunihiro Suzuki
412 If
is smaller than c , we decide the factor load as
a1 1_ 3 v1_ 3 a2 2 _ 3 v2 _ 3 aq q _ 3 vq _ 3
If
(233)
is larger than c we repeat this process until
2 2 1 2 2 Se 2_ m Se 2_ m1 S S e _ m e _ m 1 1 2 2 q 1
2 2 Seq _ m Seq _ m1
2
(234)
c is held. We finally obtain the factor load given by a11 a21 a1 1_ m v1_ m a p1 a12 a22 a2 2 _ m v2 _ m ap2
aq q _ m vq _ m
a1q a2 q a pq
(235)
On the other hand, we obtain the final diagonal elements as d12 1 Se1 _ m 2
d 22 1 Se2 _ m 2
d q2 1 Seq _ m 2
(236)
Factor Analysis
413
4.4. Uncertainty of Load Factor and Rotation We will finally obtain X , A, F , in the end. However, in the mathematical stand point of view, it is expressed by X AF ATT 1 F
(237)
We assume that T is the rotation matrix, Eq. (90) can be modified as X AT T F
(238)
where A ' AT ' F T F
(239)
Therefore, if A and F are the root of Eq. (89), A and F are also the roots of it. We will discuss this uncertainty later. In the case of two-factor analysis, we can consider one angle. However, in the q-factor analysis, we have q C2 kinds of planes to be considered. We consider two factors among them. The corresponding factor loads are a1k a1l a a2l 2k ak , al a pk a pl
for k l
(240)
The conversion is expressed by the matrix as a1k a1l a2 k a2l cos sin sin cos a pk a pl a1k cos a1l sin a1k sin a1l cos a2 k cos a2l sin a2 k sin a2l cos a pk cos a pl sin ak sin al cos
(241)
Kunihiro Suzuki
414
Therefore, the converted factor load is given by a1k cos a1l sin a1k a2 k cos a2l sin a2 k ak a pk cos a pl sin a pk
(242)
a1k sin a1l cos a1l a sin a2l cos a2l al 2 k ak sin al cos a pl
(243)
We decide the angle
so that it maximizes the S below.
2
2
2
2
2 2 1 p a k 1 p a k S p 1 d p 2 1 d 2 2 1 a l 1 p a l p 1 d p 2 1 d p
(244)
The is solved as
tan 4
G2 G4
(245)
The critical condition is given by
G2 sin 4 0
(246)
Where each parameter is given by 2 2 2 2 a 4 a 4 1 p a k p a l k l G1 d p d d 1 d 1 1 p
(247)
Factor Analysis a 3 a a a 3 1 p a 2 a 2 p a a G2 4 k l k l k l k l x, y , z d d d d p 1 d d 1 d d
415
(248) 2 2 2 p a 2 a 2 p a k a l 1 p a k p a l k l G3 4 3 2 p 1 d 1 d 1 d d 1 d d (249)
1 G4 G1 G3 2
(250)
G2 sin 4 0
(251)
We select two factors. The factor k , l is expressed by the matrix below... 1 T 0
1 (k ,k )
( k ,l )
cos
sin 1 1
(l ,k )
sin
( l ,l )
cos 1
0 1
(252)
We then form AT
(253)
We perform whole combination of p C2 . When we perform the Valimax rotation, the two factor loads changes in general. Next, we select the same one and different one in the next selection, both factor load changes. Therefore, we need to many cycles so that the change becomes small.
Kunihiro Suzuki
416
4.5. Factor Point and Independent Load with General Number-Factors We review the process up to here. The basic equation for general number p of variables factor analysis is given by xi1 a11 a12 xi 2 f a21 f a22 i1 i2 xip a p1 ap2
a1q ei1 a2 q ei 2 fiq a pq eip
We decide factor load, and variance of e1 , e2 , and ei1 , ei 2 ,
for i 1, 2,
,n
(254)
, e p . We further need to evaluate fi , gi
, eip themselves.
We assume that x1 , x2 , , x p are commonly expressed with f1k , f 2 k ,
, f pk This means
that the predictive value of fˆik can be expressed with the linear sum of x1 , x2 , , x p given by
fˆik h1k xi1 h2k xi 2
hpk xip
(255)
Assuming fik as real data, the square of the deviation is given by n
Qk fik fˆik i 1
2
f ik h1k xi1 hk 2 xi 2 n
i 1
hpk xip
2
(256)
We decide h1k , h2k , , hpk so that it minimize Eq. (256). Partial differentiating Eq. (256) with respect to h1k , and set it as 0, we obtain Qk h1k h1k
n
f i 1
ik
h1k xi1 hk 2 xi 2
2 fik h1k xi1 hk 2 xi 2 n
i 1
hpk xip
2
hpk xip xi1 0
(257)
Factor Analysis
417
This leads to
h n
1k xi1 xi1 hk 2 xi 2 xi1
i 1
hpk xip xi1 fik xi1 n
i 1
(258)
The left side of Eq. (258) is modified as
h n
x x hk 2 xi 2 xi1
1k i1 i1
i 1
hpk xip xi1 h1k xi1 xi1 h2 k xi 2 xi1 n
n
i 1
i 1
n h1k r11 h2k r21
hpk rp1
n
hpk xip xi1 i 1
(259)
The right side of Eq. (258) is modified as n
f i 1
x
ik i1
fik a11 fi1 a12 fi 2 n
i 1
n
n
i 1
i 1
a1q f iq
a11 fik fi1 a12 f ik f i 2
n
a1q f ik f iq i 1
n
a1k fik fik i 1
na1k
(260)
Therefore, Eq. (258) is reduced to
h1k r11 h2k r21
hpk rp1 a1k
(261)
Performing the similar analysis with respect to h k , we obtain Q f h k
h k
n
f i 1
ik
h1k xi1 h2 k xi 2
2 fik h1k xi1 h2 k xi 2 n
i 1
We then obtain
hpk xip
2
hpk xip xi 0
(262)
Kunihiro Suzuki
418
h n
x x h2 k xi 2 xi
1k i1 i
i 1
hpk xip xi fik xi n
i 1
(263)
The left side of Eq. (263) is modified as
h n
1k xi1 xi h2 k xi 2 xi
i 1
hpk xip xi h1k xi1 xi h2k xi 2 xi n
n
i 1
i 1
n h1k r1 h2k r2
hpk rp
n
hpk xip xi i 1
(264) The right side of Eq. (263) is modified as n
f i 1
x
ik i
fik a 1 fi a 2 fi 2 n
i 1
n
n
i 1
i 1
a q fiq
a11 fik fi1 a12 f ik f i 2
n
a1q f ik f iq i 1
n
a1k fik fik i 1
na1k
(265)
Therefore, Eq. (263) is reduced to
h1k r1 h2k r2
hpk rp a k
(266)
Summarizing above, we obtain 1 r12 r21 1 rp1 rp 2
r1 p h1k a1k r2 p h2 k a2 k 1 h2 k a2 k
We then obtain the factor point factors as
(267)
Factor Analysis h1k 1 h2 k r21 hpk rp1
r12 1 rp 2
r1 p r2 p 1
1
419
a1k a2 k a2 k
(268)
We can then obtain factor points as
fˆik h1k xi1 h2k xi 2
hpk xip
(269)
The independent factor is given by ei1 xi1 a11 a12 ei 2 xi 2 f a21 f a22 i1 i 2 eip xip a p1 ap2
a1q a2 q fiq a pq
for i 1, 2,
,n
(270)
The above results are totally expressed by e11 e12 e21 e22 en1 en 2
e1 p x11 e2 p x21 enp xn1
x12 x22 xn 2
x1 p f11 x2 p f 21 xnp f n1
f12 f 22 fn2
f1q a11 f 2 q a12 f nq a1q
a21 a22 a2 q
a p1 ap2 a pq
(271)
5. SUMMARY To summarize the results in this chapter— We treat various kinds of variables in this analysis, and hence summarize the treatments systematically in this section.
5.1. Notation of Variables Number of kinds of variables is p , data number is n , and factor number is q . , . Symbolic expressions for the variable are
Kunihiro Suzuki
420
Symbolic expressions for the data are
i, j
Symbolic expressions for the factors are k , l
i, j , , , k , l The order of the subscript is given by Based on the above, each expression is as follows. p kinds variable are expressed as
x1 , x2 , , x p This can be expressed with symbolically as x for 1,2, , p
The detailed expression that denotes the each data value is given by x1 : x11 , x21 ,
, xn1
x2 : x12 , x22 ,
, xn 2
x p : x1 p , x2 p ,
, xnp
The corresponding symbolic expression is given by xi
Two kinds of variables can be expressed symbolically as x , x
The normalize variables are expressed by
u1 , u2 , , u p This can be expressed symbolically as u for 1,2,
,p
Factor Analysis The detailed expression that denotes the each normalized data value is given by u1 : u11 , u21 , , un1 u2 : u12 , u22 , , un 2 u p : u1 p , u2 p , , unp
This can be expressed symbolically as ui
Two kinds of normalized variables are symbolically expressed as u , u
We have q kinds of factor points and they are denoted as
f1 , f 2 , , f q We express them symbolically as fk for k 1,2,
,q
The detailed expression to denote the each data value is given by f1 : f11 , f 21 ,
, f n1
f 2 : f12 , f 22 ,
, fn2
f q : f1q , f 2 q ,
, f nq
This can be expressed symbolically as fik
Two kinds of factor points are symbolically expressed as f k , fl
421
Kunihiro Suzuki
422
There are q kinds of factor loadings as
a1 , a2 , , aq Each factor loading has p elements and are expressed as a11 a21 a1 a p1 a12 a22 a2 ap2 a1q a2 q aq a pq
The factor loading is symbolically expressed as ak
The elements are expressed as ak : a k
Independent load has the same structure as the data x . It has p kinds of variables and each variable has n data. We express them as
e1 , e2 , , e p This can be symbolically expressed as e for 1,2,
,p
Factor Analysis
423
The detailed expression to denote the each data value is given by e1 : e11 , e21 ,
, en1
e2 : e12 , e22 ,
, en 2
e p : e1 p , e2 p ,
, enp
This can be expressed symbolically as
ei Two kinds of independent loadings are symbolically expressed as
e , e We have k kinds of eigenvalues. However, we utilize up to the q number of eigenvalue in the factor analysis given by
1 , 2 , , q
5.2. Basic Equation The basic equation for the factor analysis is given by ui1 a11 a12 ui 2 f a21 f a22 i1 i2 uip a p1 ap2
a1q ei1 a2 q ei 2 fiq a pq eip
for i 1, 2,
,n
This equation should hold the following equation given by 1 r21 rp1
r12 1 rp 2
r1 p r2 p q a1k k 1 1
a2 k
2 a1k Se1 a2 k 0 a pk a pk 0ix1
0 Se2 2
0
0 0 2 Sep
Kunihiro Suzuki
424
5.3. Normalization of Variables

The average and variance are given by

$$\mu_\alpha = \frac{1}{n}\sum_{i=1}^{n} x_{i\alpha},\qquad \sigma_\alpha^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{i\alpha} - \mu_\alpha\right)^2$$

When the data are regarded as a population set, the variance is given by

$$\sigma_\alpha^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_{i\alpha} - \mu_\alpha\right)^2$$

The normalized variable is given by

$$u_{i\alpha} = \frac{x_{i\alpha} - \mu_\alpha}{\sigma_\alpha}$$
5.4. Correlation Factor Matrix

The correlation factors are evaluated as

$$r_{\alpha\beta} = \frac{1}{n-1}\sum_{i=1}^{n} u_{i\alpha} u_{i\beta}$$

When the data are ones in a population, the corresponding correlation factor is given by

$$r_{\alpha\beta} = \frac{1}{n}\sum_{i=1}^{n} u_{i\alpha} u_{i\beta}$$

We can then obtain the correlation factor matrix

$$R = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p}\\ r_{21} & 1 & \cdots & r_{2p}\\ \vdots & & & \vdots\\ r_{p1} & r_{p2} & \cdots & 1\end{pmatrix}$$
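As a minimal sketch of Subsections 5.3 and 5.4 in Python with NumPy (the function name and the assumption that a hypothetical (n, p) data matrix X is supplied are ours, not the book's):

```python
import numpy as np

def normalize_and_correlate(X):
    """Normalize the columns of X and build the correlation factor matrix R."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)     # use ddof=0 for a population data set
    U = (X - mu) / sigma              # normalized variables u_{i alpha}
    R = (U.T @ U) / (len(X) - 1)      # correlation factor matrix
    return U, R
```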
5.5. Factor Loading

Once the correlation matrix is obtained, it can be expanded using the related eigenvalues and eigenvectors as

$$R = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p}\\ r_{21} & 1 & \cdots & r_{2p}\\ \vdots & & & \vdots\\ r_{p1} & r_{p2} & \cdots & 1\end{pmatrix} = \lambda_{1,1} v_{1,1} v_{1,1}^{T} + \lambda_{2,1} v_{2,1} v_{2,1}^{T} + \cdots + \lambda_{p,1} v_{p,1} v_{p,1}^{T}$$

where the second subscript denotes the iteration step. As a first step, we approximate the independent-factor variances by the residual of the q-term expansion:

$$\begin{pmatrix} S_{e_1,1}^2 & 0 & \cdots & 0\\ 0 & S_{e_2,1}^2 & \cdots & 0\\ \vdots & & \ddots & \vdots\\ 0 & 0 & \cdots & S_{e_p,1}^2\end{pmatrix} = R - \sum_{k=1}^{q}\lambda_{k,1} v_{k,1} v_{k,1}^{T}$$

We then form the second approximated correlation matrix below, obtain the related eigenvalues and eigenvectors, and expand the matrix:

$$\begin{pmatrix} 1 - S_{e_1,1}^2 & r_{12} & \cdots & r_{1p}\\ r_{21} & 1 - S_{e_2,1}^2 & \cdots & r_{2p}\\ \vdots & & & \vdots\\ r_{p1} & r_{p2} & \cdots & 1 - S_{e_p,1}^2\end{pmatrix} = \lambda_{1,2} v_{1,2} v_{1,2}^{T} + \lambda_{2,2} v_{2,2} v_{2,2}^{T} + \cdots + \lambda_{p,2} v_{p,2} v_{p,2}^{T}$$

The second-order independent factor is then given by

$$\begin{pmatrix} S_{e_1,2}^2 & 0 & \cdots & 0\\ 0 & S_{e_2,2}^2 & \cdots & 0\\ \vdots & & \ddots & \vdots\\ 0 & 0 & \cdots & S_{e_p,2}^2\end{pmatrix} = \begin{pmatrix} 1 - S_{e_1,1}^2 & r_{12} & \cdots & r_{1p}\\ r_{21} & 1 - S_{e_2,1}^2 & \cdots & r_{2p}\\ \vdots & & & \vdots\\ r_{p1} & r_{p2} & \cdots & 1 - S_{e_p,1}^2\end{pmatrix} - \sum_{k=1}^{q}\lambda_{k,2} v_{k,2} v_{k,2}^{T}$$
We evaluate the deviation of the independent factor, ΔS_e^2, as

$$\Delta S_e^2 = \begin{pmatrix} S_{e_1,2}^2 - S_{e_1,1}^2\\ S_{e_2,2}^2 - S_{e_2,1}^2\\ \vdots\\ S_{e_p,2}^2 - S_{e_p,1}^2\end{pmatrix}$$

We then evaluate

$$\frac{1}{p}\sum_{\alpha=1}^{p}\left(S_{e_\alpha,2}^2 - S_{e_\alpha,1}^2\right)^2$$

We repeat the above step until this quantity falls below a criterion ε_c, that is,

$$\frac{1}{p}\sum_{\alpha=1}^{p}\left(S_{e_\alpha,m}^2 - S_{e_\alpha,m-1}^2\right)^2 \le \varepsilon_c$$

We then obtain the factor loadings as

$$a_1 = \begin{pmatrix} a_{11}\\ a_{21}\\ \vdots\\ a_{p1}\end{pmatrix} = \sqrt{\lambda_{1,m}}\, v_{1,m},\quad a_2 = \begin{pmatrix} a_{12}\\ a_{22}\\ \vdots\\ a_{p2}\end{pmatrix} = \sqrt{\lambda_{2,m}}\, v_{2,m},\quad \dots,\quad a_q = \begin{pmatrix} a_{1q}\\ a_{2q}\\ \vdots\\ a_{pq}\end{pmatrix} = \sqrt{\lambda_{q,m}}\, v_{q,m}$$
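The iteration above can be sketched in Python with NumPy. This is a rough illustration, not the book's own code; the function name, the tolerance argument eps_c, and the iteration cap are hypothetical, and the correlation factor matrix R is assumed to be given:

```python
import numpy as np

def factor_loadings(R, q, eps_c=1e-6, max_iter=200):
    """Iterated eigenvalue expansion of Subsection 5.5 (a sketch).

    R: (p, p) correlation factor matrix; q: number of factors.
    Returns the (p, q) loading matrix A and the variances S_e^2.
    """
    p = R.shape[0]
    Rm = R.copy()
    Se2_prev = np.zeros(p)
    for _ in range(max_iter):
        lam, V = np.linalg.eigh(Rm)          # eigenvalues in ascending order
        lam, V = lam[::-1], V[:, ::-1]       # reorder to descending
        diag_q = np.sum(V[:, :q] ** 2 * lam[:q], axis=1)
        Se2 = np.diag(Rm) - diag_q           # S_e^2 at this step
        if np.mean((Se2 - Se2_prev) ** 2) <= eps_c:
            break                            # convergence criterion above
        Se2_prev = Se2
        Rm = R.copy()
        np.fill_diagonal(Rm, 1.0 - Se2)      # reduced correlation matrix
    A = V[:, :q] * np.sqrt(np.clip(lam[:q], 0.0, None))  # a_k = sqrt(lam_k) v_k
    return A, Se2
```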
5.6. Varimax Rotation

Factor loadings and factor points have an uncertainty with respect to rotation. We decide the rotation angle as follows. The elements rotated by an angle θ are given by

$$a_k' = \begin{pmatrix} a_{1k}\cos\theta + a_{1l}\sin\theta\\ a_{2k}\cos\theta + a_{2l}\sin\theta\\ \vdots\\ a_{pk}\cos\theta + a_{pl}\sin\theta\end{pmatrix},\qquad a_l' = \begin{pmatrix} -a_{1k}\sin\theta + a_{1l}\cos\theta\\ -a_{2k}\sin\theta + a_{2l}\cos\theta\\ \vdots\\ -a_{pk}\sin\theta + a_{pl}\cos\theta\end{pmatrix}$$

We then evaluate

$$S = \frac{1}{p}\sum_{\alpha=1}^{p}\left(\frac{a_{\alpha k}'}{d_\alpha}\right)^4 - \left[\frac{1}{p}\sum_{\alpha=1}^{p}\left(\frac{a_{\alpha k}'}{d_\alpha}\right)^2\right]^2 + \frac{1}{p}\sum_{\alpha=1}^{p}\left(\frac{a_{\alpha l}'}{d_\alpha}\right)^4 - \left[\frac{1}{p}\sum_{\alpha=1}^{p}\left(\frac{a_{\alpha l}'}{d_\alpha}\right)^2\right]^2$$

The angle θ that maximizes S is given by

$$\tan 4\theta = \frac{G_2}{G_4},\qquad G_2\sin 4\theta > 0$$

where

$$G_1 = \sum_{\alpha=1}^{p}\left[\left(\frac{a_{\alpha k}}{d_\alpha}\right)^4 + \left(\frac{a_{\alpha l}}{d_\alpha}\right)^4\right] - \frac{1}{p}\left[\left(\sum_{\alpha=1}^{p}\left(\frac{a_{\alpha k}}{d_\alpha}\right)^2\right)^2 + \left(\sum_{\alpha=1}^{p}\left(\frac{a_{\alpha l}}{d_\alpha}\right)^2\right)^2\right]$$

$$G_2 = 4\left\{\sum_{\alpha=1}^{p}\left[\left(\frac{a_{\alpha k}}{d_\alpha}\right)^3\frac{a_{\alpha l}}{d_\alpha} - \frac{a_{\alpha k}}{d_\alpha}\left(\frac{a_{\alpha l}}{d_\alpha}\right)^3\right] - \frac{1}{p}\sum_{\alpha=1}^{p}\frac{a_{\alpha k} a_{\alpha l}}{d_\alpha^2}\left[\sum_{\alpha=1}^{p}\left(\frac{a_{\alpha k}}{d_\alpha}\right)^2 - \sum_{\alpha=1}^{p}\left(\frac{a_{\alpha l}}{d_\alpha}\right)^2\right]\right\}$$

$$G_3 = 4\left\{3\sum_{\alpha=1}^{p}\left(\frac{a_{\alpha k}}{d_\alpha}\right)^2\left(\frac{a_{\alpha l}}{d_\alpha}\right)^2 - \frac{2}{p}\left(\sum_{\alpha=1}^{p}\frac{a_{\alpha k} a_{\alpha l}}{d_\alpha^2}\right)^2 - \frac{1}{p}\sum_{\alpha=1}^{p}\left(\frac{a_{\alpha k}}{d_\alpha}\right)^2\sum_{\alpha=1}^{p}\left(\frac{a_{\alpha l}}{d_\alpha}\right)^2\right\}$$
$$G_4 = G_1 - \frac{1}{2} G_3$$

We perform the above process for every combination of k and l. The rotated factor loading is denoted as a_k'. We should be careful about the angle conversion, since the condition G_2 sin 4θ > 0 must hold at the maximum:

Condition 1        Condition 2   Conversion     Range
G_2 sin 4θ > 0     none          none           −π/8 ≤ θ ≤ π/8
G_2 sin 4θ < 0     sin 4θ < 0    θ → θ + π/4    π/8 ≤ θ ≤ 3π/8
G_2 sin 4θ < 0     sin 4θ > 0    θ → θ − π/4    −3π/8 ≤ θ ≤ −π/8
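The rotation criterion can also be checked numerically instead of with the closed-form angle. The sketch below (our own illustration; the function name is hypothetical, and d_α is assumed to be the square root of the communality Σ_k a_{αk}²) grid-searches θ for one pair of columns:

```python
import numpy as np

def varimax_pair(A, k, l, steps=3601):
    """Rotate columns k and l of the loading matrix A so that the
    criterion S of this subsection is maximized (brute-force grid search)."""
    d = np.sqrt(np.sum(A ** 2, axis=1))      # assumed communality sqrt d_alpha
    x, y = A[:, k] / d, A[:, l] / d
    p = len(x)
    best_S, best_t = -np.inf, 0.0
    for t in np.linspace(-np.pi / 4, np.pi / 4, steps):
        xr = x * np.cos(t) + y * np.sin(t)   # rotated a'_k / d
        yr = -x * np.sin(t) + y * np.cos(t)  # rotated a'_l / d
        S = (np.sum(xr ** 4) - np.sum(xr ** 2) ** 2 / p
             + np.sum(yr ** 4) - np.sum(yr ** 2) ** 2 / p) / p
        if S > best_S:
            best_S, best_t = S, t
    c, s = np.cos(best_t), np.sin(best_t)
    A = A.copy()
    A[:, k], A[:, l] = (A[:, k] * c + A[:, l] * s,
                        -A[:, k] * s + A[:, l] * c)
    return A
```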
5.7. Factor Point

The coefficients for the factor k can be obtained from

$$\begin{pmatrix} h_{1k}\\ h_{2k}\\ \vdots\\ h_{pk}\end{pmatrix} = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p}\\ r_{21} & 1 & \cdots & r_{2p}\\ \vdots & & & \vdots\\ r_{p1} & r_{p2} & \cdots & 1\end{pmatrix}^{-1}\begin{pmatrix} a_{1k}\\ a_{2k}\\ \vdots\\ a_{pk}\end{pmatrix}$$

We then obtain the factor point associated with the factor k as

$$\hat f_{ik} = h_{1k} x_{i1} + h_{2k} x_{i2} + \cdots + h_{pk} x_{ip}$$
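In code, this step is one linear solve per factor. A minimal sketch, assuming R is the correlation factor matrix, A the (p, q) loading matrix, and U the (n, p) normalized data (all hypothetical names):

```python
import numpy as np

def factor_points(R, A, U):
    """Coefficients H = R^{-1} A (columns h_k) and factor points F = U H."""
    H = np.linalg.solve(R, A)   # solves R H = A column by column
    return U @ H                # (n, q) matrix of estimated f_ik
```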
5.8. Independent Factor

The independent factor is obtained as
$$\begin{pmatrix} e_{i1}\\ e_{i2}\\ \vdots\\ e_{ip}\end{pmatrix} = \begin{pmatrix} x_{i1}\\ x_{i2}\\ \vdots\\ x_{ip}\end{pmatrix} - \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1q}\\ a_{21} & a_{22} & \cdots & a_{2q}\\ \vdots & & & \vdots\\ a_{p1} & a_{p2} & \cdots & a_{pq}\end{pmatrix}\begin{pmatrix} f_{i1}\\ f_{i2}\\ \vdots\\ f_{iq}\end{pmatrix} \quad\text{for } i = 1, 2, \dots, n$$
This can be expressed in total as

$$\begin{pmatrix} e_{11} & e_{12} & \cdots & e_{1p}\\ e_{21} & e_{22} & \cdots & e_{2p}\\ \vdots & & & \vdots\\ e_{n1} & e_{n2} & \cdots & e_{np}\end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1p}\\ \vdots & & \vdots\\ x_{n1} & \cdots & x_{np}\end{pmatrix} - \begin{pmatrix} f_{11} & \cdots & f_{1q}\\ \vdots & & \vdots\\ f_{n1} & \cdots & f_{nq}\end{pmatrix}\begin{pmatrix} a_{11} & a_{21} & \cdots & a_{p1}\\ a_{12} & a_{22} & \cdots & a_{p2}\\ \vdots & & & \vdots\\ a_{1q} & a_{2q} & \cdots & a_{pq}\end{pmatrix}$$
5.9. Appreciation of the Results

The score of each item is interpreted through the q factors. The relationship between the factors and each item is expressed by a_{αk}: for example, the elements for item 1 are a_{11}, a_{12}, …, a_{1q}, and the most important factor for that item is the one with the maximum value among a_{11}, a_{12}, …, a_{1q}. The point of each member for a factor is f_{ik}. Therefore, we should set the improvement target where the loading a_{αk} is large but the factor point f_{ik} is low.
Chapter 12

CLUSTER ANALYSIS

ABSTRACT

One subject (a person, in a special case) has various kinds of items. We define a distance between the items using the data and evaluate the similarity. Based on the similarity, we form clusters of the subjects. We show various methods to evaluate the similarity.
Keywords: cluster analysis, arithmetic distance, tree diagram, chain effect, Ward method
1. INTRODUCTION

Let us consider the scores of an examination of various subjects. The scores are scattered in general, and we can expect that some members resemble each other; we want to form groups based on the results. In another case, we may have data on which products each customer buys. It is convenient to form groups and deliver advertisements suited to the characteristics of each group. Therefore, we want procedures for obtaining such groups, which we study in this chapter.
2. SIMPLE EXAMPLE FOR CLUSTERING

Cluster analysis defines a distance between two categories and evaluates their similarity by the distance. We have score data for English and mathematics as shown in Table 1. The distance between members A and B is denoted as d(A, B), and it is defined as

$$d(A, B) = \sqrt{\left(x_{A1} - x_{B1}\right)^2 + \left(x_{A2} - x_{B2}\right)^2} = \sqrt{\left(67 - 43\right)^2 + \left(64 - 76\right)^2} = 26.8 \tag{1}$$

Table 1. Scores of English and Mathematics
Member ID   English x1   Mathematics x2
A           67           64
B           43           76
C           45           72
D           28           92
E           77           64
F           59           40
G           28           76
H           28           60
I           45           12
J           47           80
We can evaluate the distance for the other combinations of members and obtain the data shown in Table 2. The minimum distance among these data is the distance between members B and C. Therefore, we form a cluster of B and C first. After forming a cluster, we modify the data for the cluster. The simplest modification is to use the average:

$$x_{BC1} = \frac{x_{B1} + x_{C1}}{2},\qquad x_{BC2} = \frac{x_{B2} + x_{C2}}{2} \tag{2}$$
Table 2. Distances between members

     A      B      C      D      E      F      G      H      I
B   26.8
C   23.4    4.5
D   48.0   21.9   26.2
E   10.0   36.1   33.0   56.4
F   25.3   39.4   34.9   60.5   30.0
G   40.8   15.0   17.5   16.0   50.4   47.5
H   39.2   21.9   20.8   32.0   49.2   36.9   16.0
I   56.5   64.0   60.0   81.8   61.1   31.3   66.2   50.9
J   25.6    5.7    8.2   22.5   34.0   41.8   19.4   27.6   68.0

Using these data, we evaluate the distances between the members, which are shown in Table 3.
Table 3. The distance between members for the second step

Member ID   English x1   Mathematics x2
A           67           64
B,C         44           74
D           28           92
E           77           64
F           59           40
G           28           76
H           28           60
I           45           12
J           47           80

       A     B,C     D      E      F      G      H      I
B,C  25.1
D    48.0   24.1
E    10.0   34.5   56.4
F    25.3   37.2   60.5   30.0
G    40.8   16.1   16.0   50.4   47.5
H    39.2   21.3   32.0   49.2   36.9   16.0
I    56.5   62.0   81.8   61.1   31.3   66.2   50.9
J    25.6    6.7   22.5   34.0   41.8   19.4   27.6   68.0
In this case, the distance between the cluster BC and J is the minimum. Therefore, we form the cluster BCJ and modify the data for the cluster as

$$x_{BCJ1} = \frac{x_{B1} + x_{C1} + x_{J1}}{3},\qquad x_{BCJ2} = \frac{x_{B2} + x_{C2} + x_{J2}}{3} \tag{3}$$
Using these data, we evaluate the distances between the members, which are shown in Table 4.

Table 4. The distance between members for the third step

Member ID   English x1   Mathematics x2
A           67           64
B,C,J       45           76
D           28           92
E           77           64
F           59           40
G           28           76
H           28           60
I           45           12

         A    B,C,J    D      E      F      G      H
B,C,J  25.1
D      48.0   23.3
E      10.0   34.2   56.4
F      25.3   38.6   60.5   30.0
G      40.8   17.0   16.0   50.4   47.5
H      39.2   23.3   32.0   49.2   36.9   16.0
I      56.5   64.0   81.8   61.1   31.3   66.2   50.9
In this case the distance between members A and E is the minimum. Therefore, we form a cluster AE and modify the data as

$$x_{AE1} = \frac{x_{A1} + x_{E1}}{2},\qquad x_{AE2} = \frac{x_{A2} + x_{E2}}{2} \tag{4}$$

Using these data, we evaluate the distances between the members, which are shown in Table 5.
Table 5. The distance between members for the fourth step

Member ID   English x1   Mathematics x2
A,E         72           64
B,C,J       45           76
D           28           92
F           59           40
G           28           76
H           28           60
I           45           12

        A,E   B,C,J    D      F      G      H
B,C,J  29.5
D      52.2   23.3
F      27.3   38.6   60.5
G      45.6   17.0   16.0   47.5
H      44.2   23.3   32.0   36.9   16.0
I      58.6   64.0   81.8   31.3   66.2   50.9
In this case, the distances among D, G, and H are the minimum, and we form a cluster DGH and modify the related data as

$$x_{DGH1} = \frac{x_{D1} + x_{G1} + x_{H1}}{3},\qquad x_{DGH2} = \frac{x_{D2} + x_{G2} + x_{H2}}{3} \tag{5}$$
Using these data, we evaluate the distances between the members, which are shown in Table 6.

Table 6. The distance between members for the fifth step

Member ID   English x1   Mathematics x2
A,E         72           64
B,C,J       45           76
D,G,H       28           76
F           59           40
I           45           12

        A,E    B,C,J   D,G,H    F
B,C,J  29.55
D,G,H  45.61  17.00
F      27.29  38.63   47.51
I      58.59  64.00   66.22   31.30
In this case, the distance between the cluster BCJ and the cluster DGH is the minimum, and we form a cluster BCJDGH and modify the data as

$$x_{BCJDGH1} = \frac{x_{B1} + x_{C1} + x_{J1} + x_{D1} + x_{G1} + x_{H1}}{6},\qquad x_{BCJDGH2} = \frac{x_{B2} + x_{C2} + x_{J2} + x_{D2} + x_{G2} + x_{H2}}{6} \tag{6}$$
Using these data, we evaluate the distances between the members, which are shown in Table 7.

Table 7. The distance between members for the sixth step

Member ID      English x1   Mathematics x2
A,E            72           64
B,C,J,D,G,H    36.5         76
F              59           40
I              45           12

              A,E    B,C,J,D,G,H     F
B,C,J,D,G,H  37.47
F            27.29      42.45
I            58.59      64.56      31.30
Cluster Analysis
435
In this case the distance between the cluster AE and F is the minimum. We then form a cluster AEF and modify the data as

$$x_{AEF1} = \frac{x_{A1} + x_{E1} + x_{F1}}{3},\qquad x_{AEF2} = \frac{x_{A2} + x_{E2} + x_{F2}}{3} \tag{7}$$
Table 8. The distance between members for the seventh step

Member ID      English x1   Mathematics x2
A,E,F          67.67        56.00
B,C,J,D,G,H    36.50        76.00
I              45.00        12.00

              A,E,F   B,C,J,D,G,H
B,C,J,D,G,H   37.03
I             49.50      64.56
Using these data, we evaluate the distances between the members, which are shown in Table 8. In this case, the distance between the cluster AEF and the cluster BCJDGH is the minimum. We hence form a cluster AEFBCJDGH and evaluate the data for the cluster. Finally, we form the last cluster as shown in Table 9.

Table 9. The distance between members for the eighth step

Member ID            English x1   Mathematics x2
A,E,F,B,C,J,D,G,H    46.89        69.33
I                    45.00        12.00

                     A,E,F,B,C,J,D,G,H
I                          57.36

After the above process, we can form a tree structure as shown in Figure 1. We can obtain some groups if we set a critical distance. For example, to obtain 4 groups, we set the critical distance to 25 and obtain the groups below.

[B,C,J,D,G,H], [A,E], [F], [I]
Figure 1. Tree diagram.
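The whole walkthrough above can be reproduced with a short script. The sketch below is our own illustration using the Table 1 scores; a cluster is represented by the average of all its raw member scores, as in Eqs. (2)-(7). Note that the code merges pairwise, so the three-way tie among D, G, and H is resolved in two successive steps, and the intermediate order can differ slightly from the tables:

```python
import numpy as np

# Scores from Table 1: (English x1, Mathematics x2).
scores = {'A': (67, 64), 'B': (43, 76), 'C': (45, 72), 'D': (28, 92),
          'E': (77, 64), 'F': (59, 40), 'G': (28, 76), 'H': (28, 60),
          'I': (45, 12), 'J': (47, 80)}
members = {k: [k] for k in scores}              # cluster name -> member list
centroid = lambda c: np.mean([scores[m] for m in members[c]], axis=0)

while len(members) > 1:
    names = sorted(members)
    d, a, b = min((float(np.linalg.norm(centroid(a) - centroid(b))), a, b)
                  for i, a in enumerate(names) for b in names[i + 1:])
    print(f'merge {a} and {b} at distance {d:.1f}')
    members[a + b] = members.pop(a) + members.pop(b)
```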
There are various definitions of distances; the process above is the simplest among them. We should mention one point about the above process. It is valid when the two variables x1 and x2 are similar, which means their magnitudes are of almost the same order. However, when they are quite different variables whose magnitudes differ significantly, the distance is determined by the larger one, and we effectively consider only one variable. In such a case, we need to use the normalized variables

$$z_{i1} = \frac{x_{i1} - \bar x_1}{\sigma_{x1}},\qquad z_{i2} = \frac{x_{i2} - \bar x_2}{\sigma_{x2}} \tag{8}$$

where

$$\bar x_1 = \frac{1}{n}\sum_{i} x_{i1},\qquad \sigma_{x1} = \sqrt{\frac{1}{n}\sum_{i}\left(x_{i1} - \bar x_1\right)^2} \tag{9}$$

$$\bar x_2 = \frac{1}{n}\sum_{i} x_{i2},\qquad \sigma_{x2} = \sqrt{\frac{1}{n}\sum_{i}\left(x_{i2} - \bar x_2\right)^2} \tag{10}$$

and n is the data number.
3. THE RELATIONSHIP BETWEEN ARITHMETIC DISTANCE AND VARIANCE

We evaluate the relationship between the arithmetic distance and the variance. The square of the distance between members a and b is modified as

$$d^2 = \left(x_{a1} - x_{b1}\right)^2 + \left(x_{a2} - x_{b2}\right)^2 = \left[\left(x_{a1} - \bar x_1\right) - \left(x_{b1} - \bar x_1\right)\right]^2 + \left[\left(x_{a2} - \bar x_2\right) - \left(x_{b2} - \bar x_2\right)\right]^2 \tag{11}$$

where $\bar x_1 = \left(x_{a1} + x_{b1}\right)/2$ and $\bar x_2 = \left(x_{a2} + x_{b2}\right)/2$ are the averages of the pair. Since

$$\left(x_{a1} - \bar x_1\right)^2 = \left(x_{b1} - \bar x_1\right)^2 = \frac{1}{4}\left(x_{a1} - x_{b1}\right)^2 \tag{12}$$

and similarly

$$\left(x_{a2} - \bar x_2\right)^2 = \left(x_{b2} - \bar x_2\right)^2 = \frac{1}{4}\left(x_{a2} - x_{b2}\right)^2 \tag{13}$$

the variance of the pair (with divisor n − 1 = 1) becomes

$$S_{ab}^2 = \left(x_{a1} - \bar x_1\right)^2 + \left(x_{b1} - \bar x_1\right)^2 + \left(x_{a2} - \bar x_2\right)^2 + \left(x_{b2} - \bar x_2\right)^2 = \frac{1}{2}\left[\left(x_{a1} - x_{b1}\right)^2 + \left(x_{a2} - x_{b2}\right)^2\right] = \frac{1}{2} d^2 \tag{14}$$

This can be rewritten as

$$d^2 = 2 S_{ab}^2 \tag{15}$$
That is, the squared arithmetic distance is twice the variance of the pair.
4. VARIOUS DISTANCES

The distance is basically defined by

$$d_{ab} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{ia} - x_{ib}\right)^2} \tag{16}$$

However, there are various definitions of the distance; we briefly mention some of them.
When the items are not equally important, we multiply by a weight factor w_i, and the corresponding distance is given by

$$d_{ab} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} w_i \left(x_{ia} - x_{ib}\right)^2} \tag{17}$$

This is called the weighted distance. The generalized distance is given by

$$d_{ab} = \left[\frac{1}{n}\sum_{i=1}^{n}\left|x_{ia} - x_{ib}\right|^{\lambda}\right]^{1/\lambda} \tag{18}$$

The maximum deviation is apt to dominate the distance with increasing λ.
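A small sketch of Eqs. (16)-(18); the two score vectors, the weights, and the exponent λ below are hypothetical examples, not values from the text:

```python
import numpy as np

xa, xb = np.array([67.0, 64.0]), np.array([43.0, 76.0])
w = np.array([1.0, 0.5])                                   # hypothetical weights
lam = 4.0                                                  # hypothetical exponent

d_plain = np.sqrt(np.mean((xa - xb) ** 2))                 # Eq. (16)
d_weighted = np.sqrt(np.mean(w * (xa - xb) ** 2))          # Eq. (17)
d_general = np.mean(np.abs(xa - xb) ** lam) ** (1 / lam)   # Eq. (18)
```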
5. CHAIN EFFECT

The chain effect means that one cluster absorbs members one after another in order. In that special case, the number of groups is always two at any stage of the distance, which is not appropriate for our purpose. The simple procedure above is apt to suffer from this chain effect, although the reason is not clear.
6. WARD METHOD FOR TWO VARIABLES

The Ward method forms a cluster from the standpoint that the increase in the sum of squares is minimized. This method is regarded as one that overcomes the chain effect, and hence it is frequently used.

Table 10. Scores of Japanese and English for members
Member ID   Japanese x1   English x2
1           5             1
2           4             2
3           1             5
4           5             4
5           5             5
Let us consider the distance between members 1 and 2. We evaluate the means of the Japanese and English scores for members 1 and 2 as shown in Table 11, and evaluate the sum of squares given by

$$K_{12} = \sum_{i=1}^{2}\sum_{k=1}^{2}\left(x_{ik} - \bar x_k\right)^2 = \left(5 - 4.5\right)^2 + \left(4 - 4.5\right)^2 + \left(1 - 1.5\right)^2 + \left(2 - 1.5\right)^2 = 1.00 \tag{19}$$
We can evaluate the sum of squares for the other combinations of members as shown in Table 12. The minimum value is that for the combination of members 4 and 5, and we form a cluster of 4 and 5, denoted as C1.

Table 11. Scores of Japanese and English for members 1 and 2

Member ID   Japanese x1   English x2
1           5             1
2           4             2
Mean        4.5           1.5
Table 12. Sum of squares for member combinations

ID    1      2      3     4
2    1
3   16      9
4    4.5    2.5    8.5
5    8      5      8      0.5
After forming a cluster, we evaluate the distance between the cluster C1 and member 1. The corresponding data are shown in Table 13.

Table 13. Scores of the cluster C1 and member 1

Cluster ID   Member ID   Japanese x1   English x2
             1           5             1
C1           4           5             4
             5           5             5
Average                  5.00          3.33
The mean of Japanese for this group of the cluster and the member is 5.00, and that of English is 3.33. The sum of squares K_{C1,1} is evaluated as

$$K_{C_1 1} = \sum_{i=1}^{3}\sum_{k=1}^{2}\left(x_{ik} - \bar x_k\right)^2 = \left(5 - 5.00\right)^2 + \left(5 - 5.00\right)^2 + \left(5 - 5.00\right)^2 + \left(1 - 3.33\right)^2 + \left(4 - 3.33\right)^2 + \left(5 - 3.33\right)^2 = 8.67 \tag{20}$$

We do not use this sum directly, but the deviation of the sum, ΔK_{C1,1}, given by

$$\Delta K_{C_1 1} = K_{C_1 1} - K_{C_1} - K_1 = 8.67 - 0.5 - 0 = 8.17 \tag{21}$$
where K_1 is 0 since there is only one datum. The variance deviations for all combinations are shown in Table 14. The minimum variance deviation is that between members 1 and 2. Therefore, we form a cluster of members 1 and 2, and denote it as C2.

Table 14. Variance deviation for the second step

ID     1      2      3
2     1
3    16      9
C1    8.17   4.83  10.83
We then evaluate the variance deviations between C1, C2, and member 3, which are shown in Table 15. The minimum variance deviation is between the clusters C1 and C2, and we form the cluster of C1 and C2. The corresponding tree diagram is shown in Figure 2.

Table 15. Variance deviation for the third step

ID     3      C1
C1   10.83
C2   16.33   9.25
Figure 2. Tree diagram with Ward method.
7. WARD METHOD WITH MANY VARIABLES

We can easily extend the procedure for two variables to that for more than two variables. Let us consider that we form a cluster lm by merging a cluster l and a cluster m. The clusters l and m have the data denoted as x_{ilk} and x_{imk}, and the merged member numbers are denoted as n_l and n_m. We then obtain the variables below.

$$K_l = \sum_{i=1}^{n}\sum_{k=1}^{n_l}\left(x_{ilk} - \bar x_{il}\right)^2 \tag{22}$$

$$K_m = \sum_{i=1}^{n}\sum_{k=1}^{n_m}\left(x_{imk} - \bar x_{im}\right)^2 \tag{23}$$

The merged variance is expressed by

$$K_{lm} = K_l + K_m + \Delta K_{lm} \tag{24}$$

The deviation of the variance is given by

$$\Delta K_{lm} = \frac{n_l n_m}{n_l + n_m}\sum_{i=1}^{n}\left(\bar x_{il} - \bar x_{im}\right)^2 \tag{25}$$

which is a very important rule and is proved in the following. The sum of squares of the merged cluster is given by

$$K_{lm} = \sum_{i=1}^{n}\left[\sum_{k=1}^{n_l}\left(x_{ilk} - \bar x_{ilm}\right)^2 + \sum_{k=1}^{n_m}\left(x_{imk} - \bar x_{ilm}\right)^2\right] \tag{26}$$

where

$$\bar x_{ilm} = \frac{\sum_{k=1}^{n_l} x_{ilk} + \sum_{k=1}^{n_m} x_{imk}}{n_l + n_m} \tag{27}$$

We modify this equation to relate it to K_l and K_m as

$$\begin{aligned} K_{lm} &= \sum_{i=1}^{n}\left\{\sum_{k=1}^{n_l}\left[\left(x_{ilk} - \bar x_{il}\right) + \left(\bar x_{il} - \bar x_{ilm}\right)\right]^2 + \sum_{k=1}^{n_m}\left[\left(x_{imk} - \bar x_{im}\right) + \left(\bar x_{im} - \bar x_{ilm}\right)\right]^2\right\}\\ &= \sum_{i=1}^{n}\left[\sum_{k=1}^{n_l}\left(x_{ilk} - \bar x_{il}\right)^2 + \sum_{k=1}^{n_m}\left(x_{imk} - \bar x_{im}\right)^2\right] + \sum_{i=1}^{n}\left[n_l\left(\bar x_{il} - \bar x_{ilm}\right)^2 + n_m\left(\bar x_{im} - \bar x_{ilm}\right)^2\right]\\ &= K_l + K_m + \sum_{i=1}^{n}\left[n_l\left(\bar x_{il} - \bar x_{ilm}\right)^2 + n_m\left(\bar x_{im} - \bar x_{ilm}\right)^2\right] \end{aligned} \tag{28}$$

where we utilize

$$\sum_{k=1}^{n_l}\left(x_{ilk} - \bar x_{il}\right)\left(\bar x_{il} - \bar x_{ilm}\right) = \left(\bar x_{il} - \bar x_{ilm}\right)\sum_{k=1}^{n_l}\left(x_{ilk} - \bar x_{il}\right) = 0 \tag{29}$$

$$\sum_{k=1}^{n_m}\left(x_{imk} - \bar x_{im}\right)\left(\bar x_{im} - \bar x_{ilm}\right) = \left(\bar x_{im} - \bar x_{ilm}\right)\sum_{k=1}^{n_m}\left(x_{imk} - \bar x_{im}\right) = 0 \tag{30}$$

Furthermore,

$$\bar x_{il} - \bar x_{ilm} = \frac{\sum_{k=1}^{n_l} x_{ilk}}{n_l} - \frac{\sum_{k=1}^{n_l} x_{ilk} + \sum_{k=1}^{n_m} x_{imk}}{n_l + n_m} = \frac{n_m\sum_{k=1}^{n_l} x_{ilk} - n_l\sum_{k=1}^{n_m} x_{imk}}{n_l\left(n_l + n_m\right)} = \frac{n_m\left(\bar x_{il} - \bar x_{im}\right)}{n_l + n_m} \tag{31}$$

Similarly, we obtain

$$\bar x_{im} - \bar x_{ilm} = \frac{n_l\left(\bar x_{im} - \bar x_{il}\right)}{n_l + n_m} \tag{32}$$

Therefore, we obtain

$$n_l\left(\bar x_{il} - \bar x_{ilm}\right)^2 + n_m\left(\bar x_{im} - \bar x_{ilm}\right)^2 = \frac{n_l n_m^2 + n_m n_l^2}{\left(n_l + n_m\right)^2}\left(\bar x_{il} - \bar x_{im}\right)^2 = \frac{n_l n_m}{n_l + n_m}\left(\bar x_{il} - \bar x_{im}\right)^2 \tag{33}$$

Finally, Eq. (25) is proved. The distance is normalized by the data number n and is given by

$$d_{lm} = \sqrt{\frac{K_{lm}}{n}} \tag{34}$$

The average in the cluster is given by

$$\bar x_{ilm} = \frac{\sum_{k=1}^{n_l} x_{ilk} + \sum_{k=1}^{n_m} x_{imk}}{n_l + n_m} \tag{35}$$
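The procedure can be sketched directly from the ΔK rule. The function below is a hypothetical implementation of ours (not the book's code) that clusters the columns of a data matrix X; applied to the Table 16 scores of the next section, it should first merge items b and c with ΔK = 2 and d = 0.316:

```python
import numpy as np

def ward_cluster(X, labels):
    """Ward clustering of the columns of X (rows = members) via Eq. (25)."""
    n = X.shape[0]
    # each cluster: label -> (n_l, per-member mean vector x_bar_il, K_l)
    clusters = {lab: (1, X[:, j].astype(float), 0.0)
                for j, lab in enumerate(labels)}
    while len(clusters) > 1:
        keys = list(clusters)
        pairs = [((clusters[a][0] * clusters[b][0]
                   / (clusters[a][0] + clusters[b][0]))
                  * np.sum((clusters[a][1] - clusters[b][1]) ** 2), a, b)
                 for i, a in enumerate(keys) for b in keys[i + 1:]]
        dK, a, b = min(pairs, key=lambda t: t[0])        # minimum Delta K
        na, xa, Ka = clusters.pop(a)
        nb, xb, Kb = clusters.pop(b)
        K_new = Ka + Kb + dK                             # Eq. (24)
        clusters[a + b] = (na + nb, (na * xa + nb * xb) / (na + nb), K_new)
        print(f'merge {a} and {b}: dK = {dK:.2f}, d = {np.sqrt(K_new / n):.3f}')
    return clusters

# usage sketch: ward_cluster(X, list('abcdef')) with X of shape (20, 6)
```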
8. EXAMPLE OF WARD METHOD WITH N VARIABLES

We apply the Ward method and form clusters. Table 16 shows the customer-satisfaction (CS) data of members for various items, scored on a 5-point scale. The member number n is 20. In this case, the following six items influence the customer satisfaction.

a. Understanding customer
b. Prompt response
c. Flexible response
d. Producing ability
e. Providing useful information
f. Active proposal

We evaluate the clustering of the items using the 20 member scores. The index for items a-f is k (or l), and that for members 1-20 is i.
Step 1: Sum of Squares

The sum of squares is evaluated as

$$K_l = \sum_{i=1}^{n}\sum_{k=1}^{n_l}\left(x_{ilk} - \bar x_{il}\right)^2 \tag{36}$$

We consider an item a and evaluate the corresponding sum of squares. Since we consider the item a, l is a. Since we have not yet formed any cluster, k = a. There is only one element, and the mean is the same as the datum:

$$\bar x_{ia} = x_{ia} \tag{37}$$
Table 16. Satisfaction scores of members for various items
(a: Understanding customer, b: Prompt response, c: Flexible response, d: Producing ability, e: Providing useful information, f: Active proposal)

Member ID    a   b   c   d   e   f
1            4   4   3   4   5   4
2            4   3   4   3   3   2
3            3   1   2   3   2   1
4            5   5   5   5   5   5
5            4   4   4   4   4   4
6            2   2   2   2   2   2
7            5   5   5   5   5   5
8            4   3   3   3   3   3
9            2   1   1   1   1   2
10           2   1   1   1   1   3
11           4   5   5   5   5   5
12           3   3   3   2   2   2
13           3   3   3   3   3   3
14           3   4   4   3   3   3
15           4   3   3   3   3   3
16           1   1   1   1   1   1
17           5   5   5   5   5   5
18           4   1   1   1   3   3
19           4   4   3   3   4   3
20           4   4   4   3   3   3
Therefore, the sum of squares is given by

$$K_a = 0 \tag{38}$$

The sums of squares for the other items are the same:

$$K_a = K_b = \cdots = K_f = 0 \tag{39}$$

This corresponds to the initialization of K_l.

Step 2: Sum of the Variance for All Combinations

The combinations are as below.

a-b, a-c, a-d, a-e, a-f
b-c, b-d, b-e, b-f
c-d, c-e, c-f
d-e, d-f
e-f
We show the variance deviation evaluation for the combination of a and b. The corresponding variance deviation is given by

$$\Delta K_{ab} = \frac{n_a n_b}{n_a + n_b}\sum_{i=1}^{20}\left(\bar x_{ia} - \bar x_{ib}\right)^2 \tag{40}$$

where a and b are single items, and hence n_a = n_b = 1. This also means that the means equal the data themselves, that is, \bar x_{ia} = x_{ia} and \bar x_{ib} = x_{ib}. The corresponding sum of squares is 20:

$$\sum_{i=1}^{20}\left(x_{ia} - x_{ib}\right)^2 = 20 \tag{41}$$

Therefore, the variance deviation is

$$\Delta K_{ab} = \frac{n_a n_b}{n_a + n_b}\sum_{i=1}^{20}\left(x_{ia} - x_{ib}\right)^2 = \frac{1\times 1}{1 + 1}\times 20 = 10 \tag{42}$$

Table 17. Variance deviation of items a and b

Member ID    a   b   Squared deviation
1            4   4   0
2            4   3   1
3            3   1   4
4            5   5   0
5            4   4   0
6            2   2   0
7            5   5   0
8            4   3   1
9            2   1   1
10           2   1   1
11           4   5   1
12           3   3   0
13           3   3   0
14           3   4   1
15           4   3   1
16           1   1   0
17           5   5   0
18           4   1   9
19           4   4   0
20           4   4   0
Sum                  20
ΔK                   10
We can evaluate the other combinations as shown in Table 18.

Table 18. Variance deviation for the first step

Item    a      b      c      d      e
b      10
c       9      2
d       9      4      3
e       5.5    4.5    6.5    3.5
f       8      7      9      7      4.5
Step 3: First Cluster Formation

The minimum variance deviation is ΔK_{bc}. Therefore, we form a cluster of b and c. The corresponding distance d_{bc} is given by

$$d_{bc} = \sqrt{\frac{K_{bc}}{n}} = \sqrt{\frac{2}{20}} = 0.316 \tag{43}$$

The variance associated with the cluster bc is given by
$$K_{bc} = K_b + K_c + \Delta K_{bc} = 2 \tag{44}$$

The corresponding item number is given by

$$n_{bc} = 2 \tag{45}$$
Calculating the means for the cluster, we obtain the updated Table 19.

Table 19. Satisfaction scores of members for various items and the first cluster
(Cluster 1 = [bc]; the [bc] mean column represents items b and c.)

Member ID    a   b   c   [bc] mean   d   e   f
1            4   4   3   3.5         4   5   4
2            4   3   4   3.5         3   3   2
3            3   1   2   1.5         3   2   1
4            5   5   5   5           5   5   5
5            4   4   4   4           4   4   4
6            2   2   2   2           2   2   2
7            5   5   5   5           5   5   5
8            4   3   3   3           3   3   3
9            2   1   1   1           1   1   2
10           2   1   1   1           1   1   3
11           4   5   5   5           5   5   5
12           3   3   3   3           2   2   2
13           3   3   3   3           3   3   3
14           3   4   4   4           3   3   3
15           4   3   3   3           3   3   3
16           1   1   1   1           1   1   1
17           5   5   5   5           5   5   5
18           4   1   1   1           1   3   3
19           4   4   3   3.5         3   4   3
20           4   4   4   4           3   3   3

Step 4: Sum of Squares for the Second Time

We then evaluate the variance deviations of the combinations below.

a-[bc], a-d, a-e, a-f
[bc]-d, [bc]-e, [bc]-f
d-e, d-f
e-f
We show the evaluation of the variance deviation between a and [bc] below:

$$\Delta K_{a[bc]} = \frac{n_a n_{bc}}{n_a + n_{bc}}\sum_{i=1}^{20}\left(\bar x_{ia} - \bar x_{i[bc]}\right)^2 = \frac{1\times 2}{1 + 2}\times 18 = 12 \tag{46}$$

The corresponding data are given in Table 20.

Table 20. Variance deviation of item a and the cluster [bc]

Member ID    a   [bc] mean   Squared deviation
1            4   3.5         0.25
2            4   3.5         0.25
3            3   1.5         2.25
4            5   5           0
5            4   4           0
6            2   2           0
7            5   5           0
8            4   3           1
9            2   1           1
10           2   1           1
11           4   5           1
12           3   3           0
13           3   3           0
14           3   4           1
15           4   3           1
16           1   1           0
17           5   5           0
18           4   1           9
19           4   3.5         0.25
20           4   4           0
Sum                          18
ΔK                           12
We perform a similar analysis for the other combinations and obtain Table 21.
Table 21. Variance deviation for the second step

ID    a     bc     d     e
bc   12
d     9     4
e     5.5   6.67   3.5
f     8    10      7     4.5
Step 5: Second Cluster Formation

The minimum variance deviation is ΔK_{de}. Therefore, we form a cluster of d and e. The corresponding distance d_{de} is given by

$$d_{de} = \sqrt{\frac{K_{de}}{n}} = \sqrt{\frac{3.5}{20}} = 0.418 \tag{47}$$

The variance associated with the cluster de is given by

$$K_{de} = K_d + K_e + \Delta K_{de} = 3.5 \tag{48}$$

The corresponding item number is given by

$$n_{de} = 2 \tag{49}$$

We repeat the above process and obtain the tree diagram shown in Figure 3.

Figure 3. Tree diagram for the data.
SUMMARY

To summarize the results in this chapter: the distance between two members is given by

$$d(A, B) = \sqrt{\left(x_{A1} - x_{B1}\right)^2 + \left(x_{A2} - x_{B2}\right)^2}$$

We evaluate all combinations of the members, select the minimum one, and form a cluster. After forming a cluster, we modify the data for the cluster. The simplest modification is to use the average:

$$x_{AB1} = \frac{x_{A1} + x_{B1}}{2},\qquad x_{AB2} = \frac{x_{A2} + x_{B2}}{2}$$

Using these data, we evaluate the distances between the members. We can also use a modified distance such as

$$d = \left[\frac{1}{n}\sum_{i=1}^{n}\left|x_{iA} - x_{iB}\right|^{\lambda}\right]^{1/\lambda}$$
There is another method for forming clusters, called the Ward method. The Ward method forms a cluster from the standpoint that the increase in the sum of squares is minimized. Let us consider that we form a cluster lm by merging a cluster l and a cluster m. The clusters l and m have the data denoted as x_{ilk} and x_{imk}, and the merged member numbers are denoted as n_l and n_m. We then obtain the variables below.

$$K_l = \sum_{i=1}^{n}\sum_{k=1}^{n_l}\left(x_{ilk} - \bar x_{il}\right)^2$$

$$K_m = \sum_{i=1}^{n}\sum_{k=1}^{n_m}\left(x_{imk} - \bar x_{im}\right)^2$$

The merged variance is expressed by

$$K_{lm} = K_l + K_m + \Delta K_{lm}$$

The deviation of the variance is given by

$$\Delta K_{lm} = \frac{n_l n_m}{n_l + n_m}\sum_{i=1}^{n}\left(\bar x_{il} - \bar x_{im}\right)^2$$

We search for the combination that minimizes this deviation of the variance, and repeat the process.
Chapter 13

DISCRIMINANT ANALYSIS

ABSTRACT

We assume that we have two groups and reference data for which we clearly know to which group each datum belongs. When we obtain one set of data for a certain member or subject, discriminant analysis gives us the procedure to decide the group for that member or subject. The decision is made by evaluating the distance between the data and those of each group.

Keywords: distance, determinant, Mahalanobis' distance

1. INTRODUCTION

When we are in bad health, we go to a hospital and take some kinds of medical checks. The doctor has to decide whether the person is sick or not. This problem is generalized as follows. Two groups are set, and the corresponding data with the same items are provided. We consider the situation where we obtain one set of data for the items. Discriminant analysis helps us decide to which group it belongs. We perform matrix operations in this chapter; the basic matrix operations are shown in Chapter 15 of Volume 3.
2. DISCRIMINANT ANALYSIS WITH ONE VARIABLE

We want to judge whether a person is healthy or diseased using one inspection value x1. We define a distance between a datum and each of the two groups, and assign the person to the group with the shorter distance.
We show the inspection data for healthy and diseased people in Table 1. We inspect a person and judge whether he is healthy or diseased. First of all, we make two groups from the data in Table 1: healthy and diseased. We then evaluate the averages and variances for the groups as shown in Table 2.

Table 1. Inspection data for healthy and diseased people

No.   Kind      Inspection x1
1     Healthy   50
2     Healthy   69
3     Healthy   93
4     Healthy   76
5     Healthy   88
6     Disease   43
7     Disease   56
8     Disease   38
9     Disease   21
10    Disease   25

Table 2. Average and variance for healthy and diseased people

           Healthy   Disease
Average    75.2      36.6
Variance   230.96    159.44
The average and variance of the healthy people are denoted as μ_{H1} and σ_{H1}^2, and those of the diseased people as μ_{D1} and σ_{D1}^2. These are evaluated as below.

$$\mu_{H1} = \frac{1}{5}\left(50 + 69 + 93 + 76 + 88\right) = 75.2 \tag{1}$$

$$\sigma_{H1}^2 = \frac{1}{5}\left[\left(50 - 75.2\right)^2 + \left(69 - 75.2\right)^2 + \left(93 - 75.2\right)^2 + \left(76 - 75.2\right)^2 + \left(88 - 75.2\right)^2\right] = 230.96 \tag{2}$$

$$\mu_{D1} = \frac{1}{5}\left(43 + 56 + 38 + 21 + 25\right) = 36.6 \tag{3}$$

$$\sigma_{D1}^2 = \frac{1}{5}\left[\left(43 - 36.6\right)^2 + \left(56 - 36.6\right)^2 + \left(38 - 36.6\right)^2 + \left(21 - 36.6\right)^2 + \left(25 - 36.6\right)^2\right] = 159.44 \tag{4}$$
We assume that the healthy and diseased people's data follow normal distributions. The distribution for healthy people is given by

$$f\left(x_1\right) = \frac{1}{\sqrt{2\pi\sigma_{H1}^2}}\exp\left[-\frac{\left(x_1 - \mu_{H1}\right)^2}{2\sigma_{H1}^2}\right] \tag{5}$$

The distribution for diseased people is given by

$$f\left(x_1\right) = \frac{1}{\sqrt{2\pi\sigma_{D1}^2}}\exp\left[-\frac{\left(x_1 - \mu_{D1}\right)^2}{2\sigma_{D1}^2}\right] \tag{6}$$

Figure 1 shows the above explanation schematically.

Figure 1. The probability distribution for healthy and diseased people.
When we have a data value x1, we want to judge whether the person is healthy or diseased. We define the distances as below.

$$D_H^2 = \frac{\left(x_1 - \mu_{H1}\right)^2}{\sigma_{H1}^2} \tag{7}$$

$$D_D^2 = \frac{\left(x_1 - \mu_{D1}\right)^2}{\sigma_{D1}^2} \tag{8}$$

We can judge whether the person is healthy or diseased as

$$\begin{cases} D_H^2 < D_D^2 & \text{Healthy}\\ D_H^2 > D_D^2 & \text{Disease} \end{cases} \tag{9}$$

We evaluate the value of x1 where both distances are the same, and denote it as c. We then have the relationship

$$\frac{\mu_{H1} - c}{\sigma_{H1}} = \frac{c - \mu_{D1}}{\sigma_{D1}} \tag{10}$$

We then obtain

$$c = \frac{\mu_{H1}\sigma_{D1} + \mu_{D1}\sigma_{H1}}{\sigma_{H1} + \sigma_{D1}} \tag{11}$$

In this case, we obtain

$$c = \frac{75.2\sqrt{159.44} + 36.6\sqrt{230.96}}{\sqrt{230.96} + \sqrt{159.44}} = 54.1 \tag{12}$$

Summarizing the above, we obtain general expressions for the decision as

$$\begin{cases} D_D^2 - D_H^2 > 0 \;\left(x_1 > c\right) & \text{Healthy}\\ D_D^2 - D_H^2 < 0 \;\left(x_1 < c\right) & \text{Disease} \end{cases} \tag{13}$$
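A short sketch that reproduces the numbers of this section from the Table 1 data (NumPy's default var is the population variance used here; the test value x1 = 60 is a hypothetical example of ours):

```python
import numpy as np

healthy = np.array([50, 69, 93, 76, 88], float)
disease = np.array([43, 56, 38, 21, 25], float)
mu_h, var_h = healthy.mean(), healthy.var()        # 75.2, 230.96 (Eqs. 1-2)
mu_d, var_d = disease.mean(), disease.var()        # 36.6, 159.44 (Eqs. 3-4)
sd_h, sd_d = np.sqrt(var_h), np.sqrt(var_d)

c = (mu_h * sd_d + mu_d * sd_h) / (sd_h + sd_d)    # equal-distance point, ~54.1
x1 = 60.0                                          # hypothetical inspection value
print('healthy' if (x1 - mu_h) ** 2 / var_h < (x1 - mu_d) ** 2 / var_d
      else 'disease')
```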
3. DISCRIMINANT ANALYSIS WITH TWO VARIABLES

We consider two inspections 1 and 2, and two groups A and B, to extend the analysis to many variables. We here introduce vectors: the two inspection items are expressed by a vector

$$\boldsymbol{x} = \begin{pmatrix} x_1\\ x_2\end{pmatrix} \tag{14}$$

The vectors associated with the averages are given by

$$\boldsymbol{\mu}_A = \begin{pmatrix} \mu_{1A}\\ \mu_{2A}\end{pmatrix},\qquad \boldsymbol{\mu}_B = \begin{pmatrix} \mu_{1B}\\ \mu_{2B}\end{pmatrix} \tag{15}$$

The covariance matrices are given by

$$\Sigma_A = \begin{pmatrix} \sigma_{11A}^2 & \sigma_{12A}^2\\ \sigma_{21A}^2 & \sigma_{22A}^2\end{pmatrix},\qquad \Sigma_B = \begin{pmatrix} \sigma_{11B}^2 & \sigma_{12B}^2\\ \sigma_{21B}^2 & \sigma_{22B}^2\end{pmatrix} \tag{16}$$

The determinants for A and B are given by

$$\left|\Sigma_A\right| = \sigma_{11A}^2\sigma_{22A}^2 - \sigma_{12A}^2\sigma_{21A}^2 \tag{17}$$

$$\left|\Sigma_B\right| = \sigma_{11B}^2\sigma_{22B}^2 - \sigma_{12B}^2\sigma_{21B}^2 \tag{18}$$

The distribution of the two variables for the group A is given by

$$f_A\left(x_1, x_2\right) = \frac{1}{2\pi\left|\Sigma_A\right|^{1/2}}\exp\left(-\frac{D_A^2}{2}\right) \tag{19}$$

where

$$D_A^2 = \left(x_1 - \mu_{1A},\; x_2 - \mu_{2A}\right)\begin{pmatrix} \sigma_A^{11} & \sigma_A^{12}\\ \sigma_A^{21} & \sigma_A^{22}\end{pmatrix}\begin{pmatrix} x_1 - \mu_{1A}\\ x_2 - \mu_{2A}\end{pmatrix} \tag{20}$$

and the σ_A^{ij} are the elements of the inverse covariance matrix:

$$\begin{pmatrix} \sigma_A^{11} & \sigma_A^{12}\\ \sigma_A^{21} & \sigma_A^{22}\end{pmatrix} = \begin{pmatrix} \sigma_{11A}^2 & \sigma_{12A}^2\\ \sigma_{21A}^2 & \sigma_{22A}^2\end{pmatrix}^{-1} \tag{21}$$

The Mahalanobis distance associated with the group A is denoted as d_A and is given by

$$d_A = \sqrt{D_A^2} \tag{22}$$

The Mahalanobis distance associated with the group B is denoted as d_B and is given by

$$d_B = \sqrt{D_B^2} \tag{23}$$

where

$$D_B^2 = \left(x_1 - \mu_{1B},\; x_2 - \mu_{2B}\right)\begin{pmatrix} \sigma_B^{11} & \sigma_B^{12}\\ \sigma_B^{21} & \sigma_B^{22}\end{pmatrix}\begin{pmatrix} x_1 - \mu_{1B}\\ x_2 - \mu_{2B}\end{pmatrix} \tag{24}$$

We evaluate d_A and d_B, and decide to which group the data belong.

4. DISCRIMINANT ANALYSIS FOR MULTI VARIABLES

We extend the analysis in the previous section to a multi-variable case, where we assume p variables. The data are expressed by a vector having p elements:

$$\boldsymbol{x} = \begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_p\end{pmatrix} \tag{25}$$

It follows the normal distribution N(μ, Σ), which is expressed by

$$f\left(x_1, x_2, \dots, x_p\right) = \frac{1}{\left(2\pi\right)^{p/2}\left|\Sigma\right|^{1/2}}\exp\left(-\frac{D^2}{2}\right) \tag{26}$$

where

$$\Sigma = \begin{pmatrix} \sigma_{11}^2 & \sigma_{12}^2 & \cdots & \sigma_{1p}^2\\ \sigma_{21}^2 & \sigma_{22}^2 & \cdots & \sigma_{2p}^2\\ \vdots & & & \vdots\\ \sigma_{p1}^2 & \sigma_{p2}^2 & \cdots & \sigma_{pp}^2\end{pmatrix} \tag{27}$$

$$D^2 = \left(x_1 - \mu_1, x_2 - \mu_2, \dots, x_p - \mu_p\right)\begin{pmatrix} \sigma^{11} & \sigma^{12} & \cdots & \sigma^{1p}\\ \sigma^{21} & \sigma^{22} & \cdots & \sigma^{2p}\\ \vdots & & & \vdots\\ \sigma^{p1} & \sigma^{p2} & \cdots & \sigma^{pp}\end{pmatrix}\begin{pmatrix} x_1 - \mu_1\\ x_2 - \mu_2\\ \vdots\\ x_p - \mu_p\end{pmatrix} = \sum_{i=1}^{p}\sum_{j=1}^{p}\left(x_i - \mu_i\right)\sigma^{ij}\left(x_j - \mu_j\right) \tag{28}$$

where the σ^{ij} are the elements of Σ^{-1}. We also define the vector of averages

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1\\ \mu_2\\ \vdots\\ \mu_p\end{pmatrix} \tag{29}$$

The distance d is given by

$$d = \sqrt{D^2} \tag{30}$$

We evaluate this distance with respect to each group, and select the group with the shorter distance.
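A minimal sketch of this decision rule, assuming the group means and covariance matrices are given (the names x, mu_A, cov_A, mu_B, and cov_B are hypothetical):

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """Mahalanobis distance d between a data vector x and a group with
    mean vector mu and covariance matrix cov (Eqs. (28) and (30))."""
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# usage sketch, with hypothetical group statistics:
# group = 'A' if mahalanobis(x, mu_A, cov_A) < mahalanobis(x, mu_B, cov_B) else 'B'
```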
SUMMARY

To summarize the results in this chapter: the distance between a member and a group can be evaluated as

$$d = \sqrt{D^2}$$

where

$$D^2 = \left(x_1 - \mu_1, x_2 - \mu_2, \dots, x_p - \mu_p\right)\begin{pmatrix} \sigma^{11} & \sigma^{12} & \cdots & \sigma^{1p}\\ \sigma^{21} & \sigma^{22} & \cdots & \sigma^{2p}\\ \vdots & & & \vdots\\ \sigma^{p1} & \sigma^{p2} & \cdots & \sigma^{pp}\end{pmatrix}\begin{pmatrix} x_1 - \mu_1\\ x_2 - \mu_2\\ \vdots\\ x_p - \mu_p\end{pmatrix}$$

We evaluate the distance with respect to each group, and select the group with the smaller distance.
ABOUT THE AUTHOR

Kunihiro Suzuki, PhD
Fujitsu Limited, Tokyo, Japan
Email: [email protected]
Kunihiro Suzuki was born in Aomori, Japan in 1959. He received his BS, MS, and PhD degrees in electronic engineering from the Tokyo Institute of Technology, Tokyo, Japan, in 1981, 1983, and 1996, respectively. He joined Fujitsu Laboratories Ltd., Atsugi, Japan in 1983 and was engaged in the design and modeling of high-speed bipolar and MOS transistors. He studied process modeling as a visiting researcher at the Swiss Federal Institute of Technology, Zurich, Switzerland in 1996 and 1997. He moved to Fujitsu Limited, Tokyo, Japan in 2010, where he was engaged in a division responsible for supporting the sales divisions. His current interests are statistics and queuing theory for business. His research covers theory and technology in both semiconductor devices and processes. To analyze and fabricate high-speed devices, he also organizes a group that includes physicists, mathematicians, process engineers, system engineers, and members for analyses such as SIMS and TEM. The combination of theory and experiment and the aid of various members enable his group to carry out various original works. His models and experimental data are systematic, valid over a wide range of conditions, and able to contribute to academic and practical product fields. He is the author and co-author of more than 100 refereed papers in journals, more than 50 papers in international technical conference proceedings, and more than 90 papers in domestic technical conference proceedings.