Colorful Statistics with Basic Steps in Minitab® 19
Usman Zafar Paracha M. Phil. Pharmaceutics, Rawalpindi, Pakistan (2019)
Colorful Statistics and Minitab® 19 – Usman Zafar Paracha
This book will help students learn and apply some basic concepts of statistics using Minitab® 19.
Any Feedback will be Highly Appreciated.
Usman Zafar Paracha Owner of SayPeople.com [email protected] https://www.facebook.com/usmanzparacha
Some words from the author
I tried to make this book on statistics as informative and illustrative as possible, especially for beginners in statistics.
"Portions of information contained in this publication/book are printed with permission of Minitab Inc. All such material remains the exclusive property and copyright of Minitab Inc. All rights reserved."
"MINITAB® and all other trademarks and logos for the Company's products and services are the exclusive property of Minitab Inc. All other marks referenced remain the property of their respective owners. See minitab.com for more information."
This book may also contain some trademarked names without the trademark symbol. However, they are used only in an editorial context, and there is no intention of trademark infringement.
It is important to note that the calculations and examples used in this book cannot take the place of actual research. Statistics has to be used under the guidance of experts.
People with authentic comments and/or feedback (on Amazon) can ask me questions or send me a “Message” here: https://www.facebook.com/usmanzparacha, and I will try to answer them.
Contents
Hypothesis
Hypothesis testing
Frequency
  Using Minitab
Sample and Population
Type I Error and Type II Error
Measuring the Central Tendencies (mean, median, and mode), and Range
  Using Minitab
Geometric Mean
  Using Minitab
Grand Mean
  Using Minitab
Harmonic Mean
  Using Minitab
Mean Deviation
  Using Minitab
Mean Difference
  Using Minitab
Root mean square
  Using Minitab
Sample mean and population mean
Types of data in statistics
Data Collection and its types
Sampling plan
Sampling Methods
Required Sample Size
  Using Minitab
Simple Random Sampling
Cluster Sampling
Stratified Sampling
Graphs and Plots
  Bar Graph
    Using Minitab
  Pie Chart
    Using Minitab
  Scatter plot or Bubble chart
  Different patterns of data in bubble charts
    Using Minitab
  Dot plot
    Using Minitab
  Matrix plot
    Using Minitab
  Pareto Chart
    Using Minitab
  Histogram
    Using Minitab
  Stem and Leaf plot
    Using Minitab
  Box plot
    Using Minitab
Outlier
  Using Minitab
Quantile
  Using Minitab
Standard deviation and Variance
  Using Minitab
Range Rule of Thumb
Probability
Bayesian statistics
  Using Minitab
Reliability Coefficient
Cohen's Kappa Coefficient
Fleiss’ Kappa Coefficient
  Using Minitab
Cronbach's alpha
  Using Minitab
Coefficient of variation
  Using Minitab
Chebyshev's Theorem
  Using Minitab
Factorial
  Using Minitab
Distribution, and Standardization
  Using Minitab
Prediction interval
  Using Minitab
Tolerance interval
  Using Minitab
Parameters to describe the form of a distribution
  Skewness
  Kurtosis
    Using Minitab
Different functions of distributions
  Probability Density Function
    Using Minitab
  Cumulative Distribution Function
    Using Minitab
Types of Distribution
  Binomial Distribution
    Using Minitab
  Chi-squared Distribution
    Using Minitab
  Continuous Uniform Distribution
    Using Minitab
  Cumulative Poisson Distribution
    Using Minitab
  Exponential Distribution
    Using Minitab
  Normal Distribution
    Using Minitab
  Poisson Distribution
    Using Minitab
  Beta Distribution
    Using Minitab
  F Distribution
    Using Minitab
  Gamma Distribution
    Using Minitab
  Negative Binomial Distribution
    Using Minitab
  Gumbel Distribution
    Using Minitab
  Hypergeometric Distribution
    Using Minitab
  Inverse Gamma Distribution
    Using Minitab
  Log Gamma Distribution
    Using Minitab
  Laplace Distribution
    Using Minitab
  Geometric Distribution
    Using Minitab
Level of Significance and confidence level
Statistical Estimation
Interval Estimation
  Using Minitab
Best Point estimation
  Using Minitab
Correlation
Central Limit Theorem
Standard Error of the Mean
  Using Minitab
Statistical Significance
Tests for Non-normally distributed data
One tail and two tail tests
  Mood’s Median Test
    Using Minitab
  Goodness of Fit
  Chi-square test
    Using Minitab
  McNemar Test
    Using Minitab
  Kolmogorov-Smirnov Test (KS-Test)
    Using Minitab
  One-tailed test, two-tailed test, and Wilcoxon Rank Sum Test / Mann Whitney U Test
    Using Minitab
  The Sign Test
    Using Minitab
  Wilcoxon Signed Rank Test
    Using Minitab
  The Kruskal-Wallis Test
    Using Minitab
    Degrees of Freedom
  The Friedman Test
    Using Minitab
Tests for Normally distributed data
  Unpaired “t” test
    Using Minitab
  Paired “t” test
    Using Minitab
  Analysis of Variance (ANOVA)
    Sum of Squares
    Residual Sum of Squares
  One way ANOVA
    Using Minitab
  Two way ANOVA
    Using Minitab
  Different types of ANOVAs
    Using Minitab – General MANOVA
Factor Analysis
Path Analysis
Structural Equation Modeling
Effect size
  Using Minitab
Odds Ratio and Mantel-Haenszel Odds Ratio
  Using Minitab
Correlation Coefficient
  Using Minitab
R-squared and Adjusted R-squared
Regression Analysis
  Using Minitab
Logistic Regression
  Using Minitab
Black-Scholes model
  Using Minitab
Combination
  Using Minitab
Permutation
  Using Minitab
Even and Odd Permutations
Circular Permutation
Survival Analysis
  Kaplan-Meier method
    Using Minitab
Bonus Topics
  Most commonly used non-normal distributions in health, education, and social sciences
  Circular permutation in Nature
  Time Series
    Using Minitab
  Monte Carlo Simulation
  Density Estimation
  Decision Tree
  Meta-analysis
  Important Statistical Techniques/Procedures used in Medical Research
Hypothesis
Statistics is an art as well as a science, and it needs the help of both fields for explanation. You need the strong imagination of an artist and the well-designed research of a scientist. Let’s start with the concept of hypotheses.
According to an ancient Chinese myth, there is an entirely different world behind mirrors. That world has its own creatures, known as the fauna of mirrors. If a person believes that the fauna of mirrors actually exist and wants to research them, the hypothesis will be that the fauna of mirrors actually exist. This is known as the “research hypothesis” or “alternative hypothesis”, as the person is doing research on this hypothesis. It is represented by Ha or H1.
On the other hand, there is another opinion that there is nothing like the fauna of mirrors. This opinion can be considered the “null hypothesis”, as it negates the research hypothesis and states that the research hypothesis is not a commonly observed phenomenon. The null hypothesis is represented by Ho or H0. Research is usually considered successful when the null hypothesis is rejected. So, to support the existence of the fauna of mirrors after performing research, the null hypothesis has to be rejected.
In short, according to the null hypothesis, nothing has changed and no significant new thing can be found (anywhere in any group); according to the alternative hypothesis, some significant change has occurred or some significant new thing can be found (somewhere or in some group).
Hypothesis testing
Frequency
Absolute frequency: the number of times an event or incidence occurred, usually denoted ni.
Relative frequency: the number of times a specific event occurred divided by the total number of events N, i.e. fi = ni / N.
Cumulative frequency: the sum of all previously presented frequencies, i.e. n1 + n2 + ... + ni.
Using Minitab
In order to work on the absolute frequency, relative frequency, and cumulative frequency, we can use these numbers: 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5. Enter the values in the first column, i.e. from row 1 to row 15 in this case. Go to the “Stat” tab above, go to “Tables”, and click “Tally Individual Variables”.
In the window “Tally Individual Variables”, move column 1, i.e. C1 in this case, from the left box to “Variables:” in the right box by clicking on C1 and then clicking the “Select” button below. You may also double-click C1, which moves it to the box automatically. Check “Counts”, “Percents”, and “Cumulative counts” under “Display.” Click “OK.” The results appear on the screen above the spreadsheet. They show “Count” for absolute frequency, “Percent” for relative frequency expressed as a percentage (i.e. 0.2 x 100 = 20 in the first place, approximately 0.13 x 100 = 13 in the second place, and so on), and “CumCnt” for cumulative frequency.
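The same tally can be checked outside Minitab. The following is a minimal Python sketch, not part of the Minitab workflow; the function name `tally` and the printed layout are assumptions made for illustration. It computes the absolute frequency, the relative frequency as a percentage, and the cumulative frequency for the same 15 numbers.

```python
from collections import Counter

def tally(values):
    """Return rows of (value, count, percent, cumulative_count), sorted by value."""
    counts = Counter(values)          # absolute frequency of each value
    total = len(values)               # N, the total number of observations
    rows = []
    running = 0                       # cumulative frequency so far
    for value in sorted(counts):
        running += counts[value]
        percent = 100 * counts[value] / total   # relative frequency x 100
        rows.append((value, counts[value], percent, running))
    return rows

data = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5]
rows = tally(data)

print(f"{'Value':>5} {'Count':>5} {'Percent':>7} {'CumCnt':>6}")
for value, count, percent, cum in rows:
    print(f"{value:>5} {count:>5} {percent:>7.2f} {cum:>6}")
```

The column names mirror the labels in the Minitab output described above, so the two results can be compared row by row.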
Sample and Population
A sample refers to a small part of something that is used to represent the whole (population).
[Figure: a population (N) and two samples (n), one obtained by random selection and one obtained by selection of only blue colored samples. Samples help in making inferences about the population.]
Type I Error and Type II Error
Consider the hypotheses mentioned earlier. Suppose initial findings on the hypotheses show that fauna of mirrors actually exists. So, it can be said that the null hypothesis is rejected. However, it is important to note that the results from initial findings could be false (or they could be true) due to the presence of errors in the research. Those errors are known as α error and β error.
α error is also known as Type I error or False positive. In this condition, we incorrectly reject the null hypothesis, i.e. the statement “there is nothing like fauna of mirrors” seems wrong after performing an experiment, when in reality it is right. It is considered a False positive because, in this condition, we think that the alternative hypothesis is right (positive). The risk of this type of error can be reduced by performing further tests. Moreover, lowering the level of significance (α) also reduces the chance of a type I error.
On the other hand, β error is also known as Type II Error or False negative. In this condition, it is possible that we incorrectly accept the null hypothesis.
Type II error can be more serious than type I error because, after this error, nobody may do further research on the alternative hypothesis. Errors of a higher kind could also be present. For example, a type III error occurs when a researcher or an investigator gives the right answer to the wrong question. It is also said to occur when an investigator or researcher correctly rejects the null hypothesis for the wrong reason(s).
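Measuring the Central Tendencies (mean, median, and mode), and Range
Mean: the sum of all data values divided by the number of data values.
Sample mean: $\bar{x} = \frac{\sum x}{n}$, where n is the number of data values in the sample.
Population mean: $\mu = \frac{\sum x}{N}$, where N is the number of data values in the population.
Example data (13 values): 1, 2, 2, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11.
Mean = (1 + 2 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 8 + 9 + 10 + 11) / 13 = 76 / 13 = 5.846
Median: the middle value (or the mean of two middle values), with 50% of the values below it and 50% above it. Here, the median is 6.
Mode: the value/s which appear most often. Here, the modes are 2 and 8.
Range = Highest value - Lowest value = 11 - 1 = 10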
Using Minitab
In order to find the mean, median, mode, and range, we can use the numbers (as noted above). So, the numbers are 1,2,2,3,4,5,6,7,8,8,9,10, and 11. Enter the values in the first column, i.e. from row 1 to row 13 in this case. Go to “Stat” tab above, go to “Basic Statistics”, and click “Display Descriptive Statistics”. In the window “Display Descriptive Statistics”, move column 1, i.e. C1 in this case, from the left box to the “Variables:” in the right box by clicking on C1 and clicking the “Select” button below.
Click “Statistics”. A window “Display Descriptive Statistics: Statistics” appears. Check “Mean”, “Median”, “Mode”, and “Range”, and any other descriptive statistics you want. Click “OK.” Again click “OK.” Results appear in the screen above the spreadsheet. In this case, the results are mean = 5.846; median = 6; mode = 2, 8; and range = 10.
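These four results can also be cross-checked with Python's standard statistics module. This is only a minimal sketch on the thirteen example numbers; the variable names are chosen here for illustration.

```python
import statistics

# The thirteen example values entered in column C1.
data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11]

mean = statistics.mean(data)          # sum of values / number of values
median = statistics.median(data)      # middle value of the sorted data
modes = statistics.multimode(data)    # every value that appears most often
value_range = max(data) - min(data)   # highest value minus lowest value

print(f"mean = {mean:.3f}")
print(f"median = {median}")
print(f"mode(s) = {modes}")
print(f"range = {value_range}")
```

multimode is used instead of mode because this data set has two equally frequent values.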
Geometric Mean
Geometric mean refers to the nth root of the product of a set of data having n numbers, for example, the square root of the product of 2 numbers, the cube root of the product of 3 numbers, and so on.
Using Minitab
In order to find the geometric mean, we can use the example of the percentage increase in the price of land (as noted above). So, the increases are 10% in the first instance, i.e. (making 110% or) 1.1; then 20%, i.e. (making 120% or) 1.2; and finally 50%, i.e. (making 150% or) 1.5. Enter the values, 1.1, 1.2, and 1.5, in the first column, i.e. from row 1 to row 3 in this case. Go to “Calc” tab above, and click “Calculator”.
In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C5. In the box below “Expression:” enter “GMEAN(C1)” and click “OK.” The result will appear in the spreadsheet. In this case, the result (geometric mean) is 1.25571. So, this is the geometric mean of the growth factors in our example, i.e. an average increase of about 25.57% per instance.
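The definition (nth root of the product of n numbers) can be written out directly. The following Python sketch is for illustration only; the function name geometric_mean is an assumption and simply mirrors what “GMEAN(C1)” is used for here.

```python
import math

# Growth factors for increases of 10%, 20%, and 50%.
factors = [1.1, 1.2, 1.5]


def geometric_mean(values):
    """nth root of the product of n positive numbers."""
    return math.prod(values) ** (1 / len(values))


gm = geometric_mean(factors)
print(f"geometric mean = {gm:.5f}")
```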
Grand Mean
The grand mean of a set of samples is the sum of all the values in the data divided by the total number of values. When all samples are of the same size, it equals the mean of the sample means.
Using Minitab
Although Minitab has no direct function for the calculation of the grand mean, it can easily be calculated by using the “Calculator”. In order to calculate the grand mean, we can take the example as noted above. So,
City      Person # 1   Person # 2   Person # 3   Person # 4   Person # 5
City A    10           60           70           80           40
City B    50           20           80           120          60
City C    80           30           90           100          70
Enter the values in the columns. In this case, it is C1, C2, C3, C4, and C5. Go to “Calc” tab above, and click “Calculator”.
In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C10. In the box below “Expression:” enter “(Mean(C1) + Mean(C2) + Mean(C3) + Mean(C4) + Mean(C5))/5” and click “OK.” The result will appear in the spreadsheet in the column C10. In this case, the result (grand mean) is 64.
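As a cross-check of the value 64, the grand mean can be computed both ways in Python: from all fifteen values at once, and as the mean of the five column means (which is what the Calculator expression does). This is only an illustrative sketch; the dictionary layout is an assumption.

```python
# Values per city, ordered as Person # 1 to Person # 5.
cities = {
    "City A": [10, 60, 70, 80, 40],
    "City B": [50, 20, 80, 120, 60],
    "City C": [80, 30, 90, 100, 70],
}

# Way 1: sum of all values divided by the total number of values.
all_values = [v for values in cities.values() for v in values]
grand_mean = sum(all_values) / len(all_values)

# Way 2: mean of the column (person) means, as in the Calculator expression.
person_means = [sum(col) / len(col) for col in zip(*cities.values())]
mean_of_means = sum(person_means) / len(person_means)

print(f"grand mean = {grand_mean}")
print(f"mean of column means = {mean_of_means:.4f}")
```

The two approaches agree here because every column holds the same number of values.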
Harmonic Mean
Harmonic mean is the number of observations in a data set divided by the sum of the reciprocals of the observations, i.e. of the numbers in the series.
Using Minitab
In order to calculate the harmonic mean, we can take the example as noted above. So, the numbers are 5, 8, and 13. Enter the values in the column C1 from row 1 to row 3. Go to “Calc” tab above, and click “Calculator”.
In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C10. In the box below “Expression:” enter “N(C1)/(sum(1/C1))” and click “OK.” The result will appear in the spreadsheet in the column C10. In this case, the result (harmonic mean) is 7.46411. In case of weighted harmonic mean, in which the value 5 counts twice, we enter the values 5, 5, 8, and 13 in C1 in the rows from 1 to 4, and use “N(C1)/(sum(1/C1))” in the box below “Expression:” in the window “Calculator.” Click “OK.” The result will appear in the spreadsheet in the specified column, i.e. C10. In this case, the result (weighted harmonic mean) is 6.64537.
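The expression “N(C1)/(sum(1/C1))” translates almost word for word into Python. The sketch below is illustrative only; the function name harmonic_mean is chosen here, and the standard library's statistics.harmonic_mean is used as an independent check.

```python
import statistics

values = [5, 8, 13]
weighted_values = [5, 5, 8, 13]  # the value 5 counted twice


def harmonic_mean(data):
    """Number of observations divided by the sum of their reciprocals."""
    return len(data) / sum(1 / x for x in data)


hm = harmonic_mean(values)
weighted_hm = harmonic_mean(weighted_values)
library_hm = statistics.harmonic_mean(values)  # independent cross-check

print(f"harmonic mean = {hm:.5f}")
print(f"weighted harmonic mean = {weighted_hm:.5f}")
```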
Mean Deviation
Mean of absolute deviations of observed or variable values from the mean value.
$\text{Mean Deviation} = \frac{\sum |x - \bar{x}|}{N}$
where x is the variable value, $\bar{x}$ is the mean value, N is the number of observations or values, and the vertical bars (||) represent absolute value, i.e. ignoring minus signs.
Example
Suppose we have three values: 3, 4, 6. Their mean is
$\bar{x} = \frac{3 + 4 + 6}{3} = 4.33$
The sum of the absolute differences is
$\sum |x - \bar{x}| = |3 - 4.33| + |4 - 4.33| + |6 - 4.33| = 3.33$
Mean deviation is
$\text{Mean Deviation} = \frac{3.33}{3} = 1.11$
Using Minitab
In order to work on Minitab, we can use the same example as noted above. So, the example numbers are 3, 4, and 6. Enter the values in the column C1 in the rows from 1 to 3.
Go to “Calc” tab above, and click “Calculator”.
In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C10. In the box below “Expression:” enter “(ABS(3-Mean(C1))+ABS(4-Mean(C1))+ABS(6-Mean(C1)))/3”. Here, ABS is the absolute value function, which changes all negative values into positive values. Click “OK.” The result will appear in the spreadsheet in the column C10. In this case, the result (mean deviation) is 1.111.
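The same calculation can be written so that it works for any number of values instead of typing each value into the expression. This Python sketch is illustrative only; the function name mean_deviation is an assumption.

```python
values = [3, 4, 6]


def mean_deviation(data):
    """Mean of the absolute deviations of the values from their mean."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)


md = mean_deviation(values)
print(f"mean deviation = {md:.3f}")
```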
Mean Difference
It is also known as difference in means. It determines the absolute difference between the two groups’ mean values, thereby helping in knowing the change on an average. So,
Mean difference = Mean of one group - Mean of the other group
Using Minitab
Mean difference can easily be calculated in Minitab. Suppose in one group, we have the numbers 3, 4, and 6, and in the other group, we have the numbers 2, 3, and 4. Enter the numbers of one group in C1 and of the other group in C2. Go to “Calc” tab above, and click “Calculator”.
In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C10. In the box below “Expression:” enter “Mean(C1)-Mean(C2)”. Click “OK.” The result will appear in the spreadsheet in the column C10. In this case, the result (mean difference) is 1.333. Note that entering “C1-C2” instead gives the row-by-row differences, i.e. 1, 1, and 2, whose mean is also 1.333.
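The distinction between the difference of the group means and the row-by-row differences can be seen in a few lines of Python. This is only an illustrative sketch with the two example groups; the variable names are assumptions.

```python
group_one = [3, 4, 6]
group_two = [2, 3, 4]

mean_one = sum(group_one) / len(group_one)
mean_two = sum(group_two) / len(group_two)

# Mean difference = mean of one group - mean of the other group.
mean_difference = mean_one - mean_two

# Row-by-row differences, which only make sense for paired data.
row_differences = [a - b for a, b in zip(group_one, group_two)]

print(f"mean difference = {mean_difference:.3f}")
print(f"row-by-row differences = {row_differences}")
```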
Root mean square
Root mean square is the square root of the mean of the squares of the observations or values, i.e. the sum of the squares is divided by the total number of values and the square root of the result is taken.
Using Minitab
In order to work on Minitab, we can use the numbers 3, 4, and 6 as an example. Enter the values in the column C1 in the rows from 1 to 3. Go to “Calc” tab above, and click “Calculator”.
In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C10. In the box below “Expression:” enter “sqrt(Mean(C1^2))”. Click “OK.”
The result will appear in the spreadsheet in the column C10. In this case, the result (root mean square) is 4.50925.
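The expression “sqrt(Mean(C1^2))” corresponds to the following Python sketch, given here only as an illustration; the function name root_mean_square is an assumption.

```python
import math

values = [3, 4, 6]


def root_mean_square(data):
    """Square root of the mean of the squared values."""
    return math.sqrt(sum(x * x for x in data) / len(data))


rms = root_mean_square(values)
print(f"root mean square = {rms:.5f}")
```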
Sample mean and population mean
Sample mean and population mean are two different things. Sample mean is represented by x̄, and population mean is represented by μ. Population consists of all the elements from a collection of data (N), and sample consists of some observations from that data (n).
Types of data in statistics
Data Collection and its types
Data collection is the process of collecting information in relation to selected variables. There are two main types: primary data and secondary data.
Sampling plan
It is the detailed outline of study measurements.
Sampling Methods
Sampling methods are of two main types: (1) Probability sampling and (2) Non-probability sampling.
Probability Sampling (known probability or chance of samples to be selected) includes:
Simple Random Sampling: each sample has an equal chance of being selected.
Stratified Sampling: proportionate stratified sampling or disproportionate stratified sampling.
Systematic Sampling: random selection of nth sample and then selecting every nth sample in succession.
Cluster Sampling: one stage cluster sampling, two stage cluster sampling, or multi-stage cluster sampling.
Multistage Sampling: combination of two or more sampling methods.
Non-probability Sampling (no known probability or chance of samples to be selected; considered as “sampling disasters”) includes:
Volunteer Samples.
Convenience (haphazard) Samples.
Required Sample Size
Sample size can be determined either through a subjective approach or through a mathematical approach.
Using Minitab
Cochran’s sample size formula
In order to work on Cochran’s sample size formula in Minitab, we can develop an example. Suppose a researcher wants to know how many people in a city of 10,000 people have cars, with at least 5% precision. Suppose also that we assume, with 95% confidence, that 60% of all the people have cars. Then the z-value at the 95% confidence level will be 1.96 per the normal table. Go to “Calc” tab above, and click “Calculator”.
In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C10. In the box below “Expression:” enter “(((1.96)^2)*0.6*0.4)/0.05^2”, i.e. $n_0 = z^2 p (1 - p) / e^2$ with z = 1.96, p = 0.6, and e = 0.05. Click “OK.” The result will appear in the spreadsheet in the column C10. In this case, the result (sample size) is 368.794 or simply 369. This result is according to Cochran’s sample size formula.
Cochran’s sample size formula when the population size is small
In order to work on Cochran’s sample size formula when the population size is small, for example 1000 people, go to “Calc” tab above, and click “Calculator”.
In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C11. In the box below “Expression:” enter “369/(1+(368/1000))”, i.e. $n = n_0 / (1 + (n_0 - 1)/N)$ with $n_0$ = 369 and N = 1000. Click “OK.” The result will appear in the spreadsheet in the column C11. In this case, the result (sample size) is 269.737 or simply 270. This result is according to the modification of Cochran’s sample size formula when the population size is small.
Slovin’s formula
In order to work on Slovin’s formula in Minitab, go to “Calc” tab above, and click “Calculator”.
In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C12. In the box below “Expression:” enter “10000/(1+10000*0.05^2)”. Click “OK.” The result will appear in the spreadsheet in the column C12. In this case, the result (sample size) is 384.615 or simply 385. This result is according to the Slovin’s formula.
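The three Calculator expressions used in this section can be collected into small functions. The following Python sketch is for illustration only; the function names cochran, cochran_small_population, and slovin are assumptions, and the inputs are the ones from the example (z = 1.96, p = 0.6, e = 0.05).

```python
import math


def cochran(z, p, e):
    """Cochran's sample size: z^2 * p * (1 - p) / e^2."""
    return z ** 2 * p * (1 - p) / e ** 2


def cochran_small_population(n0, population):
    """Cochran's correction for a small population: n0 / (1 + (n0 - 1) / N)."""
    return n0 / (1 + (n0 - 1) / population)


def slovin(population, e):
    """Slovin's formula: N / (1 + N * e^2)."""
    return population / (1 + population * e ** 2)


n0 = cochran(z=1.96, p=0.6, e=0.05)
n0_rounded = math.ceil(n0)  # sample sizes are rounded up to whole people
n_small = cochran_small_population(n0_rounded, population=1000)
n_slovin = slovin(population=10000, e=0.05)

print(f"Cochran: {n0:.3f} -> {n0_rounded}")
print(f"Cochran, N = 1000: {n_small:.3f} -> {math.ceil(n_small)}")
print(f"Slovin, N = 10000: {n_slovin:.3f} -> {math.ceil(n_slovin)}")
```

Rounding up rather than to the nearest integer is a common convention, so that the requested precision is not undershot.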
Simple Random Sampling
It has a population with N items or objects, and sample with n items or objects. It can be done with replacement or without replacement.
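The difference between sampling with and without replacement can be illustrated with Python's random module. This is only a sketch under stated assumptions: the population of 20 numbered items, the sample size of 5, and the seed are all hypothetical choices made for illustration.

```python
import random

# Hypothetical population of N = 20 numbered items.
population = list(range(1, 21))
sample_size = 5

rng = random.Random(42)  # fixed seed only to make the sketch repeatable

# Without replacement: an item can be selected at most once.
without_replacement = rng.sample(population, k=sample_size)

# With replacement: the same item may be selected more than once.
with_replacement = rng.choices(population, k=sample_size)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```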
Cluster Sampling
In cluster sampling, a population is divided into separate groups known as clusters, and then simple random sampling of clusters (i.e. sampled clusters) is done.
Stratified Sampling
In stratified sampling, a population is divided into separate groups known as strata, and then simple random sampling of units/items from each stratum is done.
Graphs and Plots
Bar Graph
A bar graph represents different categories by bars on a graph.
Using Minitab
In order to develop Bar Graph, we can use the example (as noted above).
Reason of travelling            Percentage of people
Visiting family                 25
Spending time with friends      15
Work-related                    18
Personal issues                 10
Escaping the colder climate     6
Discovering a new culture       11
Leisure                         15
Enter the values in the columns. In this case, we have “Reason of travelling” in the first column and “Percentage of people” in the second column. Now go to “Graph” tab above and go to “Bar Chart” in the 4th section. A window “Bar Charts” appears. Under “Bars represent:”, select “A function of a variable” from the drop down menu. Under “One Y”, select “Simple”, and click “OK.” A window “Bar Chart: A function of a variable, One Y, Simple” appears. Under “Function”, choose the function of the graph variable. In this case, “Mean” is selected as a function. Click in the box below the “Graph variables:”, and move the variables having numerical data. In this case, C2 is moved to this box. Click in the box below the “Categorical variable:”, and move the categorical data. In this case, C1 is moved to this box by clicking on the “Select” button below. You can also work on “Labels”, “Chart Options”, etc. Now click “OK.”
A chart appears as follows:
[Figure: bar chart titled “Chart of Mean( Percentage of people )”. Y-axis: Mean of Percentage of people (0 to 25). X-axis: Reason of travelling, with one bar each for Discovering a new culture, Escaping the colder climate, Leisure, Personal issues, Spending time with friends, Visiting family, and Work-related.]
Pie Chart
Pie chart, also known as pie graph, is a statistical graphical chart that is presented in a circular form in which numerical proportions are divided into slices.
Using Minitab
In order to develop Pie Chart, we can use the example (as noted above).
Cost of construction of house             Supposed amount spent in USD   Percentages
Labor                                     60000                          20
Timber                                    18000                          6
Electrical works                          18000                          6
Supervision                               90000                          30
Steel                                     30000                          10
Bricks                                    30000                          10
Cement                                    27000                          9
Design and fee for Engineer/Architect     15000                          5
Other                                     12000                          4
Enter the values in the columns. In this case, we have “Cost of construction of house” (representing categorical variable) in the first column, “Supposed amount spent in USD” in the second column, and “Percentages” in the third column. Now go to “Graph” tab above and go to “Pie Chart” in the 4th section. A window “Pie Chart” appears. Select “Chart values from a table”, and in the box below “Categorical variable:”, move the categorical data. In this case, C1 is moved to this box by clicking on the “Select” button below. In the box below “Summary variables:”, move the variables having numerical data. In this case, C2 is moved to this box. You can also work on “Labels”, “Data Options”, etc. Now click “OK.” A chart appears as follows:
[Figure: “Pie Chart of Cost of construction of house”, with one slice per category: Labor, Timber, Electrical works, Supervision, Steel, Bricks, Cement, Design and fee for Engineer/Architect, and Other.]
In Minitab®, the user can move the cursor over a slice to see its percentage.
Scatter plot or Bubble chart
Scatter plot or scatter chart graphically displays the relationship between two variables (two sets of data) in the form of dots.
A scatter chart is called a bubble chart if the data points (dots) in the graph are replaced with circles (bubbles) of variable size and characteristics, for example, circles with only an outline and no color inside.
Different patterns of data in bubble charts
Bubble charts could be of different types depending on linearity, slope, and strength.
Using Minitab
In order to develop Scatter Plot, we can use the example (as noted above).
Year    Number of tourists
2011    1167000
2012    966000
2013    565212
2014    530000
2015    563400
2016    965498
2017    1750000
Enter the values in the columns. In this case, we have “Year” in the first column (i.e. C1), and “Number of tourists” in the second column (i.e. C2). Now go to “Graph” tab above and go to “Scatterplot” in the 1st section. A window “Scatterplots” appears. Select “Simple” and click “OK.” Now, click on the box below “X variables” in front of “1”; click on “C1 Year”, and press “Select.” Similarly, click on the box below “Y variables” in front of “1”; click on “C2 Number of tourists”, and press “Select.” You can also work on “Labels”, “Data Options”, etc. Now click “OK.” A simple chart appears as follows:
[Figure: “Scatterplot of Number of tourists vs Year”. Y-axis: Number of tourists (500000 to 1750000). X-axis: Year (2011 to 2017).]
In order to draw Bubble chart, go to “Graph” tab above and go to “Bubble plot” in the 1st section. A window “Bubble plots” appears. Select “Simple” and click “OK.” Now, click on the box below “Y variable:”, click on “C2 Number of tourists”, and press “Select.” Similarly, click on the box below “X variable:”, click on “C1 Year”, and press “Select.” Now, click on the box below “Bubble size variable:”, click on “C2 Number of tourists”, and press “Select.” You can also work on “Labels”, “Data Options”, etc. Now click “OK.” A simple chart appears as follows:
Dot plot
Dot plot displays graphical data in the form of dots.
Using Minitab
In order to develop Dot Plot, we can use the example (as noted above).
House number    Number of siblings
1               4
2               3
3               1
4               0
5               5
6               2
7               2
We have to slightly modify this as follows:
The no. of siblings in houses: 1, 1, 1, 1, 2, 2, 2, 3, 5, 5, 5, 5, 5, 6, 6, 7, 7
In this table, the house numbers are mentioned according to the number of siblings. For example, there are four siblings in the first house, so the house number, i.e. 1 is mentioned four times; there are three siblings in the second house, so the house number, i.e. 2 is mentioned three times; there is no sibling in the fourth house, so the house number, i.e. 4 is not mentioned, and there are five siblings in the fifth house, so the house number, i.e. 5 is mentioned five times. Enter the values in the column. In this case, we have “House number” in the first column. Now go to “Graph” tab above and go to “Dotplot” in the 2nd section. A window “Dotplots” appears. Select “Simple” under “One Y,” and click “OK.” (Though other options are also available) Now, click on “C1 House number”, and press “Select” to move it to the box under “Graph variables:”. You can also work on “Labels”, “Data Options”, etc. Now click “OK.” A simple chart appears as follows:
Matrix plot
Matrix plot is used to study the association of several pairs of variables at a time.
Suppose we have data with four variables (A, B, C, and D). A matrix plot is a graphical representation of the different variables in the form of dots, with one panel for each pair of variables, such as A and B, or B and C. The position of a dot refers to its value on the x-axis and y-axis of the two specific variables.
Using Minitab
In order to develop Matrix Plot, we can use the same example (as noted above).
A    B      C      D
1    3.6    4.9    4.1
1    5.0    3.5    4.1
1    4.7    4.1    4.4
1    6.1    7.3    5.3
2    7.6    5.2    2.4
2    6.9    4.7    4.0
2    5.6    5.0    5.4
2    5.0    5.7    5.8
Enter the values in the columns. In this case, we have “A” in the first column, “B” in the second column, “C” in the third column, and “D” in the fourth column. Now go to “Graph” tab above and go to “Matrix Plot”. A window “Matrix Plots” appears. Select “Simple” under “Matrix of plots,” and click “OK.” (Though other options are also available) Now, select the columns “A”, “B”, “C”, and “D” and press “Select” to move them to the box under “Graph variables:”. You can also work on “Labels”, “Data Options”, etc. Now click “OK.” A simple chart appears as follows:
Pareto Chart
Pareto chart, also known as Pareto distribution diagram, is used for improving the areas where improvement is most likely to be advantageous or beneficial, with the help of the identification of commonly experienced defects or causes of customer complaints.
It is based on the Pareto principle, which shows that about 80% of the output is associated with 20% of the input.
Example
Suppose a hotel gets the following complaints along with their counts:
[Figure: Pareto chart of the hotel complaints and their counts.]
This graph shows that improvements in “Difficult parking”, “Cleaning issues”, and “Too noisy” could solve about 80% of the complaints.
Using Minitab
In order to develop Pareto Chart, we can use the same example (as noted above).
Complaint                        Counts
Difficult parking                175
Bad behavior of receptionists    25
Poor lighting                    11
Room service is not good         21
Cleaning issues                  145
Overpriced                       12
Too noisy                        32
Confusing layout                 16
Enter the values in the columns. In this case, we have “Complaint” in the first column, and “Counts” in the second column. Now go to “Stat” tab above, go to “Quality Tools”, and go to “Pareto Chart”. A window “Pareto Chart” appears. Move “Complaint” to the box with “Defects or attribute data in:” by selecting it and pressing the “Select” button, and move “Counts” to the box with “Frequencies in:” by selecting it and pressing the “Select” button. You can also work on labels and title by pressing the “Options” button. Now click “OK.” A simple chart appears as follows:
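The cumulative percentages that a Pareto chart displays can be cross-checked by sorting the complaints by count and accumulating them. This Python sketch is only illustrative; the function name pareto_table is an assumption, and no chart is drawn.

```python
# Complaint counts from the hotel example.
complaints = {
    "Difficult parking": 175,
    "Bad behavior of receptionists": 25,
    "Poor lighting": 11,
    "Room service is not good": 21,
    "Cleaning issues": 145,
    "Overpriced": 12,
    "Too noisy": 32,
    "Confusing layout": 16,
}


def pareto_table(counts):
    """Return rows of (name, count, cumulative percent), largest count first."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows = []
    cumulative = 0
    for name, count in ordered:
        cumulative += count
        rows.append((name, count, 100 * cumulative / total))
    return rows


rows = pareto_table(complaints)
for name, count, cum_percent in rows:
    print(f"{name:<32} {count:>4} {cum_percent:>6.1f}%")
```

The third row shows that the three largest complaints together account for about 80% of all complaints.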
Histogram
A histogram is also a graph that groups numbers into ranges.
Using Minitab
In order to develop Histogram, we can use the example (as noted above).
House number    Number of siblings
1               4
2               3
3               1
4               0
5               5
6               2
7               2
Enter the values in the columns. In this case, we have “House number” in the first column, and “Number of siblings” in the second column. Now go to “Graph” tab above and go to “Histogram” in the second section. A window “Histograms” appears. Select “Simple” from the appeared window, and click “OK.” In the box of “Graph variables:” move the column having numerical data. In this case, “C2 Number of siblings” has the numerical data, so it is moved to the box by clicking on the button “Select.” You can also work on “Labels”, “Data Options”, etc. Click “OK” again in the window of “Histogram: Simple.” A simple graph appears as shown below:
[Figure: “Histogram of Number of siblings”. Y-axis: Frequency (0.0 to 2.0). X-axis: Number of siblings (0 to 5).]
Stem and Leaf plot
In a stem and leaf plot, the data values are split into the first digit or digits showing “stem” and the last digit showing “leaf”.
Suppose we have the following numbers:
15,14,17,12,23,24,29,35,47,41,59,62,67,69
For example, 15 splits into 1 (stem) and 5 (leaf), and 69 splits into 6 (stem) and 9 (leaf).
Stem and leaf plot is as follows:
Stem    Leaf
1       2457
2       349
3       5
4       17
5       9
6       279
In this way, large amounts of data can quickly be organized.
Using Minitab
In order to develop Stem and Leaf plot, we can use the example (as noted above): 15,14,17,12,23,24,29,35,47,41,59,62,67,69 Enter the values in the first column, i.e. C1 in the rows from 1 to 14.
Now go to “Graph” tab above and go to “Stem-and-Leaf” in the 2nd section. A window “Stem-and-Leaf” appears. In the box of “Graph variables:” move the column C1 by clicking on the button “Select.” Click “OK”. Results appear in the screen above the spreadsheet.
Stem-and-leaf of C1 N = 14
4    1    2457
7    2    349
7    3    5
6    4    17
4    5    9
3    6    279
Leaf Unit = 1
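The splitting of each value into a stem and a leaf can be sketched in a few lines of Python. This is an illustrative sketch only, assuming two-digit values with a leaf unit of 1; the function name stem_and_leaf is an assumption, and the cumulative counts that Minitab prints in its first column are not reproduced.

```python
from collections import defaultdict

data = [15, 14, 17, 12, 23, 24, 29, 35, 47, 41, 59, 62, 67, 69]


def stem_and_leaf(values):
    """Group the last digit (leaf) of each value under its leading digits (stem)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)


plot = stem_and_leaf(data)
for stem in sorted(plot):
    leaves = "".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```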
Box plot
The box plot (also known as box and whisker diagram) is a graphical way of representing the distribution of data on the basis of five number summary: (1) minimum, (2) first quartile, (3) median, (4) third quartile, and (5) maximum.
Using Minitab
In order to generate boxplot, we can use the following example:
                               School 1    School 2    School 3
Number of desks in class 1     54          42          56
Number of desks in class 2     60          48          68
Number of desks in class 3     65          53          78
Number of desks in class 4     66          54          80
Number of desks in class 5     67          55          82
Number of desks in class 6     69          57          86
Number of desks in class 7     70          58          88
Number of desks in class 8     72          60          92
Number of desks in class 9     73          61          94
Number of desks in class 10    75          63          98
Enter the values in the first four columns, i.e. from C1 to C4. Now go to “Graph” tab above and go to “Boxplot” in the 3rd section. A window “Boxplots” appears. Select “Simple” under “Multiple Y’s” and click “OK.” (Though other options are also available) Now, click on “C2 School 1”, and press “Select” to move it to the box under “Graph variables:”. Do the same for “C3 School 2” and “C4 School 3”. You can also work on “Labels”, “Data Options”, etc. For example, click “Data View”. In the window “Boxplot: Data View”, select “Interquartile range box”, “Outlier symbols”, and “Median symbol.” Now click “OK.” Again click “OK.” A simple chart appears as follows:
[Figure: “Boxplot of School 1, School 2, School 3”. Y-axis: Data (40 to 100). X-axis: School 1, School 2, School 3.]
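The five number summary behind each box can be computed with Python's statistics module. This is only a sketch: it assumes that the “exclusive” method of statistics.quantiles, which interpolates at position p(n + 1), is an acceptable stand-in for the quartile definition used by Minitab, and the function name five_number_summary is chosen here for illustration.

```python
import statistics

# Number of desks in classes 1 to 10 for each school.
desks = {
    "School 1": [54, 60, 65, 66, 67, 69, 70, 72, 73, 75],
    "School 2": [42, 48, 53, 54, 55, 57, 58, 60, 61, 63],
    "School 3": [56, 68, 78, 80, 82, 86, 88, 92, 94, 98],
}


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return (min(values), q1, q2, q3, max(values))


summaries = {school: five_number_summary(v) for school, v in desks.items()}
for school, summary in summaries.items():
    print(school, summary)
```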
Outlier
A value that lies at an abnormal distance from (outside of the) other values.
An outlier shows some sort of problem in data, such as:
A case (associated with that value at abnormal distance) that is not according to the model under study. For example, a supercomputer in a lab of personal computers would be considered as an outlier.
An error that occurred during measurement.
Box plots are commonly used to find or display outliers.
[Figure: example charts containing outliers, including a chart of Time worked (hours) for each day from Monday to Sunday.]
Using Minitab:
Suppose we have the following values: 342.962, 346.033, 345.917, 344.717, 343.048, 345.855, 344.717, 395.548, 345.464, 345.703
Put these values in the first column, i.e. C1. Now go to “Stat” tab above, go to “Basic Statistics”, and go to “Outlier Test” in the 6th section. Move the column, i.e. C1 to the box under “Variables:” by selecting it and pressing the button “Select”. Now click “OK.” The following results and graph appear.
Grubbs' Test
Variable    N     Mean      StDev    Min       Max       G       P
C1          10    350.00    16.04    342.96    395.55    2.84    0.000
Outlier
Variable    Row    Outlier
C1          8      395.548
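The G statistic reported by Grubbs' test is the largest absolute deviation from the mean divided by the sample standard deviation. The following Python sketch recomputes G for the ten values as an illustration; the function name grubbs_statistic is an assumption, and the P-value is not computed here because it requires the t distribution.

```python
import statistics

values = [342.962, 346.033, 345.917, 344.717, 343.048,
          345.855, 344.717, 395.548, 345.464, 345.703]


def grubbs_statistic(data):
    """Return (G, suspect value), where G = max |x - mean| / sample st. dev."""
    mean = statistics.mean(data)
    st_dev = statistics.stdev(data)  # sample standard deviation (n - 1)
    suspect = max(data, key=lambda x: abs(x - mean))
    return abs(suspect - mean) / st_dev, suspect


g, suspect = grubbs_statistic(values)
print(f"mean = {statistics.mean(values):.2f}")
print(f"StDev = {statistics.stdev(values):.2f}")
print(f"G = {g:.2f} for the value {suspect}")
```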
These results show that 395.548 is an outlier. Another way is to use the box plot. Now go to “Graph” tab above and go to “Boxplot” in the 3rd section. A window “Boxplots” appears. Select “Simple” under “One Y” and click “OK.” (Though other options are also available) Move C1 (column containing all the values) to the box under “Graph variables:”. You can also work on “Labels”, “Data Options”, etc. For example, click “Data View”. In the window “Boxplot: Data View”, select “Interquartile range box”, and “Outlier symbols.” Now click “OK.” Again click “OK.” A simple chart appears as follows:
This graph shows that one value, which is close to 395, is far away from other values, i.e. an outlier.
Quantile
This word is from “quantity”. Usually, a quantile is obtained by dividing a sample into adjacent and equal-sized subgroups, or it is obtained by dividing a probability distribution into areas or intervals of equal probability.
Example for Quantiles
Suppose, we have ten numbers: 3,2,5,6,7,8,1,10,9,4 Arrangement of these numbers is as follows:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
[Figure: the ordered numbers marked with quartiles (lower quartile Q1, median Q2, upper quartile Q3), deciles (1st to 9th), and percentiles (for example the 5th, 45th, and 95th, alongside the 10th, 20th, …, 90th).]
There are three quartiles, i.e. Q1, Q2, and Q3; nine deciles, i.e. D1, D2, D3, …, D9; and ninety-nine percentiles, i.e. P1, P2, …, P99. Each shows a specific percentage of data, e.g. 75% of data falls below Q3, 30% of data falls below D3, 80% of data falls below P80, and so on.
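The quartiles and deciles of the ten example numbers can be computed with Python's statistics.quantiles. This is only a sketch: it assumes the “exclusive” method, which interpolates at position p(n + 1) in the sorted data, and other software may use a different interpolation rule and give slightly different values.

```python
import statistics

numbers = [3, 2, 5, 6, 7, 8, 1, 10, 9, 4]

# n=4 returns the three cut points Q1, Q2, Q3; n=10 returns D1 to D9.
quartiles = statistics.quantiles(numbers, n=4, method="exclusive")
deciles = statistics.quantiles(numbers, n=10, method="exclusive")

print("Q1, Q2, Q3 =", quartiles)
print("D1 =", deciles[0])
print("D7 =", deciles[6])
```

The data do not need to be sorted first, because statistics.quantiles sorts them internally.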
Using Minitab
Quartiles
In order to work on quartiles, we can use the example noted above, i.e. 3,2,5,6,7,8,1,10,9,4. Enter the values in the first column, i.e. from rows 1 to 10 in C1. Go to “Stat” tab above, go to “Basic Statistics”, and click “Display Descriptive Statistics”. In the window “Display Descriptive Statistics”, move column 1, i.e. C1 in this case, from the left box to the “Variables:” in the right box by clicking on C1 and clicking the “Select” button below. Click “Statistics”. A window “Display Descriptive Statistics: Statistics” appears. Check “First quartile”, “Median”, “Third quartile”, and any other descriptive statistics you want. Click “OK.” Again click “OK.” Results appear in the screen above the spreadsheet. In this case, the results are Q1 = 2.750, median = 5.5, and Q3 = 8.250.
Deciles
In order to work on deciles, go to “Calc” tab above, and click “Calculator”. In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C7. In the box below “Expression:” enter “PERCENTILE(C1,0.10)”. Click “OK.” The result will appear in the spreadsheet in the column C7. In this case, the result (the 1st decile) is 1.1. For a specific decile, for example the 7th decile, enter “PERCENTILE(C1,0.7)”. Click “OK.” The result will appear in the spreadsheet in the column C7. In this case, the result is 7.7.
Percentiles
In order to work on percentiles, go to “Calc” tab above, and click “Calculator”.
In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C7. In the box below “Expression:” enter “PERCENTILE(C1,0.01)”. Click “OK.” The result will appear in the spreadsheet in the column C7. In this case, the result (the 1st percentile) is 1. For a specific percentile, for example the 70th percentile, enter “PERCENTILE(C1,0.7)”. Click “OK.” The result will appear in the spreadsheet in the column C7. In this case, the result is 7.7.
Standard deviation and Variance
Standard deviation measures the variability of the observations, i.e. how spread out the numbers are about the mean. It is represented by the Greek letter sigma (σ). A low standard deviation shows that the values are close to the mean. σ is obtained by taking the square root of the variance, which is represented by σ². So, we have to calculate the variance to calculate the standard deviation. Variance is the average of the squared deviations about the mean. It is also important to note that standard deviation and variance are of two types, i.e. population standard deviation and population variance, and sample standard deviation and sample variance, respectively. The population variance divides the sum of squared deviations by N, whereas the sample variance divides it by n − 1.
Using Minitab
In order to calculate sample variance and standard deviation, we can use the following example:
Number of flowering plant   Number of flowers on the plant
1                           5
2                           7
3                           6
4                           4
5                           7
6                           9
7                           11
Enter the data in the columns. In this case, we enter the data in the first two columns, i.e. “Number of flowering plant” in the first column and “Number of flowers on the plant” in the second column. Now go to “Stat” tab above, go to “Basic Statistics”, and click “Display Descriptive Statistics”. Move the column for which variance and standard deviation have to be calculated, in this case “C2 Number of flowers on the plant”, to the box of “Variables:” and click “Statistics”. Select “Standard deviation” and “Variance” from the window that appears. You can also select several other options for Descriptive Statistics. Click “OK.” Click “OK” again on the Display Descriptive Statistics (window). This shows the results of StDev (Standard Deviation), i.e. 2.380, and Variance, i.e. 5.667. However, it is important to note that these results are the sample variance and sample standard deviation.
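The difference between the sample and population versions can be seen with Python's standard `statistics` module. This is a minimal sketch using the flower counts from the example; the variable names are illustrative only.

```python
import statistics

flowers = [5, 7, 6, 4, 7, 9, 11]   # number of flowers on plants 1 to 7

# Sample versions divide the sum of squared deviations by n - 1
sample_variance = statistics.variance(flowers)
sample_sd = statistics.stdev(flowers)

# Population versions divide the sum of squared deviations by n
population_variance = statistics.pvariance(flowers)
population_sd = statistics.pstdev(flowers)

print("sample:    ", round(sample_variance, 3), round(sample_sd, 3))
print("population:", round(population_variance, 3), round(population_sd, 3))
```

The sample values agree with the Minitab output above (variance about 5.667, standard deviation about 2.380), while the population values are slightly smaller.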
Range Rule of Thumb
It helps in rough estimation of the standard deviation from data obtained from known samples or populations: the standard deviation is roughly one quarter of the range, i.e. (maximum − minimum)/4.
Probability
Probability is a measure of how likely it is that some event will occur. Some of the basic rules of probability are as follows:
Probability Rule # 1
For any event A, the probability P(A) satisfies 0 ≤ P(A) ≤ 1. Probability can be as low as zero or as high as 1. Here P(A) = Number of favorable outcomes / Total number of outcomes = m/n.
Probability Rule # 2 (Probability model)
The probability model for a sample space S satisfies P(S) = 1. The sum of the probabilities of all possible outcomes is 1.
Probability Rule # 3 (Complement Rule)
For an event A, P(not A) = 1 – P(A). The probability of the happening of an event that is the complement of event A (i.e. not A) is equal to 1 minus the probability of happening of event A.
Probability Rule # 4 (The Addition Rule for Disjoint Events)
If A and B are two disjoint or mutually exclusive events, P(A or B) = P(A) + P(B).
Probability Rule # 5 (The General Addition Rule)
P(A or B) = P(A∪B) = P(A) + P(B) – P(A and B). The probability that event A or event B will occur is equal to the probability that event A occurs plus the probability that event B occurs minus the probability of event A and event B occurring simultaneously (which is also written as A∩B, i.e. P(A and B) = P(A∩B)). This rule applies when A and B are not disjoint.
Probability Rule # 6 (The Multiplication Rule for Independent Events)
If A and B are two independent events, then P(A and B) = P(A∩B) = P(A) × P(B).
Probability Rule # 7 (Conditional Probability Rule)
The conditional probability of event B, given event A, is P(B | A) = P(A and B) / P(A). In P(B | A), B is the event that may or may not occur, the vertical line means “given”, and A is the given or already occurred event.
Probability Rule # 8 (General Multiplication Rule)
For any two events A and B, P(A and B) = P(A∩B) = P(A) × P(B | A)
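The rules above can be checked on a small example. The following sketch assumes one roll of a fair six-sided die, with two illustrative events that are not part of the original text: A (the roll is even) and B (the roll is greater than 3). Exact fractions are used so that the rules can be compared without rounding.

```python
from fractions import Fraction

sample_space = set(range(1, 7))   # outcomes of one roll of a fair die
A = {2, 4, 6}                     # illustrative event A: the roll is even
B = {4, 5, 6}                     # illustrative event B: the roll is greater than 3


def prob(event):
    """Number of favorable outcomes divided by total number of outcomes (m/n)."""
    return Fraction(len(event), len(sample_space))


p_sample_space = prob(sample_space)               # Rule 2: P(S) = 1
p_not_a = 1 - prob(A)                             # Rule 3: complement rule
p_a_or_b = prob(A) + prob(B) - prob(A & B)        # Rule 5: general addition rule
p_b_given_a = prob(A & B) / prob(A)               # Rule 7: conditional probability
p_a_and_b = prob(A) * p_b_given_a                 # Rule 8: general multiplication rule

print("P(not A) =", p_not_a)
print("P(A or B) =", p_a_or_b, "and by direct counting:", prob(A | B))
print("P(B | A) =", p_b_given_a)
print("P(A and B) =", p_a_and_b, "and by direct counting:", prob(A & B))
```

In this example A and B are not independent, so Rule 6 does not apply: P(A) × P(B) is 1/4, whereas P(A and B) is 1/3.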
Bayesian statistics
Bayesian statistics is named after Thomas Bayes (1702-1761), an English priest. Bayes arrived at a theorem with the help of which the probability of a hypothesis could be established based on the observation of its consequences.
An easy example for Bayes' Theorem
What is the probability that we see a restaurant when we see burgers? P(Restaurant|Burger) = ? (probability of restaurant given burgers)
Suppose about 40% of restaurants serve burgers: P(Burger|Restaurant) = 0.4 (probability of burger given that there is a restaurant).
But burgers are commonly found (about 80% of all markets contain burgers): P(Burger) = 0.8 (probability of burgers).
And restaurants are fairly common (suppose 70% of all markets have a restaurant): P(Restaurant) = 0.7 (probability of restaurant).
P(Restaurant|Burger) = [P(Restaurant) × P(Burger|Restaurant)] / P(Burger)
P(Restaurant|Burger) = (0.7 × 0.4) / 0.8 = 0.35
So, there is a 35% chance of finding restaurants when there are burgers.
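Another example: Suppose seeds are common (15%) but plants are scarce (5%) due to small area of land, and about 85% of all seeds produce plants. Then P(seed|plant) = [P(seed) × P(plant|seed)] / P(plant) = [15% × 85%] / 5% = 255%. A probability cannot exceed 100%, so this result shows that the assumed figures are inconsistent: P(plant) cannot be smaller than P(seed) × P(plant|seed) = 12.75%. If, for example, P(plant) were 15% instead of 5%, then P(seed|plant) = 12.75% / 15% = 85%, i.e. the probability of seed when there is plant would be 85%.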
Using Minitab
The “Calculator” function could be used for Bayesian statistics, i.e. enter the formula associated with Bayesian statistics in the box below “Expression:” in the “Calculator”.
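The same formula can also be written as a small function. This is a minimal sketch with the restaurant and burger figures from the example; the function name `bayes_posterior` and its parameter names are illustrative, and the validity check is an added safeguard rather than part of the theorem.

```python
def bayes_posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(H | E) = P(H) * P(E | H) / P(E)."""
    if not 0 < evidence <= 1:
        raise ValueError("P(E) must be greater than 0 and at most 1")
    posterior = prior * likelihood / evidence
    # A probability cannot exceed 1; if it does, the three inputs contradict each other
    if posterior > 1 + 1e-12:
        raise ValueError("inconsistent inputs: P(E) is smaller than P(H) * P(E | H)")
    return posterior


# P(Restaurant) = 0.7, P(Burger | Restaurant) = 0.4, P(Burger) = 0.8
p_restaurant_given_burger = bayes_posterior(prior=0.7, likelihood=0.4, evidence=0.8)
print(round(p_restaurant_given_burger, 2))
```

The check on the result is useful because the theorem only gives a meaningful answer when P(E) is at least as large as P(H) × P(E | H).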
Reliability Coefficient
Reliability refers to consistency of a measure, or trust or dependence on a measure. Reliability coefficient refers to the degree of consistency or accuracy of a measuring instrument or a test. Usually, reliability coefficient is determined by the correlation of two different sets of measures. Different ways are used to assess reliability coefficients, for example, inter-rater reliability, which is usually measured by Cohen’s kappa coefficient; internal consistency reliability, which is usually measured by Cronbach's alpha; and test-retest reliability, which can be measured by Pearson's correlation coefficient.
Cohen's Kappa Coefficient
It is used to determine inter-rater or interobserver agreement for qualitative/categorical items.
Fleiss’ Kappa Coefficient
It is used to determine inter-rater or inter-observer agreement for Likert scale data, nominal scale data, or ordinal scale data. It also considers the chance agreement.
It is determined when a trial on each sample is rated by three or more different observers/raters.
Range: 0 to 1, where 1 shows perfect agreement and 0 shows agreement similar to chance. A coefficient > 0.75 is considered good.
The acceptable level of agreement depends on the field of research.
Using Minitab
In order to work on Minitab, we can use the same example as noted above.
           B: Yes   B: No
A: Yes     40       10
A: No      20       30
We set the codes for observers, A and B (assessors or appraisers) and children, 1 to 100 (samples or attributes). So, in this example, observer A has the code 1, and observer B has the code 2. Children have the codes from 1 to 100. Response or Assessment is noted as “Yes” or “No”. In the first column, i.e. C1, enter “Assessor” and enter the codes 1 in the first 100 rows, and 2 in the rows from 101 to 200. In order to repeat the number 1 in the first 100 rows, enter 1 in the first cell, take the pointer or cursor to the lower right corner of the cell, where the shape of the pointer changes to Autofill handle, now click and drag the Autofill handle down the column to the 100th row, and after that, release the mouse button. Do the same thing after writing the number 2 in the 101st row and dragging it to the 200th row. In the second column, i.e. C2, enter “Children” and enter the numbers from 1 to 100 in the rows from 1 to 100, and again enter the numbers from 1 to 100 in the rows from 101 to 200. In order to add the series of numbers from 1 to 100 in the first 100 rows, enter 1 in the first cell, take the pointer or cursor to the lower right corner of the cell, where the shape of the pointer changes to Autofill handle, now click and press “Ctrl” key on keyboard and drag the Autofill handle down the column to the 100th row, and after that, release the mouse button. Do the same thing after writing the number 1 in the 101st row and dragging it to the 200th row. Press “home” button on keyboard. It will take you back to the top. In the third column, i.e. C3, enter “Response”, and enter “Yes” in the first 40 rows (i.e. 40 rows), “No” in the rows from 41 to 70 (i.e. 30 rows), “Yes” in the rows from 71 to 80 (i.e. 10 rows), and
“No” in the rows from 81 to 100 (i.e. 20 rows). These responses are for observer A. Now, enter “Yes” in the rows from 101 to 140 (i.e. 40 rows), “No” in the rows from 141 to 180 (i.e. 30+10 rows), and “Yes” in the rows from 181 to 200 (i.e. 20 rows). These responses are for observer B. You can enter “Yes” or “No” in the cells, and drag the Autofill handle in the lower right corner of the cell to the required number of cells (rows) in the column. Now go to “Stat” tab above, go to “Quality Tools”, and click “Attribute Agreement Analysis” in the 5th section. In the window that appears (i.e. Attribute Agreement Analysis), move “C3 Response” to the box with “Attribute Column:”, move “C2 Children” to the box with “Samples:”, and move “C1 Assessor” to the box with “Appraisers:” by clicking the “Select” button below. Now click the button “Options”, select “Calculate Cohen’s kappa if appropriate” by checking the box, and click “OK”, and again click “OK” on the “Attribute Agreement Analysis” (window). This gives the results for Fleiss’ Kappa Statistics (0.39) and Cohen’s Kappa Statistics (0.4) as follows:
Fleiss’ Kappa Statistics
Response   Kappa      SE Kappa    Z         P(vs > 0)
No         0.393939   0.1         3.93939   0.0000
Yes        0.393939   0.1         3.93939   0.0000
Cohen’s Kappa Statistics
Response   Kappa      SE Kappa    Z         P(vs > 0)
No         0.4        0.0979796   4.08248   0.0000
Yes        0.4        0.0979796   4.08248   0.0000
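Both kappa values can be reproduced from the 2 × 2 agreement table of observers A and B. The sketch below assumes the usual definitions: kappa = (observed agreement − chance agreement) / (1 − chance agreement), where Cohen's kappa takes the chance agreement from each observer's own proportions and Fleiss' kappa takes it from the pooled proportions of both observers. Variable names are illustrative.

```python
# Agreement table from the example: keys are (response of A, response of B)
table = {("Yes", "Yes"): 40, ("Yes", "No"): 10, ("No", "Yes"): 20, ("No", "No"): 30}
categories = ["Yes", "No"]
n = sum(table.values())   # 100 children

# Proportion of children on which both observers gave the same response
observed = sum(table[(c, c)] for c in categories) / n

# Share of each response for observer A (rows) and observer B (columns)
a_share = {c: sum(table[(c, other)] for other in categories) / n for c in categories}
b_share = {c: sum(table[(other, c)] for other in categories) / n for c in categories}

# Cohen: chance agreement from each observer's own proportions
chance_cohen = sum(a_share[c] * b_share[c] for c in categories)
cohen_kappa = (observed - chance_cohen) / (1 - chance_cohen)

# Fleiss (here with two raters): chance agreement from pooled proportions
pooled = {c: (a_share[c] + b_share[c]) / 2 for c in categories}
chance_fleiss = sum(pooled[c] ** 2 for c in categories)
fleiss_kappa = (observed - chance_fleiss) / (1 - chance_fleiss)

print("Cohen's kappa:", round(cohen_kappa, 4))
print("Fleiss' kappa:", round(fleiss_kappa, 6))
```

The two coefficients differ slightly because the observers do not say “Yes” equally often (50% for A, 60% for B), so the two ways of estimating chance agreement give different values.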
Cronbach's alpha
It is used to measure the internal consistency reliability.
Using Minitab
Suppose we have 7 questions (or 7 items) in a questionnaire. A total of 18 participants completed the questionnaire. Each question is measured utilizing a 5-point Likert item ranging from “strongly agree” (5) to “strongly disagree” (1). Suppose we got the following results:
               Q1   Q2   Q3   Q4   Q5   Q6   Q7
Individual1    5    1    3    2    1    4    2
Individual2    1    5    4    3    5    5    4
Individual3    4    2    1    3    5    3    4
Individual4    4    4    3    5    3    4    5
Individual5    2    3    3    5    2    5    4
Individual6    5    1    1    2    4    5    1
Individual7    2    1    1    3    1    5    2
Individual8    3    2    2    5    2    5    4
Individual9    2    1    5    3    1    3    4
Individual10   4    4    1    4    2    5    4
Individual11   4    3    4    3    4    4    5
Individual12   5    1    1    3    3    5    3
Individual13   5    4    3    2    5    3    4
Individual14   2    4    4    2    4    4    5
Individual15   2    4    3    2    2    3    2
Individual16   4    5    2    3    4    2    3
Individual17   3    4    4    1    4    2    4
Individual18   2    4    3    5    1    3    5
Put these numerical values below Q1 to Q7 in Minitab in the columns C1 to C7. Now go to “Stat” tab above, go to “Multivariate”, and click “Item Analysis” in the 2nd section. In the window that appears, move all the items (C1 to C7) to the box under “Variables:” and the items “Q1-Q7” appear in the box. Click the button “Results” and select “Cronbach’s Alpha”. Some other items may also be selected. Click “OK.” Click the button “Graphs” and deselect “Matrix plot of data with smoother”. Click “OK”, and click “OK” again in the window of “Item Analysis.” This gives the results for Cronbach’s Alpha as follows:
Cronbach’s Alpha
Alpha
0.09139
Interpretation of result: This value of Cronbach’s Alpha (0.091) is unacceptable as it is lower than 0.5, thereby showing a low level of internal consistency for our scale.
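The alpha value can be reproduced from the questionnaire data. This is a minimal sketch that assumes the standard formula alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), with sample variances and k items; the function name `cronbach_alpha` is illustrative.

```python
import statistics

# Rows are Individual1 to Individual18; columns are the items Q1 to Q7
responses = [
    [5, 1, 3, 2, 1, 4, 2], [1, 5, 4, 3, 5, 5, 4], [4, 2, 1, 3, 5, 3, 4],
    [4, 4, 3, 5, 3, 4, 5], [2, 3, 3, 5, 2, 5, 4], [5, 1, 1, 2, 4, 5, 1],
    [2, 1, 1, 3, 1, 5, 2], [3, 2, 2, 5, 2, 5, 4], [2, 1, 5, 3, 1, 3, 4],
    [4, 4, 1, 4, 2, 5, 4], [4, 3, 4, 3, 4, 4, 5], [5, 1, 1, 3, 3, 5, 3],
    [5, 4, 3, 2, 5, 3, 4], [2, 4, 4, 2, 4, 4, 5], [2, 4, 3, 2, 2, 3, 2],
    [4, 5, 2, 3, 4, 2, 3], [3, 4, 4, 1, 4, 2, 4], [2, 4, 3, 5, 1, 3, 5],
]


def cronbach_alpha(rows):
    """Cronbach's alpha from a list of respondents, each a list of item scores."""
    k = len(rows[0])                                           # number of items
    item_variances = [statistics.variance(item) for item in zip(*rows)]
    total_variance = statistics.variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


alpha = cronbach_alpha(responses)
print(round(alpha, 5))
```

Alpha is low here because the sum of the seven item variances is almost as large as the variance of the total scores, i.e. the items hardly vary together.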
Coefficient of variation
It is a measure of dispersion or relative variability.
Using Minitab
In order to work on Minitab, we can use the following example:
              Weight loss program X   Weight loss program Y
Individual1   1                       1
Individual2   2                       1
Individual3   1                       1
Individual4   6                       2
Individual5   7                       8
Individual6   4                       6
Individual7   8                       2
Individual8   7                       11
Individual9   9                       22
Put these values of “Weight loss program X” and “Weight loss program Y” in the columns C1 and C2, respectively. Now go to “Stat” tab above, go to “Basic Statistics”, and click “Display Descriptive Statistics” in the 1st section. In the window “Display Descriptive Statistics”, move the columns C1 and C2 to the box under “Variables:” by selecting them and clicking the button “Select.” Click “Statistics”. A window “Display Descriptive Statistics: Statistics” appears. Check “Coefficient of variation”, and any other descriptive statistics you want. Click “OK.” Again click “OK” in the “Display Descriptive Statistics” (window). Results appear in the screen above the spreadsheet, and the results for the “Mean”, “Standard deviation” (StDev), and “Coefficient of variation” (CoefVar) are as follows:
Variable                Mean   StDev   CoefVar
Weight loss program X   5.00   3.08    61.64
Weight loss program Y   6.00   7.00    116.67
Interpretation of results: These results show that the program Y has more variability relative to its mean as compared to program X.
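The coefficient of variation is the standard deviation expressed as a percentage of the mean. A minimal sketch with the data of the two programs, assuming the sample standard deviation is used (the function name is illustrative):

```python
import statistics

program_x = [1, 2, 1, 6, 7, 4, 8, 7, 9]
program_y = [1, 1, 1, 2, 8, 6, 2, 11, 22]


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


cv_x = coefficient_of_variation(program_x)
cv_y = coefficient_of_variation(program_y)
print(round(cv_x, 2), round(cv_y, 2))
```

Because the result is a percentage of the mean, it allows the variability of data sets with different means to be compared.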
Chebyshev's Theorem
This theorem states that for any set of observations or numbers, the minimum proportion of the values that lie within k standard deviations of the mean is 1 − 1/k², where k is a positive number greater than 1. This theorem applies to all types of distributions of data.
Using Minitab
Suppose k=2 as noted in the above example. Go to “Calc” tab above, and click “Calculator”. In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C5. In the box below “Expression:” enter “1-(1/2^2)” and click “OK.” The result will appear in the spreadsheet. In this case, the result is 0.75 (or 75%).
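The same calculation can be repeated for other values of k. A minimal sketch (the function name is illustrative):

```python
def chebyshev_minimum_proportion(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


bounds = {k: chebyshev_minimum_proportion(k) for k in (1.5, 2, 3)}
for k, proportion in bounds.items():
    print(f"k = {k}: at least {proportion:.1%} of the values")
```

For k = 2 this gives 0.75, i.e. at least 75% of the values lie within two standard deviations of the mean, whatever the distribution of the data.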
Factorial
The factorial of a positive integer n, written n!, is the product of all the positive integers less than or equal to n.
Using Minitab
Take the example as noted above, i.e. 4!. Go to “Calc” tab above, and click “Calculator”.
In the window “Calculator”, in the “Store result in variable:” in the right box enter a column, such as C5. In the box below “Expression:” enter “FACTORIAL(4)” and click “OK.” The result will appear in the spreadsheet. In this case, the result is 24.
Distribution, and Standardization
Suppose there is a place known as “Virana” where beneficial viruses are living and fighting against some harmful bacteria. Those harmful bacteria are assigned a “value-of-danger” from 1 to 9 according to their harmfulness to viruses, i.e. the higher the number, the more harmful the bacteria are. Suppose we have different numbers of bacteria (frequency) according to their value-of-danger as shown in the following table:
Frequency (number of bacteria)   Value-of-danger
200,000                          1
300,000                          2
500,000                          3
900,000                          4
1,000,000                        5
900,000                          6
500,000                          7
300,000                          8
200,000                          9
Then a frequency distribution plot could be developed as follows:
[Figure: Frequency distribution plot, with Value-of-danger (0 to 10) on the x-axis and Frequency (number of bacteria, 0 to 1,200,000) on the y-axis.]
This frequency distribution plot shows a “bell curve” as it looks like a bell. Technically, this type of distribution is known as normal or Gaussian distribution, named after the mathematician Carl Friedrich Gauss. This type of distribution is commonly found in nature; for example, blood sugar levels and heart rates follow this type of distribution. Normal distribution has three important characteristics.
1. In a normal distribution, mean, median, and mode are equal to each other.
2. In this type of distribution, there is symmetry about the central point.
3. Half of the values, i.e. 50%, are less than the mean value, and half of the values, i.e. 50%, are more than the mean value.
In the above illustration, the value-of-danger “5” is the mean and median, and as it is the most commonly found value (it has the largest number of bacteria), it is also the mode. Disturbance in the values would result in the disturbance of the normal distribution, thereby leading to a non-normal or non-Gaussian distribution in which there is no appropriate bell-shaped curve. Frequency distribution has a close relationship with standard deviations:
About 68.27% of all values lie within one standard deviation of the mean on both sides, i.e., total of two standard deviations,
About 95% of all values lie within two standard deviations of the mean on both sides, i.e. total of four standard deviations, and
About 99.7% of all values lie within three standard deviations of the mean on both sides, i.e., total of six standard deviations.
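These three percentages can be verified for a normal distribution with the `NormalDist` class of Python's standard `statistics` module. A minimal sketch using the standard normal distribution (mean 0, standard deviation 1):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of values lying within k standard deviations on both sides of the mean
coverage = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, proportion in coverage.items():
    print(f"within {k} standard deviation(s): {proportion:.2%}")
```

The more precise value for two standard deviations is about 95.45%, which is usually rounded to 95%.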
The number of standard deviations from the mean is known as “Standard Score”, “z-score”, or “sigma”, and in order to convert a value to a z-score or Standard Score, subtract the mean and then divide the result by the Standard Deviation. It is represented as
$$z = \frac{x - \mu}{\sigma}$$
in which z is the z-score; x is the value that has to be standardized; μ is the mean value, and σ is the Standard Deviation. This process of getting a z-score is known as “Standardizing.” Let’s consider the frequency distribution and frequency distribution plot again:
[Figure: Frequency distribution plot, with Value-of-danger (0 to 10) on the x-axis and Frequency (number of bacteria, 0 to 1,200,000) on the y-axis.]
And its values:
Frequency (number of bacteria)   Value-of-danger
200,000                          1
300,000                          2
500,000                          3
900,000                          4
1,000,000                        5
900,000                          6
500,000                          7
300,000                          8
200,000                          9
In this table, the mean value is 5, and its Standard Deviation can be calculated from the variance. Taking the nine values-of-danger (1 to 9) as the population, the variance is calculated as
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{(1-5)^2 + (2-5)^2 + \dots + (9-5)^2}{9} = \frac{60}{9} \approx 6.67$$
So, Standard Deviation = σ = 2.58. In order to get the z-score of the first value, i.e. 1, first subtract the mean. So, it will be 1-5 = -4. Then the value will be divided by the Standard Deviation. So, it will be -4/2.58 = -1.55. So, the z-score will be -1.55, i.e. the value-of-danger 1 will be -1.55 Standard Deviations from the mean. If we calculate the z-scores of all the values and place them in the table, we get:
Frequency (number of bacteria)   Value-of-danger   z-score
200,000                          1                 (1-5)/2.58 = -1.55
300,000                          2                 (2-5)/2.58 = -1.16
500,000                          3                 (3-5)/2.58 = -0.78
900,000                          4                 (4-5)/2.58 = -0.39
1,000,000                        5                 (5-5)/2.58 = 0
900,000                          6                 (6-5)/2.58 = 0.39
500,000                          7                 (7-5)/2.58 = 0.78
300,000                          8                 (8-5)/2.58 = 1.16
200,000                          9                 (9-5)/2.58 = 1.55
From this table, Standard Normal Distribution Graph could also be obtained in which z-scores are along x-axis and frequency (number of bacteria) is along y-axis.
The graph shows that nearly 68.27% of the values in the “value-of-danger” are present within one standard deviation of the mean on both sides, i.e. from -1 to 1. In that case, 68.27% is also considered as a confidence interval between the upper limit of z-score = 1 and the lower limit of z-score = -1. On a further note, nearly 95% of all values are present within two standard deviations of the mean on both sides, i.e. from -2 to 2.
In case of disturbance in the values, the normal graph could change into a non-normal one, for example one showing a binomial or Poisson distribution.
Using Minitab
In order to calculate z-score, enter the values in the columns. We can take the example of “Frequency (number of bacteria)” and their “Value-of-danger”.
Frequency (number of bacteria)   Value-of-danger
200,000                          1
300,000                          2
500,000                          3
900,000                          4
1,000,000                        5
900,000                          6
500,000                          7
300,000                          8
200,000                          9
In this case, we can enter “Frequency (number of bacteria)” in the first column (C1) ranging from rows 1 to 9, and “Value-of-danger” in the second column (C2) ranging from rows 1 to 9. Now go to “Calc” tab above and click “Standardize” in the 1st section. In the window that appears, in the “Input column(s):” enter the name of the column for which z-scores have to be calculated. In this case, we have C2 to enter into this box. In the box below “Store results in:” enter the name of the column in which you want to show your results. In this case, we can enter C5.
Click “OK.” Results appear in Column C5. Results are as follows:
-1.46059
-1.09545
-0.73030
-0.36515
0.00000
0.36515
0.73030
1.09545
1.46059
These are the values of z-scores, so you can give C5 the name “z-scores”. It is important to note that these z-scores are calculated according to the sample standard deviation, not according to the population standard deviation as calculated above manually. A graph of z-scores on x-axis and frequency (number of bacteria) on y-axis can also be plotted. Go to “Graph”, click “Scatterplot”, select “With Connect Line”, and click “OK.” A window “Scatterplot: With Connect Line” appears. Move C1 to the cell under “Y variables” and C5 under “X variables”. Click “OK.” The following graph is obtained:
[Figure: Scatterplot of Frequency (number of bacteria) vs z-scores, with z-scores (-1.5 to 1.5) on the x-axis and Frequency (number of bacteria, 100,000 to 1,000,000) on the y-axis.]
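The difference between the manually calculated z-scores (about ±1.55 at the extremes) and the Minitab z-scores (about ±1.46) comes only from the standard deviation used. A minimal sketch with Python's standard `statistics` module, standardizing the values-of-danger 1 to 9 both ways:

```python
import statistics

danger_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
mean = statistics.mean(danger_values)

population_sd = statistics.pstdev(danger_values)   # divides by N, as in the manual calculation
sample_sd = statistics.stdev(danger_values)        # divides by N - 1, as in the Minitab output

z_population = [(x - mean) / population_sd for x in danger_values]
z_sample = [(x - mean) / sample_sd for x in danger_values]

for x, zp, zs in zip(danger_values, z_population, z_sample):
    print(x, round(zp, 2), round(zs, 5))
```

The z-scores based on the sample standard deviation are slightly closer to zero because the sample standard deviation (about 2.74) is larger than the population standard deviation (about 2.58).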
Prediction interval
It is a modified form of confidence interval that covers a range of values likely containing a future value. It works on the basis of pre-existing values and some sort of regression analysis.
Formula: Sample estimate (predicted value or fitted value) ± t-multiplier × Standard error of the prediction. When the predictor is x_h, the prediction interval is
$$\hat{y}_h \pm t_{(\alpha/2,\,n-2)} \sqrt{MSE \left(1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right)}$$
The standard error of the prediction is similar to the standard error of the fit, which is used in the confidence interval formula. Check the difference from the confidence interval (the additional 1 inside the parentheses):
$$\hat{y}_h \pm t_{(\alpha/2,\,n-2)} \sqrt{MSE \left(\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right)}$$
Prediction intervals are wider as compared to confidence intervals.
Using Minitab
In order to work on prediction interval, we can use the following example of weights and heights of 8 males in the age range of 18 years to 28 years:
Serial no.   Weight of person (kg)   Height of person (cm)
1            84.2                    189
2            98.8                    178
3            62.6                    176
4            73.6                    172
5            67.4                    168
6            77.9                    179
7            72.9                    186
8            68.6                    164
In order to find the 95% prediction interval for the weight of an individual who is randomly selected and who is 168 cm tall, and to find the 95% confidence interval for the average weight of all individuals who are 168 cm tall, we enter the values in the columns. In this case, the values of “Serial no.” are entered in the first column, i.e. C1; the values of “Weight of person (kg)” are entered in the second column, i.e. C2, and the values of “Height of person (cm)” are entered in the third column, i.e. C3. Now go to “Stat”, go to “Regression”, go to “Regression”, and click “Fit Regression Model”. In the box under “Responses:” move C2, and in the box under “Continuous predictors:” move C3. Click “OK.” This gives the following results:
Regression Equation
Weight of person (kg) = -27.6 + 0.586 Height of person (cm)
Now again go to “Stat”, go to “Regression”, go to “Regression”, and click “Predict”. In the box with “Response:” select (or already selected) “Weight of person (kg)”. In the drop-down menu, select “Enter individual values”, and in the top box under “Height of person (cm)”, enter 168. Click “OK.” This gives the following results:
Prediction
Fit       SE Fit    95% CI               95% PI
70.7731   5.75709   (56.6860, 84.8602)   (40.1473, 101.399)
Here, 95% CI is the 95% Confidence Interval, and 95% PI is the 95% Prediction Interval. So, the 95% prediction interval for the weight of a randomly selected individual who is 168 cm tall is 40.1473 to 101.399 kg, and the 95% confidence interval for the average weight of all individuals who are 168 cm tall is 56.6860 to 84.8602 kg. In other words, we can be 95% confident that the weight of a single future individual who is 168 cm tall will fall within the range of 40.1473 to 101.399 kg.
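The fit and both intervals can be reproduced step by step with simple linear regression formulas. The sketch below uses only the standard library; because the standard library has no t-distribution, the t-multiplier for 95% intervals with n − 2 = 6 degrees of freedom is entered as a constant (about 2.4469, taken from t-tables), which is an assumption of this sketch.

```python
import math

heights = [189, 178, 176, 172, 168, 179, 186, 164]            # predictor, cm
weights = [84.2, 98.8, 62.6, 73.6, 67.4, 77.9, 72.9, 68.6]    # response, kg
x_new = 168                                                   # height to predict for
T_CRITICAL = 2.446912   # t-multiplier for 95% intervals with 6 degrees of freedom (from tables)

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(weights) / n
sxx = sum((x - x_bar) ** 2 for x in heights)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))

# Least-squares regression line and fitted value at x_new
slope = sxy / sxx
intercept = y_bar - slope * x_bar
fit = intercept + slope * x_new

# Mean squared error of the regression, with n - 2 degrees of freedom
residuals = [y - (intercept + slope * x) for x, y in zip(heights, weights)]
mse = sum(r ** 2 for r in residuals) / (n - 2)

# Standard error of the fit, and standard error of the prediction (extra "1 +")
leverage = 1 / n + (x_new - x_bar) ** 2 / sxx
se_fit = math.sqrt(mse * leverage)
se_prediction = math.sqrt(mse * (1 + leverage))

confidence_interval = (fit - T_CRITICAL * se_fit, fit + T_CRITICAL * se_fit)
prediction_interval = (fit - T_CRITICAL * se_prediction, fit + T_CRITICAL * se_prediction)

print("fit:", round(fit, 4), "SE fit:", round(se_fit, 5))
print("95% CI:", tuple(round(v, 3) for v in confidence_interval))
print("95% PI:", tuple(round(v, 3) for v in prediction_interval))
```

The prediction interval is wider because it includes the scatter of individual weights around the regression line in addition to the uncertainty of the fitted line itself.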
Tolerance interval
An interval, also known as enclosure interval, covering a fixed proportion (or percent) of the population with a particular confidence. The endpoints, also known as a maximum value and a minimum value, are known as tolerance limits.
For a normally distributed population, the tolerance factor k (due to Howe) is
$$k = z_{(p+1)/2} \sqrt{\frac{df\,(1 + 1/n)}{\chi^2_{1-\alpha,\,df}}}$$
in which z_{(p+1)/2} is the critical value of the normal distribution related to cumulative probability (p+1)/2, and χ²_{1−α,df} is the critical value of the chi-square distribution with df degrees of freedom exceeded with probability α.
Tolerance limits are different from confidence limits:
Tolerance limit: A limit within which a specific proportion (or percentage) of the population may lie with stated confidence, e.g. with 95% probability, 95% of the population lies between the lower tolerance limit and the upper tolerance limit.
Confidence limit: A limit within which a given population parameter, such as the mean, may lie with stated confidence, e.g. with 95% probability, the mean lies between the lower confidence limit and the upper confidence limit.
Using Minitab
In order to work on tolerance interval, we can use the following example:
Serial no.   Weight of person (kg)
1            84.2
2            98.8
3            62.6
4            73.6
5            67.4
6            77.9
7            72.9
8            68.6
Suppose we want to know the tolerance interval, i.e. the weight range within which the weights of at least 95% of all men in a community lie. We can enter the values in the columns. In this case, the values of “Serial no.” are entered in the first column, i.e. C1, and the values of “Weight of person (kg)” are entered in the second column, i.e. C2. Now go to “Stat”, go to “Quality Tools”, and click “Tolerance Intervals (Normal Distribution)”. From the drop-down menu, select “One or more samples, each in a column”, and in the box, enter C2, i.e. “Weight of person (kg)”. Click “OK.” This gives the following results:
95% Tolerance Interval
Variable                Normal Method       Nonparametric Method   Achieved Confidence
Weight of person (kg)   (32.955, 118.545)   (62.600, 98.800)       5.7%
Achieved confidence level applies only to nonparametric method.
In this graph, the Normal Probability Plot shows that the plotted points follow an approximately straight line. Furthermore, the p-value for the Normality Test is 0.364, which is >0.05. Therefore, there is no evidence that the data deviate from a normal distribution, and the Normal Method results (obtained in the table above) can be used. According to these results, it can be said with 95% confidence that the weights of at least 95% of all men lie in the range of 32.955 kg to 118.545 kg.
Parameters to describe the form of a distribution
Different parameters to describe the form of a distribution are (1) location parameters, (2) scale parameters, and (3) shape parameters.
Location parameter (x0): To determine the shift or location of a distribution, f_x0(x) = f(x − x0). Examples: mean, median, mode.
Scale parameter (s): To describe the width of a probability distribution, f_s(x) = f(x/s)/s. A large s gives a more spread out distribution; a small s gives a more concentrated distribution.
Shape parameters: To describe all other parameters beyond the location parameter and the scale parameter. Examples: skewness, kurtosis.
Skewness
Skewness helps in knowing about the symmetry (or the lack of symmetry) of a distribution or data set. A distribution is said to be symmetric if it appears the same to the right and left of the center point.
Kurtosis
It is a measure of the degree of peakedness or flatness of a frequency-distribution curve, i.e. the extent to which a distribution is showing a peak or flatness.
Using Minitab
Suppose 14 people work in a company. They have different salaries as noted in the table below:
Salary (USD) per month   Number of people
500                      1
750                      2
1000                     3
1250                     5
1500                     2
1750                     1
So, the table means that the salary USD 500 is repeated 1 time; the salary USD 750 is repeated 2 times; the salary USD 1000 is repeated 3 times, and so on. So, from the above mentioned information, we can make the following table:
Salary (USD) per month
500
750
750
1000
1000
1000
1250
1250
1250
1250
1250
1500
1500
1750
Enter the values in this way in the first column, i.e. from row 1 to row 14 in this case.
Go to the “Stat” tab above, go to “Basic Statistics”, and click “Display Descriptive Statistics”.
In the window “Display Descriptive Statistics”, move column 1, i.e. C1 in this case, from the left box to “Variables:” in the right box by clicking on C1 and clicking the “Select” button below.
Click “Statistics”. A window “Display Descriptive Statistics: Statistics” appears. Check “Skewness” and “Kurtosis”, and any other descriptive statistics you may want. Click “OK.”
Click “Graphs” and select “Histogram of data, with normal curve”. Click “OK.”
Click “OK” again in the “Display Descriptive Statistics” window.
Results appear on the screen above the spreadsheet. In this case, the results are as follows:

Variable                  Skewness    Kurtosis
Salary (USD) per month    -0.18       -0.09

And the following graph appears:
[Figure: Histogram (with Normal Curve) of Salary (USD) per month. Legend: Mean 1143, StDev 335.6, N 14. X-axis: Salary (USD) per month; y-axis: Frequency.]
So, the results show that the data have slightly negative skewness (-0.18) and slightly negative kurtosis (-0.09).
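The skewness and kurtosis values can be cross-checked outside Minitab. The sketch below assumes the bias-adjusted sample formulas for skewness and excess kurtosis; that these are the definitions Minitab applies is an assumption here, although the results agree with the output above after rounding.

```python
from math import sqrt

# One salary per person, as entered in column C1.
salaries = [500, 750, 750, 1000, 1000, 1000,
            1250, 1250, 1250, 1250, 1250, 1500, 1500, 1750]

n = len(salaries)
mean = sum(salaries) / n
s = sqrt(sum((x - mean) ** 2 for x in salaries) / (n - 1))  # sample standard deviation

z3 = sum(((x - mean) / s) ** 3 for x in salaries)
z4 = sum(((x - mean) / s) ** 4 for x in salaries)

# Bias-adjusted sample skewness and excess kurtosis (assumed definitions).
skewness = n / ((n - 1) * (n - 2)) * z3
kurtosis = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

print(round(mean, 1), round(s, 1), round(skewness, 2), round(kurtosis, 2))
```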
Different functions of distributions
Different functions of probability distributions represent different aspects of those distributions. These functions include (1) probability density function (PDF), (2) cumulative distribution function (CDF), (3) survival function (SF), (4) percentile point function (PPF), and (5) inverse survival function (ISF).
Different functions of probability distributions:
Probability density function (PDF): to get the probability of a value in a certain interval. Example: What is the probability that the age of a man is between 40 and 45 years?
Cumulative distribution function (CDF): to obtain the probability of a value less than a given value. Example: What is the probability that the age of a man is less than 45 years?
Survival function (SF), i.e. 1 - CDF: to obtain the probability of a value larger than a given value. Example: What is the probability that the age of a man is more than 45 years?
Percentile point function (PPF), i.e. the inverse of CDF: to obtain a value less than a given value when the probability is given. Example: The probability that the age of a man is less than 95% of other men, what is the age of the man?
Inverse survival function (ISF), i.e. the inverse of SF: to obtain a value more than a given value when the probability is given. Example: The probability that the age of a man is more than 95% of other men, what is the age of the man?
[Figure: graphical examples of the PDF, CDF, SF, PPF, and ISF, marked with the points a and b.]
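The five functions listed above can be illustrated with a short sketch. It assumes, purely for illustration, that the ages of men follow a normal distribution with a mean of 40 years and a standard deviation of 10 years; these numbers are hypothetical and do not come from the book.

```python
from statistics import NormalDist

# Hypothetical model: ages of men are Normal(mean=40, sd=10) years.
age = NormalDist(mu=40, sigma=10)

p_between = age.cdf(45) - age.cdf(40)   # area under the PDF between 40 and 45
p_less = age.cdf(45)                    # CDF: P(age < 45)
p_more = 1 - age.cdf(45)                # SF: P(age > 45)
age_ppf = age.inv_cdf(0.95)             # PPF: age with 95% of men below it
age_isf = age.inv_cdf(1 - 0.95)         # ISF: age with 95% of men above it

print(round(p_between, 4), round(p_less, 4), round(p_more, 4))
print(round(age_ppf, 2), round(age_isf, 2))
```

The SF is obtained as 1 - CDF and the ISF as the inverse CDF evaluated at 1 - p, which mirrors the relations named in the list.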
Probability Density Function It shows probability distribution for a continuous random variable.
Using Minitab
In order to work on probability density function, we can use the same example as noted above.
In the first column, i.e. C1, enter the values 0 in the first row, 1 in the second row, 2 in the third row, and so on, until 24 in the 25th row.
Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Uniform” in the first section.
Select “Probability density”. In the “Lower endpoint:” enter 0, in the “Upper Endpoint:” enter 24, and in the “Input column:” enter C1. Click “OK.”
This gives the following result:

Continuous uniform on 0 to 24

x     f( x )
0     0.0416667
1     0.0416667
2     0.0416667
3     0.0416667
4     0.0416667
5     0.0416667
6     0.0416667
7     0.0416667
8     0.0416667
9     0.0416667
10    0.0416667
11    0.0416667
12    0.0416667
13    0.0416667
14    0.0416667
15    0.0416667
16    0.0416667
17    0.0416667
18    0.0416667
19    0.0416667
20    0.0416667
21    0.0416667
22    0.0416667
23    0.0416667
24    0.0416667

As we have to check for the hours between 4 pm and 6 pm, there are 2 hours. The probability is therefore 2×0.0416667=0.083333.
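The same calculation can be written as a small sketch. It assumes the continuous uniform distribution on 0 to 24 hours used in the Minitab dialog, and it expresses 4 pm and 6 pm as 16 and 18 on the 24-hour clock; the function names are illustrative only.

```python
# Continuous uniform distribution on [lower, upper] hours, as in the Minitab dialog.
lower, upper = 0.0, 24.0

def uniform_pdf(x):
    """Constant density 1/(upper - lower) inside the interval, 0 outside."""
    return 1.0 / (upper - lower) if lower <= x <= upper else 0.0

def uniform_cdf(x):
    """P(X <= x) for the continuous uniform distribution."""
    if x <= lower:
        return 0.0
    if x >= upper:
        return 1.0
    return (x - lower) / (upper - lower)

# Probability of the 2-hour window from 16:00 (4 pm) to 18:00 (6 pm).
p_window = uniform_cdf(18.0) - uniform_cdf(16.0)
print(round(uniform_pdf(5.0), 7), round(p_window, 6))
```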
Cumulative Distribution Function
Using Minitab
In order to work on cumulative distribution function, we can use the same example as noted above.
In the first column, i.e. C1, enter the values 0 in the first row, 1 in the second row, 2 in the third row, and so on, until 24 in the 25th row.
Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Uniform”.
Select “Cumulative probability”. In the “Lower endpoint:” enter 0, in the “Upper Endpoint:” enter 24, and in the “Input column:” enter C1. Click “OK.”
This gives the following result:

Cumulative Distribution Function
Continuous uniform on 0 to 24

x     P( X ≤ x )
0     0.00000
1     0.04167
2     0.08333
3     0.12500
4     0.16667
5     0.20833
6     0.25000
7     0.29167
8     0.33333
9     0.37500
10    0.41667
11    0.45833
12    0.50000
13    0.54167
14    0.58333
15    0.62500
16    0.66667
17    0.70833
18    0.75000
19    0.79167
20    0.83333
21    0.87500
22    0.91667
23    0.95833
24    1.00000
25    1.00000

So, the cumulative distribution function for 2 hours is 0.08333. With the probability density function, it can be seen that every hour has equal probability, and with the cumulative distribution function, it can be seen that the probability increases with the addition of every hour.
Types of Distribution
Binomial Distribution
Using Minitab
In order to work on Binomial Probability Distribution, we can use the same example as above.
In the first column, i.e. C1, enter the values 1 in the first row, 2 in the second row, 3 in the third row, and so on, until 8 in the 8th row.
Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Binomial” in the 2nd section.
Select “Probability”. In the “Number of trials:” enter 8, in the “Event probability:” enter 0.5, and in the “Input column:” enter C1. Click “OK.”
This gives the following result:

Binomial with n = 8 and p = 0.5

x    P( X = x )
1    0.031250
2    0.109375
3    0.218750
4    0.273438
5    0.218750
6    0.109375
7    0.031250
8    0.003906

As we need the likelihood of no less than 6, we consider the values for 6, 7, and 8. These are 0.109375, 0.031250, and 0.003906, respectively. These values add up to 0.144531, which is equal to 37/256 (as noted in the example above).
In order to make a graph, go to the “Graph” tab above, go to “Probability Distribution Plot” in the 2nd section, select “View Single”, and click “OK.”
Now select “Binomial” under “Distribution:”. In the “Number of trials:” enter 8, and in the “Event Probability:” enter 0.5. This gives the following graph:
[Figure: Distribution Plot, Binomial, n=8, p=0.5. X-axis: X; y-axis: Probability.]
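The binomial probabilities in the table above can be cross-checked with a few lines of code. This is a minimal sketch using the usual binomial formula with n = 8 and p = 0.5 from the example; the function name is illustrative only.

```python
from math import comb

n, p = 8, 0.5  # number of trials and event probability from the example

def binomial_pmf(x):
    """P(X = x) for a binomial distribution with n trials and probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

for x in range(1, n + 1):
    print(x, f"{binomial_pmf(x):.6f}")

# Likelihood of no less than 6, i.e. P(X = 6) + P(X = 7) + P(X = 8) = 37/256.
p_at_least_6 = sum(binomial_pmf(x) for x in (6, 7, 8))
print(f"{p_at_least_6:.6f}")
```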
Chi-squared Distribution
Using Minitab
In order to work on Minitab, we can take the example of a classroom with a teacher and students in their seats. It is reasonable to suppose that the students seated in the first row will get more information from the teacher than the students seated in the 7th row. In this case, it can be said that obtaining the information is a function of the distance of the row from the teacher. It can be considered as a Chi-Squared distribution with 1 degree of freedom, i.e. parameter k=1. On the basis of this information, suppose there are 7 rows in the classroom.
In the first column, i.e. C1, enter the values 1 in the first row, 2 in the second row, 3 in the third row, and so on, until 7 in the 7th row.
Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Chi-square”. A window “Chi-Square Distribution” appears.
Select “Probability density”. In the “Degrees of freedom:” enter 1, and in the “Input column:” enter C1. Click “OK.”
This gives the following result:

Chi-Square with 1 DF

x    f( x )
1    0.241971
2    0.103777
3    0.051393
4    0.026995
5    0.014645
6    0.008109
7    0.004553

This result shows that the density is highest for the first row, i.e. learning is most likely in the first row.
In order to make a graph, go to the “Graph” tab above, go to “Probability Distribution Plot”, select “View Single”, and click “OK.”
Now select, “Chi-Square” under “Distribution:” In the “Degrees of freedom:” enter 1, and click “OK”. This gives the following graph:
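The density values tabulated above for the 7 rows can be reproduced from the chi-square density formula. The sketch below assumes the standard formula for k degrees of freedom; the function name is illustrative only.

```python
from math import exp, gamma

def chi_square_pdf(x, k):
    """Density of the chi-square distribution with k degrees of freedom, for x > 0."""
    return x ** (k / 2 - 1) * exp(-x / 2) / (2 ** (k / 2) * gamma(k / 2))

# Rows 1 to 7 of the classroom, with 1 degree of freedom.
for row in range(1, 8):
    print(row, f"{chi_square_pdf(row, 1):.6f}")
```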
Continuous Uniform Distribution
Using Minitab
In order to work on Continuous Uniform Distribution, we can use the same example as noted above.
In the first column, i.e. C1, enter the values 0 in the first row, 1 in the second row, 2 in the third row, and so on, until 30 in the 31st row.
Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Uniform”.
Select “Probability density”. In the “Lower endpoint:” enter 0, in the “Upper Endpoint:” enter 30, and in the “Input column:” enter C1. Click “OK.”
This gives the following result:

Continuous uniform on 0 to 30

x     f( x )
0     0.0333333
1     0.0333333
2     0.0333333
3     0.0333333
4     0.0333333
5     0.0333333
6     0.0333333
7     0.0333333
8     0.0333333
9     0.0333333
10    0.0333333
11    0.0333333
12    0.0333333
13    0.0333333
14    0.0333333
15    0.0333333
16    0.0333333
17    0.0333333
18    0.0333333
19    0.0333333
20    0.0333333
21    0.0333333
22    0.0333333
23    0.0333333
24    0.0333333
25    0.0333333
26    0.0333333
27    0.0333333
28    0.0333333
29    0.0333333
30    0.0333333

As we have to check for reactions within 5 seconds, the values below 5 seconds are added. This is equal to 5×0.0333333=0.166667 (=1/6). For 20 contenders, 20×0.166667=3.3333 ≈ 3, so about 3 contenders are expected to react within 5 seconds.
Cumulative Poisson Distribution
Using Minitab
In order to work on Cumulative Poisson Distribution, we can use the same example as noted above.
In the first column, i.e. C1, enter the values 0 in the first row, 1 in the second row, 2 in the third row, and so on, until 7 in the 8th row.
Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Poisson”.
Select “Cumulative probability”. In the box with “Mean:” enter 7, and in the “Input column:” enter C1. Click “OK.”
This gives the following result:

Poisson with mean = 7

x    P( X ≤ x )
0    0.000912
1    0.007295
2    0.029636
3    0.081765
4    0.172992
5    0.300708
6    0.449711
7    0.598714

Here, P(X=2) can be calculated by subtracting P(X ≤ 1), i.e. 0.007295, from P(X ≤ 2), i.e. 0.029636. So, the value is 0.022341. Based on this result, it can be said that there is about a 2.23% chance that exactly 2 errors occur in 5000 randomly selected lines of code.
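The cumulative Poisson values and the subtraction step can be checked with a short sketch. It assumes the standard Poisson formula with the mean of 7 used in the example; the function names are illustrative only.

```python
from math import exp, factorial

mean = 7  # Poisson mean from the example

def poisson_pmf(x):
    """P(X = x) for a Poisson distribution with the given mean."""
    return mean ** x * exp(-mean) / factorial(x)

def poisson_cdf(x):
    """P(X <= x), obtained by summing the probabilities from 0 to x."""
    return sum(poisson_pmf(k) for k in range(x + 1))

for x in range(8):
    print(x, f"{poisson_cdf(x):.6f}")

# P(X = 2) as the difference of two cumulative probabilities.
p_exactly_2 = poisson_cdf(2) - poisson_cdf(1)
print(f"{p_exactly_2:.6f}")
```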
Exponential Distribution
Using Minitab
Suppose a particular type of light bulb has a mean lifetime of about 1200 hours. Suppose we have 8 bulbs of that type with the following determined lifetimes.

Bulb number    Lifetime
1              3783.74
2              2243.01
3              1027.74
4              1202.26
5              778.17
6              7471.94
7              1834.41
8              528.42
In order to work on Exponential Distribution, we can use this example.
In the first column, i.e. C1, enter the values 3783.74 in the first row, 2243.01 in the second row, 1027.74 in the third row, and so on, until 528.42 in the 8th row.
Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Exponential”.
Select “Probability density”. In the “Scale:” enter 1200, and in the “Threshold:” enter 0.0. In the “Input column:” enter C1. Click “OK.”
This gives the following result:

Exponential with mean = 1200

x          f( x )
3783.74    0.0000356
2243.01    0.0001285
1027.74    0.0003539
1202.26    0.0003060
778.17     0.0004357
7471.94    0.0000016
1834.41    0.0001807
528.42     0.0005365

This shows that the density is higher for lifetimes below 1200 hours.
In order to make a graph, go to the “Graph” tab above, go to “Probability Distribution Plot”, select “View Single”, and click “OK.”
Now select “Exponential” under “Distribution:”. In the “Scale:” enter 1200, and in the “Threshold:” enter 0.0. Click “OK.” This gives the following graph:
[Figure: Distribution Plot, Exponential, Scale=1200, Thresh=0. X-axis: X; y-axis: Density.]
You can see the X and Y values of the graph by double clicking on the graph, selecting the third option that looks like a + sign, and that is “Pinpoint X and Y Coordinates”, and taking the cursor to any point on the graph.
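The exponential density values listed for the 8 bulbs can also be computed directly. The sketch below assumes the standard exponential density with scale 1200 and threshold 0, as entered in the Minitab dialog; the function name is illustrative only.

```python
from math import exp

scale = 1200.0  # mean lifetime in hours ("Scale" in the Minitab dialog)
lifetimes = [3783.74, 2243.01, 1027.74, 1202.26, 778.17, 7471.94, 1834.41, 528.42]

def exponential_pdf(x):
    """Density of the exponential distribution with the given scale and threshold 0."""
    return exp(-x / scale) / scale if x >= 0 else 0.0

for x in lifetimes:
    print(x, f"{exponential_pdf(x):.7f}")
```

Because the density decreases steadily with x, the shortest lifetime in the list always has the largest density value.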
Normal Distribution
Using Minitab
In order to work on Normal Distribution, we can use the same example as noted above.
In the first column, i.e. C1, enter the values 26 in the first row, 33 in the second row, 65 in the third row, and so on, until 34 in the 20th row.
Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Normal”.
Select “Probability density”. In the “Mean:” enter 38.8, and in the “Standard deviation:” enter 11.4. In the “Input column:” enter C1. Click “OK.”
This gives the following result:

Normal with mean = 38.8 and standard deviation = 11.4

x     f( x )
26    0.0186315
33    0.0307466
65    0.0024949
28    0.0223416
34    0.0320264
55    0.0127497
25    0.0168191
44    0.0315373
50    0.0215978
36    0.0339551
26    0.0186315
37    0.0345614
43    0.0326987
62    0.0044124
35    0.0331038
38    0.0349089
45    0.0301840
32    0.0292917
28    0.0223416
34    0.0320264

This shows that, among the listed values, the highest density is at 38 minutes (0.0349089), the value closest to the mean of 38.8, followed by 37 minutes (0.0345614).
In order to make a graph, go to the “Graph” tab above, go to “Probability Distribution Plot”, select “View Single”, and click “OK.”
Now select “Normal” under “Distribution:”. In the “Mean:” enter 38.8, and in the “Standard Deviation:” enter 11.4. Click “OK.” This gives the following graph:
[Figure: Distribution Plot, Normal, Mean=38.8, StDev=11.4. X-axis: X; y-axis: Density.]
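The normal density values for the 20 observations can be recomputed with the Python standard library. This is a minimal sketch that uses the mean of 38.8 and the standard deviation of 11.4 from the example; the variable names are illustrative only.

```python
from statistics import NormalDist

times = [26, 33, 65, 28, 34, 55, 25, 44, 50, 36,
         26, 37, 43, 62, 35, 38, 45, 32, 28, 34]

dist = NormalDist(mu=38.8, sigma=11.4)  # mean and standard deviation from the example

for x in times:
    print(x, f"{dist.pdf(x):.7f}")

# Observed value with the highest density (the one closest to the mean).
most_likely = max(times, key=dist.pdf)
print(most_likely)
```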
Poisson Distribution
Using Minitab
In order to work on Poisson Distribution, we can use the same example as noted above.
In the first column, i.e. C1, enter the values 0 in the first row, 1 in the second row, 2 in the third row, and so on, until 11 in the 12th row.
Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Poisson”.
Select “Probability”. In the “Mean:” enter 8, and in the “Input column:” enter C1. Click “OK.”
This gives the following result:

Poisson with mean = 8

x     P( X = x )
0     0.000335
1     0.002684
2     0.010735
3     0.028626
4     0.057252
5     0.091604
6     0.122138
7     0.139587
8     0.139587
9     0.124077
10    0.099262
11    0.072190

This shows that on a given weekday, the probability of 11 cars visiting the park is 0.072.
In order to make a graph, go to the “Graph” tab above, go to “Probability Distribution Plot”, select “View Single”, and click “OK.”
Now select “Poisson” under “Distribution:”. In the “Mean:” enter 8. Click “OK.” This gives the following graph:
[Figure: Distribution Plot, Poisson, Mean=8. X-axis: X; y-axis: Probability.]
Beta Distribution
Some example plots for beta probability density function and beta cumulative distribution function are presented below:
Using Minitab
Suppose the proportion of defective bulbs in a shipment follows a beta distribution with α = 3 and β = 6, and we want to know the probability that the shipment could have 30% to 40% of defective bulbs.
In the first column, i.e. C1, enter the values 0 in the first row, 0.1 in the second row, 0.2 in the third row, and so on, until 1.0 in the 11th row.
Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Beta”.
Select “Cumulative probability”. In the “First shape parameter:” enter 3, and in the “Second shape parameter:” enter 6. In the “Input column:”, enter C1. Click “OK.”
This gives the following result:

Beta with first shape parameter = 3 and second = 6

x      P( X ≤ x )
0.0    0.00000
0.1    0.03809
0.2    0.20308
0.3    0.44823
0.4    0.68461
0.5    0.85547
0.6    0.95019
0.7    0.98871
0.8    0.99877
0.9    0.99998
1.0    1.00000

The probability that the shipment could have 30% to 40% of defective bulbs can be calculated by subtracting the value 0.44823 (for 0.3) from 0.68461 (for 0.4). So, the answer (probability) is 0.68461 - 0.44823 = 0.23638.
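The beta result can be verified without Minitab. The sketch below relies on a known identity that holds only when both shape parameters are positive integers: the beta CDF at x equals the probability of at least α successes in α + β - 1 binomial trials with success probability x. The function name is illustrative only.

```python
from math import comb

def beta_cdf_integer_shapes(x, a, b):
    """P(X <= x) for Beta(a, b) with positive integer a and b.

    Uses the identity I_x(a, b) = P(Y >= a) with Y ~ Binomial(a + b - 1, x).
    """
    n = a + b - 1
    return sum(comb(n, j) * x ** j * (1 - x) ** (n - j) for j in range(a, n + 1))

a, b = 3, 6  # shape parameters from the example
p_30_to_40 = beta_cdf_integer_shapes(0.4, a, b) - beta_cdf_integer_shapes(0.3, a, b)
print(f"{p_30_to_40:.5f}")
```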
F Distribution F Distribution is most commonly used in the Analysis of Variance and the F test.
Some example plots for probability density function of F distribution are presented below:
Using Minitab
Suppose a statistic with the value 2 follows an F-distribution, and the independent random samples have sizes n1 = 27 and n2 = 99999. This 99999 represents an infinite sample size. So, we have the degrees of freedom n1 - 1 = 26 for the numerator, and 99999 is used for the denominator.
Now go to the “Calc” tab above, go to “Probability Distributions”, and click “F”.
Select “Cumulative probability”. In the “Numerator degrees of freedom:” enter 26, and in the “Denominator degrees of freedom:” enter 99999. Select “Input constant:” and enter 2 in the box with it. Click “OK.”
This gives the following result:

Cumulative Distribution Function
F distribution with 26 DF in numerator and 99999 DF in denominator

x    P( X ≤ x )
2    0.998196
This value shows the area under the curve (i.e. area under the curved line in the graph) up to 2.
Gamma Distribution Gamma distribution is used for queuing models, and for financial and climatology services.
Some example plots for probability density function of gamma distribution are presented below:
Using Minitab
Suppose we don’t know the survival time of an experimental model (animal) that was exposed to a high dose of radiation. Suppose also that the survival time is in weeks, and follows the gamma distribution with α=9 and β=14.
In the first column, i.e. C1, enter the values 10 in the first row, 20 in the second row, 30 in the third row, and so on, until 140 in the 14th row.
Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Gamma”.
Select “Cumulative probability”. In the “Shape parameter:” enter 9, and in the “Scale parameter:” enter 14. In the “Input column:”, enter C1. Click “OK.”
This gives the following result:

Cumulative Distribution Function
Gamma with shape = 9 and scale = 14

x      P( X ≤ x )
10     0.000000
20     0.000019
30     0.000390
40     0.002776
50     0.011135
60     0.031140
70     0.068094
80     0.124706
90     0.200011
100    0.289715
110    0.387520
120    0.486695
130    0.581352
140    0.667180

In order to find the probability that the experimental model (animal) will survive a minimum of 40 weeks, we subtract the value for 40 from 1. Therefore, 1-0.002776=0.997224 is the probability of survival for a minimum of 40 weeks.
In order to find the probability that the experimental model (animal) will survive between 70 weeks and 130 weeks, we subtract their respective values. Therefore, 0.581352-0.068094=0.513258 is the probability of survival between 70 weeks and 130 weeks.
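The two survival probabilities can be recomputed with a short sketch. It uses the closed form of the gamma CDF that is available only when the shape parameter is a positive integer (the Erlang case), which applies here because α = 9; the function name is illustrative only.

```python
from math import exp, factorial

def gamma_cdf_integer_shape(x, shape, scale):
    """P(X <= x) for a gamma distribution whose shape is a positive integer.

    For integer shape the CDF has the closed (Erlang) form
    1 - sum_{k=0}^{shape-1} exp(-x/scale) * (x/scale)**k / k!.
    """
    t = x / scale
    return 1 - sum(exp(-t) * t ** k / factorial(k) for k in range(shape))

shape, scale = 9, 14  # alpha and beta from the example
p_at_least_40 = 1 - gamma_cdf_integer_shape(40, shape, scale)
p_70_to_130 = (gamma_cdf_integer_shape(130, shape, scale)
               - gamma_cdf_integer_shape(70, shape, scale))
print(f"{p_at_least_40:.6f}", f"{p_70_to_130:.6f}")
```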
Negative Binomial Distribution
Some example plots for probability density function of negative binomial distribution are presented below:
Using Minitab
Suppose we want to know the probability that the second six will be scored in the fifth throw in consecutive and independent throws of one die. The event probability (probability of success, p) is 1/6, the number of trials (x) is 5, and the number of successes (r) is 2.
Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Negative Binomial” in the 2nd section.
Select “Probability”. In the “Event probability:” enter 0.167 (i.e. 1/6), and in the “Number of events needed:” enter 2. Select “Input constant:”, and enter 5 in the box with it. Click “OK.”
This gives the following result:

Probability Density Function
Negative binomial with p = 0.167 and r = 2

x    P( X = x )
5    0.0644804

* NOTE * X = total number of trials.

Now, suppose that Peter is a high school footballer and he is successful in 65% of his penalties. What is the probability that Peter scores his second penalty on the fifth attempt? In this example: p=0.65, x=5, r=2. By entering these values, we get the following results:

Probability Density Function
Negative binomial with p = 0.65 and r = 2

x    P( X = x )
5    0.0724587

* NOTE * X = total number of trials.
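Both negative binomial results can be checked by hand or with a short sketch. It assumes the form of the distribution in which x counts the total number of trials, as stated in the Minitab note; the function name is illustrative only.

```python
from math import comb

def negative_binomial_pmf(x, r, p):
    """Probability that the r-th success occurs on trial x (x counts all trials)."""
    return comb(x - 1, r - 1) * p ** r * (1 - p) ** (x - r)

print(f"{negative_binomial_pmf(5, 2, 0.167):.7f}")  # die example, p rounded to 0.167
print(f"{negative_binomial_pmf(5, 2, 0.65):.7f}")   # penalty example
```

The factor comb(x - 1, r - 1) counts the ways in which the first r - 1 successes can fall among the first x - 1 trials, since the last trial must itself be a success.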
Gumbel Distribution It is used to model the distribution of extreme values, i.e. maximum or minimum values as, for example, peak temperatures of the year or maximum streamflow.
Using Minitab
The “Calculator” function could be used for the Gumbel Distribution, i.e. enter the formula associated with Gumbel Distribution in the box below “Expression:”.
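As an illustration of the kind of formula that could be typed into the “Expression:” box, the sketch below evaluates the density and cumulative probability of the Gumbel distribution for maximum values. The location of 30 and scale of 2 are hypothetical values chosen only for the example, and the function names are not Minitab functions.

```python
from math import exp

def gumbel_max_pdf(x, mu, beta):
    """Density of the Gumbel (maximum) distribution with location mu and scale beta."""
    z = (x - mu) / beta
    return exp(-(z + exp(-z))) / beta

def gumbel_max_cdf(x, mu, beta):
    """P(X <= x) for the Gumbel (maximum) distribution."""
    return exp(-exp(-(x - mu) / beta))

mu, beta = 30.0, 2.0  # hypothetical location and scale, e.g. yearly peak temperature
print(round(gumbel_max_pdf(32.0, mu, beta), 6), round(gumbel_max_cdf(32.0, mu, beta), 6))
```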
Hypergeometric Distribution
It is similar to binomial distribution, but the items are drawn without replacement.
Using Minitab
Suppose a deck of cards has 30 (N) cards in which there are 12 (M) red cards and 18 black cards. Suppose 7 (n) cards are taken without replacement. What would be the probability that exactly 6 (x) red cards are taken?
Go to the “Calc” tab above, go to “Probability Distributions”, and click “Hypergeometric”.
Select “Probability”. In the “Population size (N):” enter 30, in the “Event count in population (M):” enter 12, and in the “Sample size (n):” enter 7. Select “Input constant:”, and enter 6 in the box with it. Click “OK.”
This gives the following result:

Hypergeometric with N = 30, M = 12, and n = 7

x    P( X = x )
6    0.0081698

So, the probability is 0.0081698.
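The hypergeometric probability can be confirmed by counting combinations. This is a minimal sketch that uses the same symbols as the Minitab dialog (N, M, n, x); the function name is illustrative only.

```python
from math import comb

def hypergeometric_pmf(x, N, M, n):
    """P(exactly x event items in a sample of n drawn without replacement)."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# 6 red cards among 7 cards drawn from 30 cards that contain 12 red cards.
print(f"{hypergeometric_pmf(6, 30, 12, 7):.7f}")
```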
Inverse Gamma Distribution Inverse Gamma Distribution is a Pearson Type 5 Distribution that is most commonly used in Bayesian statistics.
Using Minitab
The “Calculator” function could be used to work on the Inverse Gamma Distribution.
Log Gamma Distribution
Using Minitab
The “Calculator” function could be used to work on the Log Gamma Distribution.
Laplace Distribution Laplace distribution is double exponential distribution.
Using Minitab
Suppose the return of a stock follows Laplace distribution with location (μ) = 7 and scale (λ) = 4. What would be the probability that the stock will return a value between 7 and 12?
In the first column, i.e. C1, enter the value 1 in the first row, 2 in the second row, 3 in the third row, and so on, until 12 in the 12th row.
Go to the “Calc” tab above, go to “Probability Distributions”, and click “Laplace”.
Select “Cumulative probability”. In the “Location:” enter 7, and in the “Scale:” enter 4. Select “Input column:”, and enter C1. Click “OK.”
This gives the following result:

Cumulative Distribution Function
Laplace with location = 7 and scale = 4

x     P( X ≤ x )
1     0.111565
2     0.143252
3     0.183940
4     0.236183
5     0.303265
6     0.389400
7     0.500000
8     0.610600
9     0.696735
10    0.763817
11    0.816060
12    0.856748

The probability that the stock will return a value between 7 and 12 can be calculated by subtracting the value 0.500000 (for 7) from 0.856748 (for 12). So, the answer is 0.856748 - 0.500000 = 0.356748.
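The Laplace calculation can be reproduced with the closed form of its cumulative distribution function. The sketch below uses the location of 7 and the scale of 4 from the example; the function name is illustrative only.

```python
from math import exp

def laplace_cdf(x, location, scale):
    """P(X <= x) for the Laplace (double exponential) distribution."""
    if x < location:
        return 0.5 * exp((x - location) / scale)
    return 1 - 0.5 * exp(-(x - location) / scale)

location, scale = 7, 4
p_7_to_12 = laplace_cdf(12, location, scale) - laplace_cdf(7, location, scale)
print(f"{p_7_to_12:.6f}")
```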
Geometric Distribution
Geometric distribution is related to trials with two probable outcomes (success or failure), and it describes the number of trials needed to get the first success.
Using Minitab
Suppose a person’s success probability is 0.3. What is the probability that the person has to select 5 people until he finds his first success, i.e. that the first success comes on the 5th trial?
Go to the “Calc” tab above, go to “Probability Distributions”, and click “Geometric”.
Select “Probability”. In the “Event probability:” enter 0.3. Select “Input constant:”, and enter 5. Click “OK.”
This gives the following result:

Probability Density Function
Geometric with p = 0.3

x    P( X = x )
5    0.07203

* NOTE * X = total number of trials.
So, the answer is 0.07203.
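The geometric probability is simple enough to verify directly: four failures followed by one success. The sketch assumes, as in the Minitab note, that x counts the total number of trials; the function name is illustrative only.

```python
def geometric_pmf(x, p):
    """Probability that the first success occurs on trial x (x counts all trials)."""
    return (1 - p) ** (x - 1) * p

print(f"{geometric_pmf(5, 0.3):.5f}")
```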
Level of Significance and confidence level
Some events occur commonly, and they may have decreased significance; whereas some events or outcomes occur rarely, and they are of significant nature. The same thing applies in biostatistics. We usually say that the probability of an event, which is denoted by capital ‘P’, is low if it is a rare event. The level of significance, which can be expressed as a percentage, is the threshold at which a rare event is separated from other normal events, and the P-value of a result is compared with it. For example, researchers usually take 5% or 0.05 as the level of significance. A significant result is shown by the sign of ‘less than’, i.e. P < 0.05. So, an event is of significant nature in biostatistics if it has less than 5% probability. The significance level also refers to the probability of incorrectly rejecting a null hypothesis, i.e. the probability of committing a Type I error. Again take a look at the Type I error.
Type I error (α error, False Positive): suppose in reality H0 is true, but we reject it incorrectly because we think that the alternative hypothesis is right (positive).
As the level of significance is related to the α error, it is also known as the α level. So, by changing the level of significance, the chances of a Type I error can also be changed. The remaining level, which is obtained after removing the level of significance, is considered as the level of confidence, which is represented by γ and is obtained by the equation γ = 100% - α level or γ = 1 - α level. Usually, 5% or 0.05 is considered as the level of significance, so 95% or 0.95 is considered as the level of confidence.
Statistical Estimation
Interval Estimation It refers to the use of sample data for the calculation of an interval of most probable values (two values, i.e. range of values) of an unknown population parameter.
Using Minitab
The “Calculator” function could be used to work on the interval estimation.
Best Point estimation
Point estimation is the use of sample data for the calculation of a single value, which is also referred to as a statistic. This single value is considered as a “best estimate” or “best guess” of a population parameter that is not known. For example, (1) sample variance is a point estimate of population variance, and (2) sample mean is a point estimate of population mean.
Using Minitab
Sample mean and sample variance can be calculated using descriptive statistics.
Correlation
Correlation refers to systematic changes in the amount of one variable in relation to systematic changes in the amount of another variable. The correlation coefficient is represented by ‘r’ and it ranges from -1 through 0 to +1. In case of -1, the relationship is completely negative, i.e. with an increase in one variable, the second variable decreases, whereas in case of +1, which is the highest value, the relationship is completely positive, i.e. with an increase in one variable, the second variable also increases. Those two variables can be plotted on the x-axis and y-axis of a graph, and their correlation can be checked with the help of the line in the plot. (Correlation coefficient is further explained later in the book.)
Central Limit Theorem
According to the Central Limit Theorem, the shape of the sampling distribution of the mean becomes almost normal as the sample size increases (as a rule of thumb, n≥30), irrespective of the distribution in the population.
Standard Error of the Mean
Standard Error of Mean is represented by the symbol for standard deviation, i.e. σ, along with a subscript M representing mean. So, σM is the symbol for Standard Error of Mean. It can be calculated by the formula:

$\sigma_M = \dfrac{\sigma}{\sqrt{n}}$

where σ is the standard deviation and n is the number of observations.
It is helpful in knowing about the whole population, i.e. the smaller the standard error of mean, the more accurately the sample represents the entire population. Standard error of mean also helps in working on confidence intervals.
Using Minitab
In order to calculate standard error of mean, we can use the same example as noted above. So, in the first column, i.e. C1, enter the value 14 in the first row, 36 in the second row, 45 in the third row, 70 in the fourth row, and 105 in the fifth row.
Now go to “Stat”, go to “Basic Statistics”, and click “Display Descriptive Statistics”.
Move the column containing the variable for which standard error of mean has to be calculated to “Variables:”. In this case, we have C1.
Click “Statistics” and check “SE of mean” and/or any other descriptive statistics you want, such as “Mean” and “Standard deviation”. Click “OK.”
Now click “OK” again. This gives us the following results:
Statistics

Variable    Mean    SE Mean    StDev
C1          54.0    15.6       34.9
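The Minitab output can be cross-checked by applying the formula for the standard error of the mean to the five values. This is a minimal sketch that uses the sample standard deviation, i.e. the one with n - 1 in the denominator; the variable names are illustrative only.

```python
from math import sqrt
from statistics import mean, stdev

data = [14, 36, 45, 70, 105]  # values entered in column C1

n = len(data)
sd = stdev(data)          # sample standard deviation (n - 1 in the denominator)
se_mean = sd / sqrt(n)    # standard error of the mean

print(round(mean(data), 1), round(se_mean, 1), round(sd, 1))
```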
Statistical Significance
Tests for Non-normally distributed data Non-parametric testing can be applied to ranked, ordinal, or continuous outcome variables, and for those variables that do not follow normal distribution, i.e. non-normally distributed data. For example, severity of pain is an ordinal outcome and can be ordered from no pain to severe pain to agonizing pain on a scale from 1 to 12.
One tail and two tail tests
Mood’s Median Test
It is a nonparametric test to determine whether the medians of two or more samples, populations, or groups are equal or not. Suppose we are working on a disease and the pain associated with the disease, and we have worked on two different medicines, i.e. A and B. After giving the treatments, we assessed the quality of life and pain through the Quality-of-Life Scale and an overall ranking of pain, respectively. The scores for quality of life and pain are then combined into an overall score out of 10. Higher scores show improvement, while lower scores show no effect of treatment.
Overall score out of 10   Medicines
7                         A
2                         A
5                         A
4                         A
6                         B
8                         B
6                         B
9                         B
Calculate the overall median of the 8 values; in this case it is “6”. Now, for each medicine, count the number of observations that are equal to or less than the overall median (≤6) and greater than the overall median (>6). These values are placed in a 2×2 contingency table.
           A   B
>median    1   2
≤median    3   2
Now calculate the chi-square by using the equation χ² = Σ((O - E)² / E), where O shows the observed values, E shows the expected values, and the sum is taken over all four cells of the table. The expected value of a cell is (row total × column total) / grand total. As both medicines have 4 observations each, this is simply the average of the two observed values in the row:
           A     B
>median    1.5   1.5
≤median    2.5   2.5
Now the contributions of the cells to the chi-square are as follows:
A, >median: (O - E)² / E = (1 – 1.5)² / 1.5 = 0.167
A, ≤median: (O - E)² / E = (3 – 2.5)² / 2.5 = 0.1
B, >median: (O - E)² / E = (2 – 1.5)² / 1.5 = 0.167
B, ≤median: (O - E)² / E = (2 – 2.5)² / 2.5 = 0.1
Total χ² = 0.167 + 0.1 + 0.167 + 0.1 = 0.534
Now we will check the degrees of freedom (df). For χ², we will calculate df by the following formula, df = k - 1 = 2 - 1 = 1, where k is the number of groups. So, we have df = 1.
Now we will compare our calculated χ² value, i.e. 0.534, with the χ² value in a table of χ² with 1 df. If the calculated value of the chi-square test is more than the value in the chi-square table, the null hypothesis will be rejected. The χ² value at α = 0.05 and df = 1 is 3.841. Our observed χ² value is 0.534, which is less than 3.841. This shows that the difference between the medians of medicine A and medicine B is not statistically significant.
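The hand calculation above can be sketched in Python as a cross-check. This is a minimal illustration using only the standard library, not the procedure Minitab itself runs; the variable names are chosen here for illustration. The p-value line relies on the fact that, for 1 degree of freedom only, the upper-tail probability of the chi-square distribution equals erfc(√(x/2)).

```python
import math
import statistics

# Overall scores out of 10 for the two medicines in the example above.
scores = {"A": [7, 2, 5, 4], "B": [6, 8, 6, 9]}

all_values = [v for group in scores.values() for v in group]
overall_median = statistics.median(all_values)

# Observed counts: first row is > median, second row is <= median.
above = [sum(v > overall_median for v in g) for g in scores.values()]
below = [sum(v <= overall_median for v in g) for g in scores.values()]
observed = [above, below]

# Expected count of a cell = row total * column total / grand total.
grand_total = len(all_values)
col_totals = [len(g) for g in scores.values()]
chi_square = 0.0
for row in observed:
    row_total = sum(row)
    for o, col_total in zip(row, col_totals):
        e = row_total * col_total / grand_total
        chi_square += (o - e) ** 2 / e

df = len(scores) - 1
# Valid for df = 1 only: chi-square upper-tail probability.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(overall_median, observed, round(chi_square, 2), df, round(p_value, 3))
```

Without intermediate rounding the statistic is about 0.53, slightly below the 0.534 obtained from the rounded terms, and the conclusion is unchanged.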
Mood’s median test (Part 1)
A nonparametric test. It is used to determine whether the medians of two or more samples or populations are equal or not.
1. Determination of hypothesis
Null Hypothesis – H0: Median of all groups is same
Alternative Hypothesis – H1: Median of at least one of the groups is different
2. The overall median of the data is calculated
3. For each group, count the number of observations that are
Equal to or less than the overall median (≤median)
Greater than the overall median (>median)
4. The counts from step 3 are placed into a 2×k contingency table
Mood’s median test (Part 2)
5. Calculate a Chi-square statistic
Formula:
$$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where O_ij = observed cases in row i at column j, and E_ij = expected cases in row i at column j
6. Determination of degree of freedom (DF)
DF = k - 1, where k = number of levels of categorical variables, or (simply) categories, or samples, or populations, etc.
7. Comparison of values
Compare the χ² test statistic (i.e. calculated χ² value) with the χ² value in a table of χ² at the determined DF
8. Interpretation of results
If the calculated value of the χ² test is more than the value in the χ² table, the null hypothesis will be rejected.
Using Minitab
In order to work on Mood’s Median test, we can use the same example as noted above:
Overall score out of 10   Medicines
7                         A
2                         A
5                         A
4                         A
6                         B
8                         B
6                         B
9                         B
Enter the values in columns. In this case, we have “Overall score out of 10” in the first column, i.e. C1, and “Medicines” in C2.
Now go to “Stat”, go to “Nonparametrics”, and click “Mood’s Median Test” in the 3rd section.
In the window that appears (Mood’s Median Test), move C1, i.e. “Overall score out of 10”, to the box with “Response:”, and move C2, i.e. “Medicines”, to the box with “Factor:”. Click “OK.”
The following results appear:
Descriptive Statistics
Medicines   Median   N ≤ Overall Median   N > Overall Median   Q3 – Q1   CI
A           4.5      3                    1                    4.00      (2, 7)
B           7.0      2                    2                    2.75      (6, 9)
Overall     6.0
Levels with < 6 observations have confidence < 95.0%
95.0% CI for median(A) - median(B): (-7,1)
Test
Null hypothesis          H₀: The population medians are all equal
Alternative hypothesis   H₁: The population medians are not all equal
DF   Chi-Square   P-Value
1    0.53         0.465
Our observed χ² value is 0.53 and the p-value is 0.465. This p-value is greater than 0.05, showing that the difference between the medians of medicine A and medicine B is not statistically significant.
Goodness of Fit
A goodness of fit test is used to check how well an observed frequency distribution (or observed set of data) fits a claimed distribution of a population (or an expected outcome). A goodness of fit test is among the most commonly used non-parametric tests. The chi-square test is commonly utilized for goodness of fit tests.
Chi-square test
The chi-square (χ²) test is a very simple non-parametric test. If we consider pain as a variable, this test can be used to show whether there is an association between the treatment given and the reduction in pain.
Testing Assumptions:
Two variables have to be measured as categories, usually at a nominal or ordinal level.
The study groups have to be independent of each other.
Using Minitab
In order to work on the chi-square test, we can use the following data as an example:
Groups   Outcomes                                    Total
         Like Mathematics   Don’t like Mathematics
Boys     40                 10                       50
Girls    20                 30                       50
Total    60                 40                       100
Enter the values in columns. In this case, we have “Like Mathematics” in the first column, i.e. C1, “Don’t like Mathematics” in C2, and “Total” in C3.
Now go to “Stat”, go to “Tables”, and click “Chi-Square Test for Association” in the 2nd section.
In the window that appears (Chi-Square Test for Association), select “Summarized data in a two-way table” from the drop-down menu. Move C1 and C2 to the box under “Columns containing the table:”, and click “OK.”
Results appear, and you may find that the expected values have also been generated below the observed values. In the results, Pearson Chi-Square is 16.667 with 2 degrees of freedom (DF) and the p-value is 0.000. (A DF of 2 is obtained when the “Total” row is entered as a third row of the table; for the 2×2 table of Boys and Girls alone, DF = (2 - 1)(2 - 1) = 1, and the chi-square value is the same.) This p-value is less than 0.05, i.e. p < 0.05.
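The Pearson chi-square for the Boys and Girls table can be checked by hand with a short Python sketch. It is an illustration only, with variable names chosen here; it works on the 2×2 table without the totals, and the p-value line uses a closed form that is valid only for 1 degree of freedom.

```python
import math

# Observed counts (rows: Boys, Girls; columns: Like, Don't like Mathematics).
observed = [[40, 10], [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
expected = []
for i, row in enumerate(observed):
    expected_row = []
    for j, o in enumerate(row):
        # Expected count = row total * column total / grand total.
        e = row_totals[i] * col_totals[j] / grand_total
        expected_row.append(e)
        chi_square += (o - e) ** 2 / e
    expected.append(expected_row)

df = (len(observed) - 1) * (len(observed[0]) - 1)
# Valid for df = 1 only: chi-square upper-tail probability.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(expected, round(chi_square, 3), df, p_value)
```

The expected counts are 30 and 20 in each row, and the statistic agrees with the Pearson Chi-Square of 16.667 reported by Minitab.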
Sample                            Number < 5   Number = 5   Number > 5   P-Value
Ranking of pain before treatmen   2            0            6            0.289
Ranking of pain after 1 week of   3            3            2            1.000
C3                                0            1            0            1.000
According to the results, the p-value for C1 (Ranking of pain before treatment) is 0.289. This p-value is more than the alpha level of 0.05, so we cannot reject the null hypothesis, i.e. the results are not statistically significant. Even if we use the median of “Ranking of pain before treatment”, i.e. 7, as the test median, the p-value for C2 is 0.1250, which is again more than 0.05, so we cannot reject the null hypothesis, and the results are not statistically significant.
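These p-values can be reproduced with an exact two-sided sign test. The sketch below is an illustration using only the standard library; the function name `sign_test_p_value` is hypothetical, and the data are assumed to be the pain rankings of the 8 participants that are also used for the Wilcoxon Signed Rank Test in this book.

```python
from math import comb

def sign_test_p_value(values, test_median):
    """Two-sided exact sign test p-value against a hypothesized median."""
    below = sum(v < test_median for v in values)
    above = sum(v > test_median for v in values)
    n = below + above  # values equal to the test median are dropped
    k = min(below, above)
    # Probability of k or fewer of the rarer sign under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

before = [9, 7, 4, 7, 8, 8, 6, 3]   # ranking of pain before treatment
after = [8, 5, 5, 4, 2, 7, 4, 5]    # ranking of pain after 1 week of treatment

p_before_5 = sign_test_p_value(before, 5)
p_after_5 = sign_test_p_value(after, 5)
p_after_7 = sign_test_p_value(after, 7)
print(round(p_before_5, 3), round(p_after_5, 3), round(p_after_7, 4))
```

With a test median of 5 the two columns give about 0.289 and 1.000, and with a test median of 7 the second column gives 0.125, in line with the values discussed above.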
Wilcoxon Signed Rank Test
It is a nonparametric test.
Testing Assumptions:
There are two variables, in which one is a dependent variable that is measured on an ordinal or continuous level, and the other is an independent variable that has two groups, or pairs (as you can see in the example of treatment of pain).
The Wilcoxon Signed Rank Test can also be performed on the same data (as mentioned in The Sign Test). We will rank the difference scores from 1 through 8 after ordering them by their absolute values. In case of the same absolute values of the difference scores, we will assign the mean rank. Then we will attach the positive or negative signs of the observed differences to the ranks. So, we can get the following table:
Difference, i.e. Before treatment – After treatment   Ordered absolute values of differences   Ranks or Mean Ranks   Signed ranks
1                                                      1                                        (1+2+3)/3 = 2         2
2                                                      -1                                       (1+2+3)/3 = 2         -2
-1                                                     1                                        (1+2+3)/3 = 2         2
3                                                      2                                        (4+5+6)/3 = 5         5
6                                                      -2                                       (4+5+6)/3 = 5         -5
1                                                      2                                        (4+5+6)/3 = 5         5
2                                                      3                                        7                     7
-2                                                     6                                        8                     8
We will work with the same hypotheses as in The Sign Test. However, capital W is the test statistic for the Wilcoxon Signed Rank Test. W is the smaller of the sum of the positive ranks, represented by W+, and the sum of the negative ranks, represented by W-. The null hypothesis is not rejected if W+ is similar to W-, whereas the null hypothesis is rejected if W+ and W- differ greatly in value. From the data, we have found that W+ = 29 and W- = 7. The sum of the ranks must always be equal to n(n+1)/2. So, 8(8+1)/2 = 72/2 = 36, which is equal to 29 + 7 = 36. Our test statistic is 7, i.e. W = 7, which is the smaller of 29 and 7.
Now we will check the table of critical values of W by considering the sample size, i.e. n = 8, and the level of significance of 5%, i.e. α = 0.05. If the observed value, i.e. 7, is less than or equal to the critical value, the null hypothesis is rejected, whereas if the observed value is greater than the critical value, the null hypothesis is not rejected. In the table, the critical value of W in a two-tailed test is 4, which is smaller than 7. This shows that the null hypothesis cannot be rejected.
In short, in the Wilcoxon Signed Rank Test, the test statistic is W. The null hypothesis is that the median difference is zero, or W+ is equal to W-. W is equal to the smaller of W+ and W-. In this case, the calculated value of W is 7, which is more than the critical value of 4. So, the null hypothesis cannot be rejected, i.e. there is no statistically significant evidence that the new treatment is working.
Using Minitab
In order to work on the Wilcoxon Signed Rank Test, we can use the following data as an example:
Participants   Ranking of pain before treatment   Ranking of pain after 1 week of treatment
1              9                                  8
2              7                                  5
3              4                                  5
4              7                                  4
5              8                                  2
6              8                                  7
7              6                                  4
8              3                                  5
Enter the values in the columns. In this case, we enter the values of “Ranking of pain before treatment” in the first column, i.e. C1, and “Ranking of pain after 1 week of treatment” in the second column, i.e. C2.
Now go to “Calc” and click “Calculator”.
In the box with “Store result in variable:” enter C3 (i.e. third column). In the box under “Expression:” enter C1-C2, and click “OK.” This generates another column, i.e. C3, as shown below:
Ranking of pain before treatment   Ranking of pain after 1 week of treatment   C3
9                                  8                                           1
7                                  5                                           2
4                                  5                                           -1
7                                  4                                           3
8                                  2                                           6
8                                  7                                           1
6                                  4                                           2
3                                  5                                           -2
Now go to “Stat”, go to “Nonparametrics”, and click “1-Sample Wilcoxon”.
Move the third column, i.e. C3, from the left box to the box under “Variables:”.
Select “Test Median:” and enter “0.0” in the box. Select “not equal” from the drop-down menu with “Alternative:”. Click “OK.”
This gives the results in which the p-value is 0.141.
Test
Null hypothesis          H₀: η = 0
Alternative hypothesis   H₁: η ≠ 0
Sample   N for Test   Wilcoxon Statistic   P-Value
C3       8            29.00                0.141
This p-value is more than 0.05, so the results are statistically not significant, and null hypothesis could not be rejected.
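The rank sums W+ and W- and an approximate p-value can be cross-checked with the following Python sketch. It is an illustration only: the helper `mean_rank` is hypothetical, and the p-value uses a normal approximation with a continuity correction and without a tie correction. That choice is an assumption made here because it comes close to the p-value of 0.141 in the output; it is not taken from Minitab's documentation.

```python
import math

before = [9, 7, 4, 7, 8, 8, 6, 3]
after = [8, 5, 5, 4, 2, 7, 4, 5]

# Differences (before - after); zero differences would be dropped.
diffs = [b - a for b, a in zip(before, after) if b != a]
n = len(diffs)

# Rank the absolute differences, giving tied values their mean rank.
abs_sorted = sorted(abs(d) for d in diffs)

def mean_rank(value):
    first = abs_sorted.index(value) + 1   # first 1-based position of this value
    count = abs_sorted.count(value)
    return first + (count - 1) / 2

w_plus = sum(mean_rank(abs(d)) for d in diffs if d > 0)
w_minus = sum(mean_rank(abs(d)) for d in diffs if d < 0)
w = min(w_plus, w_minus)

# Normal approximation with continuity correction (assumption, see lead-in).
mean_w = n * (n + 1) / 4
sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (abs(w_plus - mean_w) - 0.5) / sd_w
p_value = math.erfc(z / math.sqrt(2))   # two-sided

print(w_plus, w_minus, w, round(p_value, 3))
```

Note that the Wilcoxon Statistic in the Minitab output, 29.00, corresponds to W+, while the hand calculation uses the smaller rank sum, W = 7.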
The Kruskal-Wallis Test
It is a nonparametric test.
Testing Assumptions:
There are two variables, in which one is a dependent variable that is measured on an ordinal or continuous level, and the other is an independent variable that has two or more categorical groups.
Observations are independent of each other, i.e. they are not related to each other.
Suppose some people are feeling pain due to some disease, and we want to know the effect of different concentrations of a new drug on the treatment of pain. We will take 12 participants (samples) from those people and check the efficacy of the new drug and its different concentrations in reducing the pain. The concentrations of the new drug include 5%, 10% and 15% of drug in the solution, and the response of the people will be taken in the form of a ranking of pain on a scale from 1 to 12, in which 1 shows the least pain and 12 shows the highest level of pain. Suppose we give the 5% solution to 3 people (i.e. n1), the 10% solution to 5 people (i.e. n2), and the 15% solution to 4 people (i.e. n3). In this situation, we will perform the Kruskal-Wallis Test, as the sample sizes are small and unequal, i.e. n1 = 3, n2 = 5, and n3 = 4. Moreover, the data are not normally distributed. The Kruskal-Wallis Test is useful as it helps in comparing the outcomes of more than two independent groups. Our null hypothesis is that the population medians of all three groups are equal, whereas the alternative hypothesis is that the population medians of the groups are not all equal, at the 5% level of significance. Suppose the following responses are obtained:
5% solution   10% solution   15% solution
7             6              5
6             5              2
9             7              3
              8              4
              5
This table shows that the 15% solution of the drug is more helpful in reducing pain than the 5% solution. However, it is important to check whether this observed difference is statistically significant or not. For the Kruskal-Wallis Test, we have to order the data of the 12 subjects from the smallest value to the largest value while keeping track of the group assignments in the total sample, and then assign ranks (tied values receive the mean of their ranks).
First we will check whether the total value of ranks, i.e. 29+38+11=78, is equal to n(n+1)/2 or not. So, n(n+1)/2 = 12(13)/2 = 78. So, these are equal. The test statistic for the Kruskal-Wallis test is represented by H and can be calculated by using the equation,
$$H = \frac{12}{n(n+1)} \sum_{j=1}^{3} \frac{R_j^2}{n_j} - 3(n+1)$$
where n is the total number of subjects or samples, i.e. 12; R1 is the sum of the ranks in the first group, i.e. 29; R2 is the sum of the ranks in the second group, i.e. 38; R3 is the sum of the ranks in the third group, i.e. 11; n1 is the sample size of the first group, i.e. 3; n2 is the sample size of the second group, i.e. 5, and n3 is the sample size of the third group, i.e. 4. Substituting these values,
$$H = \frac{12}{12(13)} \left( \frac{29^2}{3} + \frac{38^2}{5} + \frac{11^2}{4} \right) - 3(13) \approx 7.11$$
Now we will check whether this test statistic, i.e. H ≈ 7.1, is in favor of the null hypothesis or rejects it. So, we will compare it with the critical value of H. If the observed value is more than or equal to the critical value, the null hypothesis will be rejected, whereas if it is less than the critical value, the null hypothesis will not be rejected. From the table, the critical value with sample sizes of 3, 5, and 4, and α = 0.05 is 5.656. Our observed H value is about 7.1, which is greater than the critical value, so we can reject the null hypothesis. It can also be said that the test is statistically significant as the null hypothesis is rejected. This suggests that the concentration of the drug makes a difference in reducing the pain.
In short, in the Kruskal-Wallis Test, the test statistic is H. The null hypothesis is that the medians of all the populations are equal. Our calculated value of H is about 7.1, which is more than the critical value of 5.656. In this case, the null hypothesis is rejected. It is not rejected only when the calculated H value is less than the critical value.
Using Minitab
In order to work on the Kruskal-Wallis Test, we can use the following data as an example:
5% solution   10% solution   15% solution
7             6              5
6             5              2
9             7              3
              8              4
              5
These are three groups. So, we have to assign them numbers, i.e. the group of 5% solution is considered as “1”, the group of 10% solution is considered as “2”, and the group of 15% solution is considered as “3”. Then we place these assigned numbers in the first column according to the number of results. In this case, we place number “1” in the first column in the first three rows; number “2” in the next five rows, i.e. from the 4th to the 8th row, and number “3” in the next four rows, i.e. from the 9th to the 12th row. In the next column, we place the results obtained according to the assigned numbers (to groups). So, in the second column in the first three rows, we put 7, 6, and 9; in the next five rows, we put 6, 5, 7, 8, and 5, and in the next four rows, we put 5, 2, 3, and 4. We get the following table:
1   7
1   6
1   9
2   6
2   5
2   7
2   8
2   5
3   5
3   2
3   3
3   4
Now go to “Stat”, go to “Nonparametrics”, and click “Kruskal-Wallis”.
In the box with “Response:” move the column containing responses, i.e. C2 in this case, and in the box with “Factor:” move the column containing groups, i.e. C1 in this case. Click “OK.”
This gives the Kruskal-Wallis H, which is equal to 7.26 (adjusted for ties) in this case, and the p-value, which is 0.027 in this case.
Test
Null hypothesis          H₀: All medians are equal
Alternative hypothesis   H₁: At least one median is different
Method                  DF   H-Value   P-Value
Not adjusted for ties   2    7.11      0.029
Adjusted for ties       2    7.26      0.027
The chi-square approximation may not be accurate when some sample sizes are less than 5.
This p-value is less than 0.05, so the results are statistically significant.
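Both H-values and their p-values can be reproduced with a short Python sketch. It is an illustration using only the standard library; the helper `mean_rank` is hypothetical, and the tie adjustment shown (dividing H by 1 - Σ(t³ - t)/(n³ - n), where t is the size of each set of tied values) is the usual textbook correction, assumed here to be what the “Adjusted for ties” row reports. For 2 degrees of freedom, the chi-square upper-tail probability is exactly exp(-x/2).

```python
import math

groups = {
    "5% solution": [7, 6, 9],
    "10% solution": [6, 5, 7, 8, 5],
    "15% solution": [5, 2, 3, 4],
}

pooled = sorted(v for g in groups.values() for v in g)
n = len(pooled)

def mean_rank(value):
    """Mean of the 1-based positions that value occupies in the pooled, sorted data."""
    first = pooled.index(value) + 1
    return first + (pooled.count(value) - 1) / 2

rank_sums = {name: sum(mean_rank(v) for v in g) for name, g in groups.items()}

h = (12 / (n * (n + 1))
     * sum(rank_sums[name] ** 2 / len(g) for name, g in groups.items())
     - 3 * (n + 1))

# Tie correction (assumption, see lead-in).
tie_term = sum(pooled.count(v) ** 3 - pooled.count(v) for v in set(pooled))
h_adjusted = h / (1 - tie_term / (n ** 3 - n))

df = len(groups) - 1
# Valid for df = 2 only: chi-square upper-tail probability.
p_value = math.exp(-h / 2)
p_value_adjusted = math.exp(-h_adjusted / 2)

print(rank_sums, round(h, 2), round(h_adjusted, 2),
      round(p_value, 3), round(p_value_adjusted, 3))
```

The rank sums come out as 29, 38, and 11, which are the values of R1, R2, and R3 used in the hand calculation.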
Degrees of Freedom
The number of values or scores in a data set (or distribution) that are free to vary while considering the final calculation.
The Friedman Test
It is a nonparametric test that is considered as an alternative to Two Way ANOVA.
Testing Assumptions:
Data, especially related to the dependent variable, have to be measured at the continuous or ordinal level.
Data are related to a single group measured on a minimum of three different occasions.
The group is a random sample from the selected population.
Suppose there is a treatment for pain (in some disease), and we want to know whether a 15% solution of that treatment works equally on pain-related outcomes in all age groups. We give the new treatment to four different groups and get their responses. Further suppose that the four groups are A, B, C, and D. Group A has 6 participants of 15 years of age. Group B has 6 participants of 25 years of age. Group C has 6 participants of 35 years of age. Group D has 6 participants of 45 years of age. This difference in age is helpful in knowing the effect of the new treatment on samples of different ages. In this case, the Friedman test can help, as it is used to compare three or more groups, and all the groups have the same sample size. Our null hypothesis is that the median rankings of the four groups are equal, while the alternative hypothesis is that the median ranking of at least one of the groups is different from the median ranking of at least one of the other groups at α = 0.05. After giving the treatment to the participants (blocks of raters), suppose we get the following results:
Now, we need to convert the data to ranks within blocks.
These Rank Totals show that there are differences in the rankings of pain of different age groups. So, we have to test the statistical significance of these results. First, we will check the rankings in the Friedman Test: the sum of the rank totals must be equal to
$$\frac{r\,c\,(c+1)}{2}$$
where r = number of participants in every group = 6, and c = number of groups = 4. So, we have
$$\frac{6 \times 4 \times (4+1)}{2} = 60$$
For the Friedman test, we have the following equation,
$$F = \frac{12}{r\,c\,(c+1)} \sum_{j=1}^{c} R_j^2 - 3\,r\,(c+1)$$
where R_j is the rank total of group j.
Now, we have to check whether the calculated value is greater than, equal to, or smaller than the critical value of the chi-square distribution with c - 1 = 3 degrees of freedom and α = 0.05. The critical value in the table is 7.815, and our calculated value is 13.35, which is greater than the critical value. Therefore, we can reject the null hypothesis and conclude that there are significant differences in the median rankings of participants from different age groups.
In short, in the Friedman test, the test statistic is F. The null hypothesis is that the median rankings of all the groups are equal. Our calculated value of F is 13.35, which is more than the critical value of 7.815. In this case, we can reject the null hypothesis. It is not rejected only when the calculated F value is less than the critical value.
Using Minitab
In order to work on the Friedman Test, we can work on the following data:
Enter the data in three columns, namely “Groups”, “Raters”, and “Responses”. We assign numbers to the Groups. So, in the column of Groups, i.e. C1 in this case, enter 1 (for A) in the first six rows; 2 (for B) in the next six rows; 3 (for C) in the next six rows, and 4 (for D) in the next six rows. In the column of Raters, i.e. C2 in this case, enter 1, 2, 3, 4, 5, 6 in the first six rows, and repeat these numbers in the next six rows. Overall, repeat this four times, i.e. up to row number 24. In the column of Responses, i.e. C3 in this case, enter the respective responses. We get the following table:
Groups   Raters   Responses
1        1        8
1        2        9
1        3        7
1        4        6
1        5        9
1        6        8
2        1        7
2        2        8
2        3        5
2        4        7
2        5        6
2        6        7
3        1        6
3        2        7
3        3        6
3        4        5
3        5        4
3        6        8
4        1        5
4        2        6
4        3        4
4        4        3
4        5        4
4        6        5
Now go to “Stat”, go to “Nonparametrics”, and click “Friedman”.
Move the columns to their respective boxes on the right side. In this case, we move C3 to “Response:”, C1 to “Treatment:” and C2 to “Blocks.” Click “OK.”
This gives the results in which Chi-Square (Not adjusted for ties) is 13.35, and the p-value is 0.004.
Test
Null hypothesis          H₀: All treatment effects are zero
Alternative hypothesis   H₁: Not all treatment effects are zero
Method                  DF   Chi-Square   P-Value
Not adjusted for ties   3    13.35        0.004
Adjusted for ties       3    13.81        0.003
This p-value is less than 0.05, so the results are statistically significant.
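The ranking within blocks, the rank totals, and both chi-square values can be reproduced with the Python sketch below. It is an illustration using only the standard library; the variable names and the helper `chi2_sf_df3` are hypothetical. The tie adjustment (dividing by 1 - Σ(t³ - t)/(r·c·(c² - 1)), with t the size of each set of tied values within a block) is the usual textbook correction and is assumed here to match the “Adjusted for ties” row. The p-value helper uses a closed form that holds for 3 degrees of freedom only.

```python
import math

# Responses of raters 1-6 in each group, as entered in the worksheet above.
responses = {
    "A": [8, 9, 7, 6, 9, 8],
    "B": [7, 8, 5, 7, 6, 7],
    "C": [6, 7, 6, 5, 4, 8],
    "D": [5, 6, 4, 3, 4, 5],
}
names = list(responses)
r = len(responses["A"])   # number of blocks (raters)
c = len(names)            # number of groups (treatments)

rank_totals = {name: 0.0 for name in names}
tie_term = 0
for block in range(r):
    row = [responses[name][block] for name in names]
    ordered = sorted(row)
    for name, value in zip(names, row):
        # Mean rank of this value within its block.
        rank_totals[name] += ordered.index(value) + 1 + (ordered.count(value) - 1) / 2
    tie_term += sum(ordered.count(v) ** 3 - ordered.count(v) for v in set(ordered))

statistic = (12 / (r * c * (c + 1)) * sum(t ** 2 for t in rank_totals.values())
             - 3 * r * (c + 1))
statistic_adjusted = statistic / (1 - tie_term / (r * c * (c ** 2 - 1)))

def chi2_sf_df3(x):
    """Upper-tail probability of the chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

print(rank_totals, round(statistic, 2), round(statistic_adjusted, 2))
print(round(chi2_sf_df3(statistic), 3), round(chi2_sf_df3(statistic_adjusted), 3))
```

The rank totals sum to 60, which is the value of r·c·(c + 1)/2 for r = 6 and c = 4.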
Tests for Normally distributed data
Unpaired “t” test, Paired “t” test, One Way ANOVA, and Two Way ANOVA can be used to test normally distributed data.
Unpaired “t” test
Testing Assumptions:
There is a dependent continuous variable and an independent categorical variable with two groups, or levels.
The dependent variable follows a normal distribution, or at least an approximately normal distribution.
The variances (or standard deviations) of the two groups are equal, i.e. homogeneity of variance.
N.B.: If these assumptions are not met, you may consider using non-parametric tests.
Suppose we are working on a disease and the pain associated with the disease, and we have developed two groups of 8 participants each. One group receives the new treatment, while the other group receives placebo. After receiving the treatments, those participants are assessed for quality of life and pain through the Quality-of-Life Scale and an overall ranking of pain, respectively. The scores for quality of life and pain are then combined into an overall score out of 10. Higher scores show improvement, while lower scores show no effect of treatment.
Overall scores out of 10
For Treatment Group   For Placebo Group
6                     4
3                     4
4                     1
5                     3
5                     5
9                     6
7                     2
8                     7
In this case, we will use Student’s t-test as the sample size is less than 30, i.e. the overall sample size is equal to 16. Moreover, we will use the unpaired or independent t-test as the comparison is made on the outcomes in two different groups. Now we have to perform some calculations on these results as follows:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}, \qquad s^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}$$
where s² is the pooled sample variance, x̄1 and x̄2 are the means of the two groups, s1² and s2² are their sample variances, and n1 and n2 are their sample sizes. Proceeding with this data, we have
$$\bar{x}_1 = 5.875, \quad \bar{x}_2 = 4.0, \quad s_1^2 = 4.125, \quad s_2^2 = 4.0$$
$$s^2 = \frac{7(4.125) + 7(4.0)}{14} = 4.0625, \qquad t = \frac{5.875 - 4.0}{\sqrt{4.0625 \left( \frac{1}{8} + \frac{1}{8} \right)}} \approx 1.86$$
Now, we have to look at the t table with a two-tailed test as we are not sure about the efficacy of the new treatment on human beings. We will check the table at degrees of freedom = number of samples – 2 = 14 and at α = 0.05.
Table 3: t table (Source: Anonymous)
In the table, the critical value is 2.145, but our calculated value is 1.86, which is less than the critical value of the table. This shows that the difference between the two groups is not statistically significant.
Using Minitab
In order to work on the unpaired t-test, we can use the following data as an example:
Overall scores out of 10
For Treatment Group   For Placebo Group
6                     4
3                     4
4                     1
5                     3
5                     5
9                     6
7                     2
8                     7
Put these values in the first two columns, i.e. C1 showing “For Treatment Group”, and C2 showing “For Placebo Group.”
Normality test
First, we have to work on the testing assumptions. In order to check the normality of the data, we can use the Kolmogorov-Smirnov Test (KS-Test). For this test, open a new Worksheet, and combine the values of both groups in a single column, i.e. C1, as the group is the independent variable and the values are the dependent variable.
Now go to “Stat” tab above, go to “Basic Statistics”, and click “Normality Test”. Move C1 to the box with “Variable:”. Select “Kolmogorov-Smirnov” under “Tests for Normality”, and click “OK.”
The following graph and results are obtained:
The data points are comparatively close to the fitted normal distribution line. The p-value (more than 0.150) is more than 0.05. So, it can be said that the null hypothesis could not be rejected, and the data follow a normal distribution.
Homogeneity of variance
In order to test the homogeneity of variance, put these values in the first two columns, i.e. C1 showing “For Treatment Group”, and C2 showing “For Placebo Group.”
Now go to “Stat” tab above, go to “ANOVA”, and click “Test for Equal Variances”. From the drop-down menu, select “Response data are in a separate column for each factor level”. In the box under “Responses:” move both the columns, i.e. C1 and C2. Click “OK.”
The following results are obtained:
Test
Method                 Statistic   P-Value
Multiple comparisons   0.00        0.961
Levene                 0.05        0.833
The analysis of homogeneity of variances shows that the findings meet the assumption of homogeneity of variance, as the p-value is more than 0.05, and there is no statistically significant difference between the variances of the groups. So, now we can conduct the “t” test after looking at the other assumptions.
“t” test
Now go to “Stat”, go to “Basic Statistics”, and click “2-Sample t”.
Select “Each sample is in its own column” in the window that appears (Two-Sample t for the Mean), move the first column, i.e. C1 in this case, into the box with “Sample 1:” and move the second column, i.e. C2 in this case, into the box with “Sample 2:”. Click “OK.”
This gives the t-test results.
Test
Null hypothesis          H₀: μ₁ - µ₂ = 0
Alternative hypothesis   H₁: μ₁ - µ₂ ≠ 0
T-Value   DF   P-Value
1.86      13   0.086
In this case, we get a T-Value of 1.86 and a P-Value of 0.086, which is more than 0.05, also showing that the results are not statistically significant.
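The t-value can be cross-checked with a short Python sketch that uses only the standard library. It is an illustration with variable names chosen here. It computes both the pooled-variance t statistic (14 degrees of freedom, as in the hand calculation) and the Welch statistic with its approximate degrees of freedom. The DF of 13 in the Minitab output is consistent with the Welch degrees of freedom rounded down; that Minitab does not assume equal variances unless asked to is an assumption of this sketch.

```python
import math
import statistics

treatment = [6, 3, 4, 5, 5, 9, 7, 8]
placebo = [4, 4, 1, 3, 5, 6, 2, 7]

n1, n2 = len(treatment), len(placebo)
mean1, mean2 = statistics.mean(treatment), statistics.mean(placebo)
# Sample variances (divide by n - 1).
var1, var2 = statistics.variance(treatment), statistics.variance(placebo)

# Pooled-variance (Student) t statistic with n1 + n2 - 2 degrees of freedom.
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t_pooled = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
df_pooled = n1 + n2 - 2

# Welch (unequal-variance) t statistic and its approximate degrees of freedom.
se1, se2 = var1 / n1, var2 / n2
t_welch = (mean1 - mean2) / math.sqrt(se1 + se2)
df_welch = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))

print(round(t_pooled, 2), df_pooled, round(t_welch, 2), math.floor(df_welch))
```

Because the two groups have the same size, both versions give the same t-value of about 1.86, which is below the critical value of 2.145.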
Paired “t” test
Testing Assumptions:
There is a dependent continuous variable and an independent variable with two groups, or pairs.
The observations (pairs) are independent of each other.
The dependent variable shows an approximately normal distribution.
There must not be any significant outlier in the dependent variable.
N.B.: If these assumptions are not fulfilled, you may consider using non-parametric tests.
Suppose we are working on the same treatment for the disease as mentioned above on 20 other participants. This time, we are checking the efficacy of the new treatment on 20 participants, i.e. n = 20, before the start of treatment and after one month of starting the treatment. After working on the participants, suppose we have obtained the following data, showing the combined value of the quality-of-life and pain ranking out of 10. Higher scores show improvement, while lower scores show no effect of treatment.
Overall score out of 10
Participants   Before treatment   After treatment
1              3                  8
2              1                  6
3              5                  7
4              6                  9
5              1                  10
6              1                  7
7              1                  5
8              3                  6
9              0                  4
10             5                  8
11             5                  7
12             5                  9
13             3                  6
14             2                  8
15             6                  6
16             2                  7
17             2                  9
18             4                  8
19             1                  6
20             7                  5
From this data, we develop another table:
Overall score out of 10
Participants   Before treatment   After treatment   Difference
1              3                  8                 5
2              1                  6                 5
3              5                  7                 2
4              6                  9                 3
5              1                  10                9
6              1                  7                 6
7              1                  5                 4
8              3                  6                 3
9              0                  4                 4
10             5                  8                 3
11             5                  7                 2
12             5                  9                 4
13             3                  6                 3
14             2                  8                 6
15             6                  6                 0
16             2                  7                 5
17             2                  9                 7
18             4                  8                 4
19             1                  6                 5
20             7                  5                 -2
After calculating the differences, it can be found that the differences follow an approximately normal distribution and there are almost no extreme outliers. Therefore, the paired t-test can be performed on the data.
The mean difference calculated from this data is equal to 3.9, and the sample standard deviation of the differences is 2.404. Therefore, the standard error of the mean difference is equal to 0.5375 (i.e. standard deviation / square root of total number of participants = 2.404 / square root of 20). In order to calculate the t-statistic, we have, t = mean difference / standard error of the mean difference = 3.9 / 0.5375 = 7.26.
Now we will look at the t table with a two-tailed test. For the table, degrees of freedom = number of samples – 1 = 19, and α = 0.05. The critical value in the t table is 2.093 and our calculated value is 7.26, i.e. larger than the critical value. So, there is strong evidence that the new treatment would work effectively.
Using Minitab
In order to work on the paired t-test, we can use the following data:
Overall score out of 10
Participants   Before treatment   After treatment
1              3                  8
2              1                  6
3              5                  7
4              6                  9
5              1                  10
6              1                  7
7              1                  5
8              3                  6
9              0                  4
10             5                  8
11             5                  7
12             5                  9
13             3                  6
14             2                  8
15             6                  6
16             2                  7
17             2                  9
18             4                  8
19             1                  6
20             7                  5
Enter the values in the columns. In this case, the values of “Before treatment” are entered in the first column, i.e. C1, and the values of “After treatment” are entered in the second column, i.e. C2.
Normality test
First, we have to work on the testing assumptions. In order to check the normality of the data, we can use the Kolmogorov-Smirnov Test (KS-Test). For this test, open a new Worksheet, and combine the values of the pairs or groups in a single column, i.e. C1, as the group (or pair) is the independent variable and the values are the dependent variable.
Now go to “Stat” tab above, go to “Basic Statistics”, and click “Normality Test”. Move C1 to the box with “Variable:”. Select “Kolmogorov-Smirnov” under “Tests for Normality”, and click “OK.”
The following graph and results are obtained:
The data points lie comparatively close to the fitted normal distribution line. The p-value (0.065) is more than 0.05. So, the null hypothesis of normality cannot be rejected, and the data can be treated as approximately normally distributed.
Outlier Test
Using the same Worksheet, go to the “Stat” tab above, go to “Basic Statistics”, and go to “Outlier Test” in the 6th section. Move the column, i.e. C1, to the box under “Variables:” by selecting it and pressing the “Select” button. Now click “OK.” The following results and graph appear.
Grubbs' Test

| Variable | N | Mean | StDev | Min | Max | G | P |
|---|---|---|---|---|---|---|---|
| C1 | 40 | 5.100 | 2.687 | 0.000 | 10.000 | 1.90 | 1.000 |

* NOTE * No outlier at the 5% level of significance
These results show that there is no significant outlier. So, the paired “t” test can be conducted after checking the other assumptions.
“t” test
Now go to “Stat”, go to “Basic Statistics”, and click “Paired t”. Enter the first column, i.e. C1, in “Sample 2:” and C2 in “Sample 1:”. Click “OK.” We get the results showing a T-value (t-stat) of 7.26 and a p-value of 0.000.
Test
Null hypothesis: H₀: μ_difference = 0
Alternative hypothesis: H₁: μ_difference ≠ 0

| T-Value | P-Value |
|---|---|
| 7.26 | 0.000 |
This p-value is less than 0.05, so we can say that the results are statistically significant. If we look at the mean values, we find that the mean score is higher “After treatment”.

| Sample | N | Mean | StDev | SE Mean |
|---|---|---|---|---|
| After treatment | 20 | 7.050 | 1.572 | 0.352 |
| Before treatment | 20 | 3.150 | 2.084 | 0.466 |
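The paired t-test result can also be cross-checked outside Minitab. The following is a minimal Python sketch that uses only the standard library; the function name `paired_t` is a hypothetical helper introduced here for illustration, and the sketch assumes the sample standard deviation (n – 1 denominator) of the differences, which appears to be what reproduces the Minitab T-value of 7.26.

```python
import math
import statistics

# Overall scores out of 10 for participants 1-20, copied from the data table.
before = [3, 1, 5, 6, 1, 1, 1, 3, 0, 5, 5, 5, 3, 2, 6, 2, 2, 4, 1, 7]
after = [8, 6, 7, 9, 10, 7, 5, 6, 4, 8, 7, 9, 6, 8, 6, 7, 9, 8, 6, 5]


def paired_t(sample_1, sample_2):
    """Return (mean difference, SD of differences, standard error, t)."""
    diffs = [a - b for a, b in zip(sample_1, sample_2)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD, n - 1 in the denominator
    std_error = sd_diff / math.sqrt(len(diffs))
    return mean_diff, sd_diff, std_error, mean_diff / std_error


# "After" is Sample 1 and "Before" is Sample 2, as in the Minitab dialog.
mean_diff, sd_diff, std_error, t_stat = paired_t(after, before)

print(f"mean difference   = {mean_diff:.2f}")
print(f"SD of differences = {sd_diff:.3f}")
print(f"standard error    = {std_error:.3f}")
print(f"t = {t_stat:.2f} with {len(before) - 1} degrees of freedom")
```

The printed t value can then be compared with the critical value from the t table (2.093 for 19 degrees of freedom at α = 0.05, two-tailed).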
Analysis of Variance (ANOVA)
[Figure: Sum of Squares]
[Figure: Residual Sum of Squares]
One way ANOVA
Testing Assumptions:
There is a continuous dependent variable and an independent variable with two or more categorical groups.
The data must represent independent observations.
There must not be any significant outliers.
The dependent variable must be normally distributed.
The dependent variable must have equal variance in each population, i.e. homogeneity of variance.
N.B.: If these assumptions are not fulfilled, you may consider using non-parametric tests.
Suppose we have worked on the same disease and treatment as mentioned above, but now we have divided the participants of the study into four different groups that receive three different concentrations of the treatment or a placebo for three months. Group A receives a placebo and is considered the control group; Group B receives a 5% solution of the new treatment; Group C receives a 15% solution of the new treatment, and Group D receives a 25% solution of the new treatment. Every group has 5 samples or participants. The participants give scores for quality-of-life and pain, and those scores are combined into an overall score out of 10.

| Group A receives placebo | Group B receives 5% solution of treatment | Group C receives 15% solution of treatment | Group D receives 25% solution of treatment |
|---|---|---|---|
| 3 | 4 | 3 | 9 |
| 3 | 6 | 5 | 2 |
| 5 | 5 | 4 | 7 |
| 1 | 3 | 6 | 8 |
| 4 | 4 | 2 | 4 |
These numbers follow an approximately normal distribution, as shown by the following graph:
We will use the Analysis of Variance (ANOVA) procedure, i.e. one way ANOVA, in this case, as we are comparing the means of more than two groups. For the ANOVA, our null hypothesis is that the means of all the groups are equal, and the alternative hypothesis is that at least one group mean differs from the others, tested at α=0.05. The test statistic is the F statistic, where F is equal to the Mean Squares Between Treatments divided by the Mean Squares Error (or Residual). The appropriate critical value is read from the table of probabilities for the F distribution. The first degrees of freedom (df1) are equal to the total number of groups (k) minus one, i.e. df1 = k – 1 = 4 – 1 = 3, and the second degrees of freedom (df2) are equal to the total number of samples in all groups (N) minus the number of groups (k), i.e. df2 = N – k = 20 – 4 = 16. With these degrees of freedom and at α=0.05, the critical value is 3.24. Therefore, we reject the null hypothesis if the observed F value is greater than or equal to 3.24.
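In symbols, the quantities described in this paragraph can be summarised as follows. This is a sketch in standard notation; the abbreviations SSB, SSE, MSB, and MSE (sums of squares and mean squares between treatments and for error) are the ones used in the ANOVA table later in this section.

```latex
F = \frac{\mathrm{MSB}}{\mathrm{MSE}}, \qquad
\mathrm{MSB} = \frac{\mathrm{SSB}}{df_1}, \qquad
\mathrm{MSE} = \frac{\mathrm{SSE}}{df_2}

df_1 = k - 1 = 4 - 1 = 3, \qquad
df_2 = N - k = 20 - 4 = 16

\text{Reject } H_0 \text{ at } \alpha = 0.05 \text{ if } F \ge 3.24
```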
Table 4: F distribution table at alpha level of 0.05 (Source: Anonymous)
Now we will calculate the F statistic. For this calculation, it is important to take the sample mean for each group and then the overall mean on the basis of the total sample.

| | Group A received placebo | Group B received 5% solution of treatment | Group C received 15% solution of treatment | Group D received 25% solution of treatment |
|---|---|---|---|---|
| Number of samples (n) | 5 | 5 | 5 | 5 |
| Group mean | 3.2 | 4.4 | 4 | 6 |
If we consider all N=20 observations, the overall mean is equal to 4.4, i.e. 88/20 = 4.4. Now the Sum of Squares Between Treatments (SSB) is calculated by the following formula, where n is the number of samples in a group, x̄ with a subscript is that group's mean, and x̄ is the overall mean:

$$\text{SSB} = n_A(\bar{x}_A - \bar{x})^2 + n_B(\bar{x}_B - \bar{x})^2 + n_C(\bar{x}_C - \bar{x})^2 + n_D(\bar{x}_D - \bar{x})^2$$

So,

$$\text{SSB} = 5(3.2 - 4.4)^2 + 5(4.4 - 4.4)^2 + 5(4 - 4.4)^2 + 5(6 - 4.4)^2$$
$$\text{SSB} = 5(1.44) + 5(0) + 5(0.16) + 5(2.56) = 7.2 + 0 + 0.8 + 12.8 = 20.8$$

Now, we will calculate the Sum of Squares for Errors (or Residuals) [SSE]. In order to calculate SSE, the squared differences between each observation and its group mean are required, i.e. SSE = Total value of (Score – 3.2)² of Group A + Total value of (Score – 4.4)² of Group B + Total value of (Score – 4.0)² of Group C + Total value of (Score – 6.0)² of Group D. So, it will be calculated in parts.

For the samples in Group A,

| Group A | (Score – 3.2) | (Score – 3.2)² |
|---|---|---|
| 3 | -0.2 | 0.04 |
| 3 | -0.2 | 0.04 |
| 5 | 1.8 | 3.24 |
| 1 | -2.2 | 4.84 |
| 4 | 0.8 | 0.64 |
| Total | 0 | 8.8 |

For the samples in Group B,

| Group B | (Score – 4.4) | (Score – 4.4)² |
|---|---|---|
| 4 | -0.4 | 0.16 |
| 6 | 1.6 | 2.56 |
| 5 | 0.6 | 0.36 |
| 3 | -1.4 | 1.96 |
| 4 | -0.4 | 0.16 |
| Total | 0 | 5.2 |

For the samples in Group C,

| Group C | (Score – 4) | (Score – 4)² |
|---|---|---|
| 3 | -1 | 1 |
| 5 | 1 | 1 |
| 4 | 0 | 0 |
| 6 | 2 | 4 |
| 2 | -2 | 4 |
| Total | 0 | 10 |

For the samples in Group D,

| Group D | (Score – 6) | (Score – 6)² |
|---|---|---|
| 9 | 3 | 9 |
| 2 | -4 | 16 |
| 7 | 1 | 1 |
| 8 | 2 | 4 |
| 4 | -2 | 4 |
| Total | 0 | 34 |
Now SSE = Total value of (Score – 3.2)² of Group A + Total value of (Score – 4.4)² of Group B + Total value of (Score – 4.0)² of Group C + Total value of (Score – 6.0)² of Group D
SSE = 8.8 + 5.2 + 10 + 34
SSE = 58
Now, the ANOVA Table is as follows:

| Source of Variation | Sums of Squares (SS) | Degrees of Freedom (df) | Mean Squares (MS) | F |
|---|---|---|---|---|
| Between Treatments | 20.8 = SSB | 4-1=3 = df1 | 20.8/3=6.93 = MSB | MSB / MSE = 1.91 |
| Error (or Residual) | 58 = SSE | 20-4=16 = df2 | 58/16=3.63 = MSE | |
| Total | 78.8 | 20-1=19 | | |
From this calculated F value, i.e. 1.91, we conclude that we cannot reject the null hypothesis, as 1.91 is smaller than the critical value of 3.24. We do not have statistically significant evidence at α=0.05 of a difference in the mean score among the four groups or treatment levels. In short, at the concentrations tested (5%, 15%, and 25% solutions), the new treatment was not shown to be significantly more effective than placebo.
Using Minitab
In order to work on one way ANOVA, we can use the following data:

| Group A receives placebo | Group B receives 5% solution of treatment | Group C receives 15% solution of treatment | Group D receives 25% solution of treatment |
|---|---|---|---|
| 3 | 4 | 3 | 9 |
| 3 | 6 | 5 | 2 |
| 5 | 5 | 4 | 7 |
| 1 | 3 | 6 | 8 |
| 4 | 4 | 2 | 4 |
Enter these values in four columns, i.e. C1 for “Group A receives placebo”, C2 for “Group B receives 5% solution of treatment”, C3 for “Group C receives 15% solution of treatment”, and C4 for “Group D receives 25% solution of treatment”.
Normality test
First, we have to check the testing assumptions. To assess the normality of the data, we can use the Kolmogorov-Smirnov Test (KS-Test). For this test, open a new Worksheet, and combine the values of all the groups in a single column, i.e. C1, as the groups (four groups) are the independent variable, and the values are the dependent variable. Now go to the “Stat” tab above, go to “Basic Statistics”, and click “Normality Test”. Move C1 to the box labelled “Variable:”. Select “Kolmogorov-Smirnov” under “Tests for Normality”, and click “OK.” The following graph and results are obtained:
The data points lie comparatively close to the fitted normal distribution line. The p-value (0.094) is more than 0.05. So, the null hypothesis of normality cannot be rejected, and the data can be treated as approximately normally distributed.
Outlier Test
Using the same Worksheet, go to the “Stat” tab above, go to “Basic Statistics”, and go to “Outlier Test” in the 6th section. Move the column, i.e. C1, to the box under “Variables:” by selecting it and pressing the “Select” button. Now click “OK.” The following results and graph appear.
Grubbs' Test

| Variable | N | Mean | StDev | Min | Max | G | P |
|---|---|---|---|---|---|---|---|
| C1 | 20 | 4.400 | 2.037 | 1.000 | 9.000 | 2.26 | 0.317 |

* NOTE * No outlier at the 5% level of significance
These results show that there is no significant outlier.
Homogeneity of variance
In order to test the homogeneity of variance, enter these values in four columns, i.e. C1 for “Group A receives placebo”, C2 for “Group B receives 5% solution of treatment”, C3 for “Group C receives 15% solution of treatment”, and C4 for “Group D receives 25% solution of treatment”.
Now go to the “Stat” tab above, go to “ANOVA”, and click “Test for Equal Variances”. From the drop down menu, select “Response data are in a separate column for each factor level”. In the box under “Responses:” move all the four columns, i.e. C1, C2, C3, and C4. Click “OK.” The following results are obtained:
| Method | Test Statistic | P-Value |
|---|---|---|
| Multiple comparisons | — | 0.243 |
| Levene | 1.27 | 0.319 |
The analysis of homogeneity of variances shows that the data meet the assumption of homogeneity of variance, as the p-value is more than 0.05, i.e. there is no statistically significant difference between the variances of the groups. So, now we can conduct one way ANOVA after checking the other assumptions.
One Way ANOVA
Now go to “Stat”, go to “ANOVA”, and click “One-Way”. Select “Response data are in a separate column for each factor level” from the drop down menu; move all the columns, i.e. C1, C2, C3, and C4, to the box under “Responses:”. Click “OK.” This gives us the results in which the F-value is equal to 1.91 and the p-value is equal to 0.168.
Analysis of Variance

| Source | DF | Adj SS | Adj MS | F-Value | P-Value |
|---|---|---|---|---|---|
| Factor | 3 | 20.80 | 6.933 | 1.91 | 0.168 |
| Error | 16 | 58.00 | 3.625 | | |
| Total | 19 | 78.80 | | | |
This p-value is more than 0.05 showing that the results are not statistically significant.
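The sums of squares and the F-value reported for the one way ANOVA can be cross-checked outside Minitab. The following is a minimal Python sketch using only the standard library; the function name `one_way_anova` is a hypothetical helper introduced for illustration, and the critical value 3.24 is simply copied from the F table at α = 0.05 with (3, 16) degrees of freedom rather than computed.

```python
import statistics

# Overall scores for the four groups, copied from the data table.
groups = {
    "A (placebo)": [3, 3, 5, 1, 4],
    "B (5%)": [4, 6, 5, 3, 4],
    "C (15%)": [3, 5, 4, 6, 2],
    "D (25%)": [9, 2, 7, 8, 4],
}


def one_way_anova(samples):
    """Return (SSB, SSE, df1, df2, F) for a list of groups."""
    all_values = [value for sample in samples for value in sample]
    grand_mean = statistics.mean(all_values)
    # Between-treatments sum of squares: n * (group mean - overall mean)^2.
    ssb = sum(
        len(sample) * (statistics.mean(sample) - grand_mean) ** 2
        for sample in samples
    )
    # Error sum of squares: (observation - its group mean)^2.
    sse = sum(
        (value - statistics.mean(sample)) ** 2
        for sample in samples
        for value in sample
    )
    df1 = len(samples) - 1
    df2 = len(all_values) - len(samples)
    return ssb, sse, df1, df2, (ssb / df1) / (sse / df2)


F_CRITICAL = 3.24  # assumption: read from the F table, not computed here

ssb, sse, df1, df2, f_value = one_way_anova(list(groups.values()))
print(f"SSB = {ssb:.2f}, SSE = {sse:.2f}, df = ({df1}, {df2})")
print(f"F = {f_value:.2f}")
print("reject H0" if f_value >= F_CRITICAL else "cannot reject H0")
```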
Two way ANOVA
The two-factor ANOVA procedure examines the effect of two independent variables at the same time. For example, we can consider different ages and different concentrations of the treatment together, as opposed to only different concentrations as shown above in one way ANOVA.
Testing Assumptions:
There is a continuous dependent variable and two independent variables, each with two or more categorical groups.
Observations are independent of each other.
Samples come from normally distributed populations.
Samples show homogeneity of variance.
There are no significant outliers in the data.
N.B.: If these assumptions are not fulfilled, you may consider using non-parametric tests.
Suppose we have participants belonging to two different age groups, i.e. Group A having 15 samples (participants) in the age range of 10-25 years, and Group B having 15 samples (participants) in the age range of 26-40 years. The 15 participants of each group were randomly assigned to three different solutions of the new treatment, i.e. Treatment X received 5% of the solution, Treatment Y received 15% of the solution, and Treatment Z received 25% of the solution. Suppose we have obtained the following data:

| Treatment | Group A of participants in the age range of 10-25 years | Group B of participants in the age range of 26-40 years |
|---|---|---|
| X receiving 5% of the solution | 3 | 4 |
| | 4 | 5 |
| | 4 | 5 |
| | 5 | 6 |
| | 4 | 7 |
| Y receiving 15% of the solution | 5 | 6 |
| | 6 | 5 |
| | 5 | 6 |
| | 8 | 7 |
| | 6 | 7 |
| Z receiving 25% of the solution | 7 | 7 |
| | 7 | 9 |
| | 8 | 9 |
| | 6 | 7 |
| | 6 | 8 |
Here, we will use two way ANOVA as we have two groups, i.e. Group A of participants in the age range of 10-25 years and Group B of participants in the age range of 26-40 years, and three treatments, i.e. X receiving 5% of the solution, Y receiving 15% of the solution, and Z receiving 25% of the solution. In order to perform two way ANOVA, we have the following ANOVA table:

| Source of Variation | Sums of Squares (SS) | Degrees of freedom (df) | Mean Squares (MS) | F |
|---|---|---|---|---|
| Total number of groups | SSN | df1 = Total number of groups – 1 = 6-1 = 5 | SSN / df1 = MSN | MSN/MSE |
| Treatment | SST | df2 = Total number of treatment groups – 1 = 3-1 = 2 | SST / df2 = MST | MST/MSE |
| Groups | SSG | df3 = Total number of groups by age range – 1 = 2-1 = 1 | SSG / df3 = MSG | MSG/MSE |
| Treatment versus Group interaction | SSTG = SSN – (SST + SSG) | df4 = df2 * df3 = 2*1 = 2 | SSTG / df4 = MSTG | MSTG/MSE |
| Error (or Residual) | SSE | df5 = Total number of samples – (Total number of groups by age range * Total number of treatment groups) = 30 – (2*3) = 30-6 = 24 | SSE / df5 = MSE | |
| Total | | | | |
After thorough calculation, we can make another table with the final values.

| Source of Variation | Sums of Squares (SS) | Degrees of freedom (df) | Mean Squares (MS) | F |
|---|---|---|---|---|
| Total number of groups | 45 | df1 = 5 | 45/5 = 9 | 9/0.95 = 9.47 |
| Treatment | 36.5 | df2 = 2 | 36.5 / 2 = 18.3 | 18.3/0.95 = 19.26 |
| Groups | 6.5 | df3 = 1 | 6.5 / 1 = 6.5 | 6.5/0.95 = 6.84 |
| Treatment versus Group interaction | 2 | df4 = 2 | 2/2 = 1 | 1/0.95 = 1.05 |
| Error (or Residual) | 22.8 | df5 = 24 | 22.8 / 24 = 0.95 | |
| Total | | | | |
In this table, there are four statistical tests. The first test is an overall test to check whether a difference exists between the 6 group means. The F statistic is 9.47, which is greater than the critical value of 2.62 at the alpha level of 0.05, so it is statistically significant. After establishing the significance of the overall test, it is important to check which factors may produce the significance, i.e. treatment, group, or their interaction. The F statistic for the treatment is 19.26, which is greater than the critical value of 3.40, so it is significant. Similarly, the F statistic for the groups is 6.84, which is greater than the critical value of 4.26, so it is significant. However, the F statistic for the treatment versus group interaction is 1.05, which is lower than the critical value of 3.40, so it is non-significant. Now, we will look at the mean values of the different groups and treatments. So, we have the following table:

| Treatment | Group A | Group B |
|---|---|---|
| X | 4 | 5.4 |
| Y | 6 | 6.2 |
| Z | 6.8 | 8 |

Treatment Z has the highest mean score in both groups, so it appears to be the best treatment. For every treatment, Group B shows a higher mean score than Group A.
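The sums of squares in the two way ANOVA table can be reproduced with a short script. The following is a minimal Python sketch using only the standard library; the names `cells` and `sum_sq_between` are hypothetical and introduced only for illustration. It follows the decomposition given in the table (SSTG = SSN – (SST + SSG)) and assumes equal cell sizes, as in this example. Because it keeps unrounded sums of squares, its F values differ slightly from the hand-calculated ones, which round the intermediate values.

```python
from statistics import mean

# cells[treatment][group] holds the five outcome scores of that cell,
# copied from the data table (X = 5%, Y = 15%, Z = 25% solution).
cells = {
    "X": {"A": [3, 4, 4, 5, 4], "B": [4, 5, 5, 6, 7]},
    "Y": {"A": [5, 6, 5, 8, 6], "B": [6, 5, 6, 7, 7]},
    "Z": {"A": [7, 7, 8, 6, 6], "B": [7, 9, 9, 7, 8]},
}
treatments = list(cells)
groups = ["A", "B"]


def sum_sq_between(samples, grand_mean):
    """Sum over samples of n * (sample mean - grand mean)^2."""
    return sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)


all_cells = [cells[t][g] for t in treatments for g in groups]
all_values = [value for cell in all_cells for value in cell]
grand = mean(all_values)

ssn = sum_sq_between(all_cells, grand)  # between the 6 cells
sst = sum_sq_between([cells[t]["A"] + cells[t]["B"] for t in treatments], grand)
ssg = sum_sq_between(
    [[v for t in treatments for v in cells[t][g]] for g in groups], grand
)
sstg = ssn - (sst + ssg)  # treatment versus group interaction
sse = sum((v - mean(cell)) ** 2 for cell in all_cells for v in cell)

df_n = len(all_cells) - 1
df_t = len(treatments) - 1
df_g = len(groups) - 1
df_tg = df_t * df_g
df_e = len(all_values) - len(all_cells)
mse = sse / df_e

f_values = {
    "all groups": (ssn / df_n) / mse,
    "treatment": (sst / df_t) / mse,
    "group": (ssg / df_g) / mse,
    "interaction": (sstg / df_tg) / mse,
}

print(f"SSN={ssn:.3f} SST={sst:.3f} SSG={ssg:.3f} SSTG={sstg:.3f} SSE={sse:.3f}")
for source, f_value in f_values.items():
    print(f"F({source}) = {f_value:.2f}")
```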
Using Minitab
In order to work on two-way ANOVA, we can use the following data as an example:

| Treatment | Group A of participants in the age range of 10-25 years | Group B of participants in the age range of 26-40 years |
|---|---|---|
| X receiving 5% of the solution | 3 | 4 |
| | 4 | 5 |
| | 4 | 5 |
| | 5 | 6 |
| | 4 | 7 |
| Y receiving 15% of the solution | 5 | 6 |
| | 6 | 5 |
| | 5 | 6 |
| | 8 | 7 |
| | 6 | 7 |
| Z receiving 25% of the solution | 7 | 7 |
| | 7 | 9 |
| | 8 | 9 |
| | 6 | 7 |
| | 6 | 8 |
Normality test
First, we have to check the testing assumptions. To assess the normality of the data, we can use the Kolmogorov-Smirnov Test (KS-Test). For this test, open a new Worksheet, and combine the values of all the groups in a single column, i.e. C1, as the groups are the independent variables, and the values are the dependent variable. Now go to the “Stat” tab above, go to “Basic Statistics”, and click “Normality Test”. Move C1 to the box labelled “Variable:”. Select “Kolmogorov-Smirnov” under “Tests for Normality”, and click “OK.” The following graph and results are obtained:
The data points lie comparatively close to the fitted normal distribution line. The p-value (>0.150) is more than 0.05. So, the null hypothesis of normality cannot be rejected, and the data can be treated as approximately normally distributed.
Outlier Test
Using the same Worksheet, go to the “Stat” tab above, go to “Basic Statistics”, and go to “Outlier Test” in the 6th section. Move the column, i.e. C1, to the box under “Variables:” by selecting it and pressing the “Select” button. Now click “OK.” The following results and graph appear.
Grubbs' Test

| Variable | N | Mean | StDev | Min | Max | G | P |
|---|---|---|---|---|---|---|---|
| C1 | 30 | 6.067 | 1.530 | 3.000 | 9.000 | 2.00 | 1.000 |

* NOTE * No outlier at the 5% level of significance
These results show that there is no significant outlier.
Homogeneity of variance, normality, and independence of residuals
We have to assign numbers to the groups and treatments. So, we assign number “1” to “Group A of participants in the age range of 10-25 years” and “2” to “Group B of participants in the age range of 26-40 years.” Similarly, we assign number “1” to “X receiving 5% of the solution”, “2” to “Y receiving 15% of the solution”, and “3” to “Z receiving 25% of the solution.” After assigning the numbers, we enter these numbers and the outcomes according to the assigned numbers in Minitab. For example,
Group 1, Treatment 1, Outcome 3;
Group 1, Treatment 1, Outcome 4;
…
Group 2, Treatment 1, Outcome 4;
…
Group 2, Treatment 2, Outcome 6, etc.
Enter the values of the groups in the first column, i.e. C1; the treatments in the second column, i.e. C2, and the outcomes in the third column, i.e. C3. Now go to “Stat”, go to “ANOVA”, go to “General Linear Model”, and click “Fit General Linear Model”. Move the columns to the respective boxes. In this case, C3 goes to “Responses:”, C2 goes to “Factors:”, and C1 goes to “Covariates:”. Now click “Graphs”, select “Four in one”, and click “OK.” Click “OK” again. The following graphs are obtained:
In these graphs, the “Normal Probability Plot” shows that the residuals follow an approximately straight line, thereby showing that the residuals are approximately normally distributed. The “Versus Fits” plot shows that none of the groups has substantially different variability (almost all of them overlap), and there is no significant outlier. This indicates constant variance. The “Histogram” shows that none of the bars is far from the other bars, which indicates no outliers. Furthermore, there is no long tail in one direction, which indicates no skewness. The “Versus Order” plot shows that the residuals fall randomly about the centerline, which indicates that the residuals are independent of each other. These checks show that two way ANOVA can be conducted on the data. In fact, it has already been conducted (above the graphs in Minitab).
Two Way ANOVA
Analysis of Variance

| Source | DF | Adj SS | Adj MS | F-Value | P-Value |
|---|---|---|---|---|---|
| Groups | 1 | 6.533 | 6.5333 | 6.83 | 0.015 |
| Treatments | 2 | 36.467 | 18.2333 | 19.06 | 0.000 |
| Error | 26 | 24.867 | 0.9564 | | |
| Lack-of-Fit | 2 | 2.067 | 1.0333 | 1.09 | 0.353 |
| Pure Error | 24 | 22.800 | 0.9500 | | |
| Total | 29 | 67.867 | | | |
In the obtained results, the p-values of the groups (C1) and the treatments (C2) are less than 0.05, i.e. they show statistically significant results. In order to look at the mean values of the different groups and treatments, go to “Stat”, go to “Basic Statistics”, and go to “Display Descriptive Statistics”. In the box under “Variables:” move the column “C3 Outcomes”, and in the box under “By variables (optional):” move the columns “C1 Groups” and “C2 Treatments”. Click “Statistics” and select only “Mean.” Click “OK.” Click “OK” again. This gives the following results:
Descriptive Statistics: Outcomes
Results for Groups = 1
Statistics

| Variable | Treatments | Mean |
|---|---|---|
| Outcomes | 1 | 4.000 |
| | 2 | 6.000 |
| | 3 | 6.800 |

Results for Groups = 2
Statistics

| Variable | Treatments | Mean |
|---|---|---|
| Outcomes | 1 | 5.400 |
| | 2 | 6.200 |
| | 3 | 8.000 |

Treatment # 3 (i.e. Z) has the highest mean score in both groups, so it appears to be the best treatment. For every treatment, Group # 2 (i.e. B) shows a higher mean score than Group # 1 (i.e. A).
Different types of ANOVAs
Using Minitab – General MANOVA
Suppose researchers want to check the effect of a Drug on heart rate and mobility, and perform an experiment to compare the Drug with a Standard drug and a placebo. In order to perform the experiment, the researchers work with 36 participants, and expose 12 participants to the Drug, 12 participants to the Standard drug, and 12 participants to the placebo. The results are obtained in the form of ratings from 1 to 10, where higher numbers show increased heart rate and mobility. In Minitab, assign C1 to the Intervention, i.e. Drug, Standard Drug, and Placebo. Enter “Drug” in the first 12 rows, “Standard drug” in the rows from 13 to 24, and “placebo” in the rows from 25 to 36. Assign C2 to the rating for Heart rate, and C3 to the rating for Mobility. Write the obtained outcomes for the different interventions and effects in the corresponding rows.
Now go to the “Stat” tab above, go to “ANOVA”, and go to “General MANOVA”.
In the window that appears, move “Heart rate” and “Mobility” to the box labelled “Responses:”, and move “Intervention” to the box under “Model:”. Click “OK.” The following results are obtained:
General Linear Model: Heart rate, Mobility versus Intervention
MANOVA Tests for Intervention

| Test Criterion | Statistic | F | Num DF | Denom DF | P |
|---|---|---|---|---|---|
| Wilks’ | 0.63686 | 4.049 | 4 | 64 | 0.005 |
| Lawley-Hotelling | 0.56570 | 4.384 | 4 | 62 | 0.003 |
| Pillai’s | 0.36601 | 3.696 | 4 | 66 | 0.009 |
| Roy’s | 0.55762 | | | | |

s = 2, m = -0.5, n = 15.0
These results show that the one-way MANOVA is statistically significant, as the p-values are less than 0.05. Therefore, it can be said that there is a statistically significant difference between the interventions on the outcomes. In order to look at the mean values of the different treatments (interventions) and outcomes, go to “Stat”, go to “Basic Statistics”, and go to “Display Descriptive Statistics”. In the box under “Variables:” move the columns “Heart rate” and “Mobility”, and in the box under “By variables (optional):” move the column “Intervention”. Click “Statistics” and select only “Mean.” Click “OK.” Click “OK” again. This gives the following results:
Descriptive Statistics: Heart rate, Mobility
Statistics

| Variable | Intervention | Mean |
|---|---|---|
| Heart rate | Drug | 2.583 |
| | placebo | 6.500 |
| | Standard Drug | 3.500 |
| Mobility | Drug | 2.917 |
| | placebo | 5.417 |
| | Standard Drug | 2.917 |

The “Drug” shows better results (a lower mean rating) than the “Standard Drug” and the “placebo” in the case of Heart rate. In the case of Mobility, the “Drug” and the “Standard Drug” have the same mean rating (2.917), and both show useful results as compared to the “placebo.”
Factor Analysis
Factor analysis is used as a data reduction method or a structure detection method.
Path Analysis
It is an extension of multiple regression.
Structural Equation Modeling
It is a confirmatory technique that is an advancement of path analysis.
Effect size
Effect size (ES) shows the magnitude of the difference between two different variables or groups, indicating the strength of the difference between groups on a numeric scale.

Effect size (ES) (Part 1)

Absolute effect size: the difference between the mean or average outcomes of two groups.

$$ES = \mu_1 - \mu_2$$

Standardized means difference: obtained by dividing the difference of the means of two groups by their standard deviation.

Odds ratio: the odds of success in one group relative to the odds of success in the other group, where a, b, c, and d are the four cell counts of a 2×2 table.

$$OR = \frac{ad}{bc}$$

Cohen's d effect size: obtained by dividing the difference of the means of two groups by the standard deviation from the data.

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s}, \qquad s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2}}$$

Effect size (ES) (Part 2)

Pearson r correlation: the association of two variables (x and y).

Hedges' g method of effect size: a modified method of Cohen's d.

$$g = \frac{\bar{x}_1 - \bar{x}_2}{s^*}, \qquad s^* = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

Glass's Δ method of effect size: a modified method of Cohen's d that uses the standard deviation of the control group or second group.

$$\Delta = \frac{\bar{x}_1 - \bar{x}_2}{s_2}$$
Using Minitab
The “Calculator” function could be used to work on effect size.
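As an illustration of what such a calculation involves, the following is a minimal Python sketch of three standardized effect sizes. It rests on several assumptions: the function names are hypothetical; the pooled standard deviation for Cohen's d divides by n1 + n2 and the one for Hedges' g divides by n1 + n2 – 2; Glass's Δ divides by the standard deviation of the second (control) group; and no small-sample correction factor is applied. The two samples are the 25% solution and placebo groups from the one way ANOVA example, reused here purely as example data.

```python
import math
import statistics


def pooled_sd(x1, x2, correction):
    """Pooled SD; the denominator is n1 + n2 - correction."""
    n1, n2 = len(x1), len(x2)
    pooled_ss = (
        (n1 - 1) * statistics.variance(x1) + (n2 - 1) * statistics.variance(x2)
    )
    return math.sqrt(pooled_ss / (n1 + n2 - correction))


def cohens_d(x1, x2):
    """Mean difference divided by the pooled SD with n1 + n2 below."""
    return (statistics.mean(x1) - statistics.mean(x2)) / pooled_sd(x1, x2, 0)


def hedges_g(x1, x2):
    """Mean difference divided by the pooled SD with n1 + n2 - 2 below."""
    return (statistics.mean(x1) - statistics.mean(x2)) / pooled_sd(x1, x2, 2)


def glass_delta(x1, control):
    """Mean difference divided by the SD of the control (second) group."""
    return (statistics.mean(x1) - statistics.mean(control)) / statistics.stdev(
        control
    )


# Example data only: Group D (25% solution) versus Group A (placebo).
treated = [9, 2, 7, 8, 4]
placebo = [3, 3, 5, 1, 4]

d = cohens_d(treated, placebo)
g = hedges_g(treated, placebo)
delta = glass_delta(treated, placebo)
print(f"Cohen's d = {d:.3f}")
print(f"Hedges' g = {g:.3f}")
print(f"Glass's delta = {delta:.3f}")
```

In Minitab, the same arithmetic can be entered step by step in the “Calculator” once the group means and standard deviations are available.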
Odds Ratio and Mantel-Haenszel Odds Ratio
Odds Ratio (OR) is used to express the effect size of the difference between two interventions or treatments. Suppose there is a bacterial disease that could be treated by fish meat, especially rainbow trout. So, we can develop two groups of mice (animal models) to measure and compare the treatments’ efficacy. One group receives the standard treatment with the commonly available fishes, and the other group receives rainbow trout. After giving the treatment with fishes, suppose we get the following results.

| | Standard treatment with fishes | Treatment with rainbow trout | Odds |
|---|---|---|---|
| Mouse Died | 454 | 19 | 454/19 = 23.89 |
| Mouse Survived | 358 | 105 | 358/105 = 3.41 |
| Total | 812 | 124 | Odds Ratio = 23.89/3.41 = 7 |
This table shows that the odds of death for the animals that received the standard treatment with fishes were about 7 times the odds of death for the animals that received the treatment with rainbow trout. Now suppose we perform the experiment on male and female animal models and get the following results:

| | | Standard treatment with fishes | Treatment with rainbow trout | Totals | OR |
|---|---|---|---|---|---|
| Female animal models | Animals died | 53 = a | 7 = b | 60 = tfd | 2.67 |
| | Animals survived | 102 = c | 36 = d | 138 = tfs | |
| | Totals | 155 = nfs | 43 = nfr | 198 = nf | |
| Male animal models | Animals died | 111 = e | 13 = f | 124 = tmd | 3.99 |
| | Animals survived | 152 = g | 71 = h | 223 = tms | |
| | Totals | 263 = nms | 84 = nmr | 347 = nm | |
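The table shows that the impact of the treatment is greater in males, as the OR in male animal models is higher than the OR in female animal models. In order to estimate the impact of the treatment while accounting for the different sexes, the “weighted” OR, or Mantel-Haenszel OR (ORMH), is used. ORMH is as follows, where nf = 53 + 7 + 102 + 36 = 198 and nm = 111 + 13 + 152 + 71 = 347 are the total numbers of female and male animals:

$$OR_{MH} = \frac{\dfrac{a\,d}{n_f} + \dfrac{e\,h}{n_m}}{\dfrac{b\,c}{n_f} + \dfrac{f\,g}{n_m}} = \frac{\dfrac{53 \times 36}{198} + \dfrac{111 \times 71}{347}}{\dfrac{7 \times 102}{198} + \dfrac{13 \times 152}{347}} = \frac{32.35}{9.30} = 3.48$$

This shows that the weighted odds of death with the standard treatment are about 3.5 times the odds of death of the animal models treated with rainbow trout.
Using Minitab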
In order to work on Odds Ratio, we can use the following data as an example:

| | Standard treatment with fishes | Treatment with rainbow trout |
|---|---|---|
| Mouse Died | 454 | 19 |
| Mouse Survived | 358 | 105 |
In Minitab, we have to put this table as follows:

| C1 | C2 | C3 |
|---|---|---|
| Standard Treatment | Mouse Died | 454.00 |
| Treatment with Rainbow Trout | Mouse Died | 19.00 |
| Standard Treatment | Mouse Survived | 358.00 |
| Treatment with Rainbow Trout | Mouse Survived | 105.00 |

Now go to “Stat”, go to “Regression”, go to “Binary Logistic Regression”, and click “Fit Binary Logistic Model”. In “Response:” enter C2; in “Frequency:” enter C3, and in “Categorical predictors:” enter C1. Click “OK.” Results appear in which “Odds Ratio” is 7.0082.
Odds Ratios for Categorical Predictors

| | Level A | Level B | Odds Ratio | 95% CI |
|---|---|---|---|---|
| C1 | Treatment with Rainbow Trout | Standard Treatment | 7.0082 | (4.2173, 11.6462) |

Odds ratio for level A relative to level B
In order to work on ORMH, we can use the following data:
| | | Standard treatment with fishes | Treatment with rainbow trout | Totals | OR |
|---|---|---|---|---|---|
| Female animal models | Animals died | 53 = a | 7 = b | 60 = tfd | 2.67 |
| | Animals survived | 102 = c | 36 = d | 138 = tfs | |
| | Totals | 155 = nfs | 43 = nfr | 198 = nf | |
| Male animal models | Animals died | 111 = e | 13 = f | 124 = tmd | 3.99 |
| | Animals survived | 152 = g | 71 = h | 223 = tms | |
| | Totals | 263 = nms | 84 = nmr | 347 = nm | |
However, in Minitab we have to put the data in the following way:

| C1 | C2 | C3 | C4 |
|---|---|---|---|
| female | standard treatment | animal died | 53.00 |
| male | standard treatment | animal died | 111.00 |
| female | treatment with rai | animal died | 7.00 |
| male | treatment with rai | animal died | 13.00 |
| female | standard treatment | animal surv | 102.00 |
| male | standard treatment | animal surv | 152.00 |
| female | treatment with rai | animal surv | 36.00 |
| male | treatment with rai | animal surv | 71.00 |
Now go to “Stat”, go to “Tables”, and go to “Cross Tabulation and Chi-Square”. Select “Raw data (categorical variables)” from the drop down menu. Move C2 to “Rows:”, C3 to “Columns:”, C1 to “Layers:”, and C4 to “Frequencies:”. Click “Other Stats”, select “Cochran-Mantel-Haenszel test for multiple tables”, and click “OK.” Click “OK.” This gives us the result for the Mantel-Haenszel chi-squared test.
Cochran-Mantel-Haenszel Test

| Common Odds Ratio | CMH Statistic | DF | P-Value |
|---|---|---|---|
| 3.47808 | 23.3116 | 1 | 0.0000014 |

Results for all 2x2 tables
In this result, we have “Common odds ratio” equal to 3.478.
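The odds ratios in this section can also be cross-checked with a few lines of code. The following is a minimal Python sketch using only the standard library; the function names are hypothetical, each 2×2 table is passed as (died on standard, died on rainbow trout, survived on standard, survived on rainbow trout), and the stratum totals are computed from the four cell counts. The confidence interval uses the usual log odds ratio (Woolf) approximation, which is an assumption about the method, although it appears to agree with the interval reported by Minitab.

```python
import math
from statistics import NormalDist


def odds_ratio(a, b, c, d):
    """Cross-product ratio ad / bc of a 2x2 table."""
    return (a * d) / (b * c)


def odds_ratio_ci(a, b, c, d, level=0.95):
    """Approximate confidence interval based on the log odds ratio."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    std_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio(a, b, c, d))
    return math.exp(log_or - z * std_error), math.exp(log_or + z * std_error)


def mantel_haenszel_or(tables):
    """Weighted (Mantel-Haenszel) odds ratio over several 2x2 tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


overall = (454, 19, 358, 105)
female = (53, 7, 102, 36)
male = (111, 13, 152, 71)

or_overall = odds_ratio(*overall)
ci_low, ci_high = odds_ratio_ci(*overall)
or_mh = mantel_haenszel_or([female, male])

print(f"OR (all mice) = {or_overall:.4f}, 95% CI ({ci_low:.4f}, {ci_high:.4f})")
print(f"OR (female) = {odds_ratio(*female):.2f}")
print(f"OR (male) = {odds_ratio(*male):.2f}")
print(f"Mantel-Haenszel OR = {or_mh:.5f}")
```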
Correlation Coefficient
Suppose research is performed on young animal models of approximately the same weight. Initially, the animals were injected with the disease-causing bacteria and placed in the lab for five days. After 5 days, their physical condition was checked, and their weight was assessed. After thorough assessment, those animals were provided with a specific amount of rainbow trout, and after 5 days their physical condition was again checked and their weight was assessed. Suppose the assessment gives the following results:

| Animal Model # | Gram of rainbow trout per day (i.e. feed of animal models for 5 days) | Weight of animals (gms) after providing the specific amount of rainbow trout per day |
|---|---|---|
| 1 | 1 | 7 |
| 2 | 2 | 10 |
| 3 | 3 | 15 |
| 4 | 4 | 16 |
| 5 | 5 | 17 |
| 6 | 6 | 18 |
This information helps in providing the following graph:
The correlation coefficient, which is represented by ‘r’, can help in assessing whether increased grams of rainbow trout are associated with an increased level of improvement in the animal models. The sample correlation coefficient varies from -1 to +1. Its sign shows the direction of the linear association between the two variables, i.e. grams of rainbow trout given to the animal models and improvement in their condition, and its absolute value shows the strength of that association: values close to -1 or +1 indicate a strong linear association, and a correlation close to zero means that there is little or no linear relation between the two variables. However, before going further, it is important to know that in this case, grams of rainbow trout is the independent variable and is presented on the x-axis, and weight of animals is the dependent variable and is presented on the y-axis. So, we develop a scatter diagram as shown here. Each point on the diagram shows an (x,y) pair. This scatter diagram apparently shows a positive or direct relation between the two variables, i.e. an increased amount of fish taken per day is associated with a higher level of improvement. We can also develop a table showing the total and mean values of the grams of rainbow trout given to the animal models and their condition.

| Animal Model # | Gram of rainbow trout per day (i.e. feed of animal models for 5 days) | Weight of animals after providing a specific amount of rainbow trout |
|---|---|---|
| 1 | 1 | 7 |
| 2 | 2 | 10 |
| 3 | 3 | 15 |
| 4 | 4 | 16 |
| 5 | 5 | 17 |
| 6 | 6 | 18 |
| Total | 21 | 83 |
| Mean value | 3.5 = X-mean | 13.8 = Y-mean |
In order to work on the sample correlation coefficient, the variance of the values on the X-axis, the variance of the values on the Y-axis, and the covariance of the values on both X-axis and Y-axis [Cov(x,y)] are required. Variance of the values on X-axis is as follows:
Animal Model # | Gram of rainbow trout per day (i.e. feed of animal models for 5 days) | Grams – X-mean | (Grams – X-mean)²
1 | 1 | -2.5 | 6.25
2 | 2 | -1.5 | 2.25
3 | 3 | -0.5 | 0.25
4 | 4 | 0.5 | 0.25
5 | 5 | 1.5 | 2.25
6 | 6 | 2.5 | 6.25
Total | 21 | 0 | 17.5
The variance of the values on X-axis = 17.5/6 = 2.9. Variance of the values on Y-axis is as follows:
Animal Model # | Weight of animals after providing a specific amount of rainbow trout | Weight – Y-mean | (Weight – Y-mean)²
1 | 7 | -6.83 | 46.69
2 | 10 | -3.83 | 14.69
3 | 15 | 1.17 | 1.36
4 | 16 | 2.17 | 4.69
5 | 17 | 3.17 | 10.03
6 | 18 | 4.17 | 17.36
Total | 83 | ≈0 | 94.83
The variance of the values on Y-axis = 94.83/6 = 15.8. The covariance of the values of X-axis and Y-axis is presented as:
Animal Model # | Grams – X-mean | Weight – Y-mean | (Grams – X-mean)(Weight – Y-mean)
1 | -2.5 | -6.83 | 17.075
2 | -1.5 | -3.83 | 5.745
3 | -0.5 | 1.17 | -0.585
4 | 0.5 | 2.17 | 1.085
5 | 1.5 | 3.17 | 4.755
6 | 2.5 | 4.17 | 10.425
Total | | | 38.5
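The covariance of the values on X-axis and the values on Y-axis = Cov (x,y) = 38.5/6 = 6.4. The formula for calculation of the sample correlation coefficient is as follows:
r = Cov(x,y) / √(Variance of x × Variance of y)
r = 6.4 / √(2.9 × 15.8)
r = 6.4 / 6.8
r = 0.94
Here each sum is divided by n = 6. Dividing by n – 1 = 5 instead changes the variances and the covariance, but not r, because the same divisor appears in the numerator and the denominator. This value of the sample correlation coefficient clearly shows a strong positive correlation.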
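The hand calculation above can be cross-checked with a short Python sketch. It is only an illustration of the same arithmetic and is not part of Minitab; the names `grams`, `weights`, and `pearson_r` are chosen here for the example.

```python
import math

# Data from the example: grams of rainbow trout per day (x) and weight (y)
grams = [1, 2, 3, 4, 5, 6]
weights = [7, 10, 15, 16, 17, 18]


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sums of squared deviations and of cross-products (17.5, 94.83, 38.5 here)
    ss_x = sum((xi - x_mean) ** 2 for xi in x)
    ss_y = sum((yi - y_mean) ** 2 for yi in y)
    ss_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    # The divisor (n or n - 1) cancels, so the raw sums can be used directly
    return ss_xy / math.sqrt(ss_x * ss_y)


r = pearson_r(grams, weights)
print(round(r, 3))
```

Because no intermediate value is rounded, the result agrees with the Minitab value to three decimal places rather than with the two-decimal hand result.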
Using Minitab
In order to work on Correlation Coefficient, we can use the following data:
Animal Model # | Gram of rainbow trout per day (i.e. feed of animal models for 5 days) | Weight of animals (gms) after providing the specific amount of rainbow trout per day
1 | 1 | 7
2 | 2 | 10
3 | 3 | 15
4 | 4 | 16
5 | 5 | 17
6 | 6 | 18
Enter the values in the columns. In this case, the values of “Gram of rainbow trout per day (i.e. feed of animal models for 5 days)” are entered in the first column, i.e. C1, and the values of “Weight of animals (gms) after providing the specific amount of rainbow trout per day” are entered in the second column, i.e. C2. Now go to “Stat”, go to “Basic Statistics”, and click “Correlation”. In the box under “Variables”, move the variables, i.e. C1 and C2. Click the button “Options”, select “Pearson correlation” from the drop-down menu with “Method:”, and click “OK.” This gives the result of 0.945. This value of the sample correlation coefficient clearly shows a strong positive correlation.
R-squared and Adjusted R-squared
R-squared estimates the proportion of the variation in one variable (dependent variable) that is explained by the variation in a second variable (independent variable). Adjusted R-squared adjusts this statistic on the basis of the number of independent variables in the model.
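As a rough illustration, both statistics can be computed for the rainbow trout data used in the correlation example. This is a minimal sketch assuming a simple linear regression with one independent variable; the variable names are chosen here and do not come from Minitab.

```python
# Data from the correlation example: grams of trout (x) and weight (y)
grams = [1, 2, 3, 4, 5, 6]
weights = [7, 10, 15, 16, 17, 18]

n = len(grams)
x_mean = sum(grams) / n
y_mean = sum(weights) / n

ss_x = sum((x - x_mean) ** 2 for x in grams)
ss_total = sum((y - y_mean) ** 2 for y in weights)
ss_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(grams, weights))

# With one predictor, R-squared equals the squared correlation coefficient
r_squared = ss_xy ** 2 / (ss_x * ss_total)

# Adjusted R-squared penalises the number of independent variables
predictors = 1
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - predictors - 1)

print(f"R-squared = {r_squared:.4f}")
print(f"Adjusted R-squared = {adjusted_r_squared:.4f}")
```

With only six observations the adjustment is noticeable, which is why the adjusted value is lower than R-squared itself.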
Regression Analysis
Simple linear regression on the obtained data can be performed to estimate the relationships between variables.
For simple linear regression analysis, we can consider the data as mentioned above and develop the following table:
Animal Model # | Values on X-axis | (Values on X-axis)² | Values on Y-axis | (Values on Y-axis)² | (Values on X-axis)(Values on Y-axis)
1 | 1 | 1 | 7 | 49 | 7
2 | 2 | 4 | 10 | 100 | 20
3 | 3 | 9 | 15 | 225 | 45
4 | 4 | 16 | 16 | 256 | 64
5 | 5 | 25 | 17 | 289 | 85
6 | 6 | 36 | 18 | 324 | 108
Mean | 3.5 | | 13.83 | |
Sum Total | 21 = Ʃx | 91 = Ʃx² | 83 = Ʃy | 1243 = Ʃy² | 329 = Ʃxy
Square of the sum total | 441 = (Ʃx)² | | 6889 = (Ʃy)² | |
Now we need three corrected sums of squares, i.e. SST, SSX, and SSXY. So,
SST = Ʃy² – (Ʃy)²/n = 1243 – 6889/6 = 94.83
Then,
SSX = Ʃx² – (Ʃx)²/n = 91 – 441/6 = 17.5
Then,
SSXY = Ʃxy – (Ʃx)(Ʃy)/n = 329 – (21 × 83)/6 = 38.5
Now the slope of the best fit line, which is represented by b, and the intercept, which is represented by a, can be calculated by the following equations:
b = SSXY / SSX = 38.5 / 17.5 = 2.2
And
a = Mean of the values on Y-axis – b × Mean of the values on X-axis
a = 13.8 – 2.2 × 3.5 = 6.1
Now, we will calculate the regression sum of squares (SSR) and the error sum of squares (SSE) as follows:
SSR = b × SSXY
SSR = 2.2 × 38.5 = 84.7
And
SSE = SST – SSR
SSE = 94.83 – 84.7 = 10.13
A completed ANOVA table with all these values can be represented as follows:
Source | SS | df | MS | F
Regression | 84.7 | 1 | 84.7 | 33.48
Error | 10.13 | 4 | 2.53 |
Total | 94.83 | 5 | |
The calculated F value of 33.48 is much larger than the critical value of 7.71 at α=0.05 with 1 degree of freedom in the numerator and 4 degrees of freedom in the denominator. This clearly shows that we have to reject the null hypothesis that the slope of the line is zero, i.e. an increased amount of rainbow trout is significantly associated with an increased weight of the animal models. Standard errors of the slope (SEb) and intercept (SEa) can be calculated from the error mean square (MS of Error = 2.53) as follows:
SEb = √(MS of Error / SSX) = √(2.53 / 17.5) = 0.38
SEa = √(MS of Error × Ʃx² / (n × SSX)) = √(2.53 × 91 / (6 × 17.5)) = 1.48
In order to calculate 95% confidence intervals for the slope and intercept, the 2-tailed value of Student’s t with 4 degrees of freedom is required, which is 2.776. This value is then multiplied with SEa and SEb to get the 95% confidence interval for intercept and slope respectively. So,
a = 6.1 ± (SEa × 2.776) = 6.1 ± (1.48 × 2.776) = 6.1 ± 4.11
b = 2.2 ± (SEb × 2.776) = 2.2 ± (0.38 × 2.776) = 2.2 ± 1.05 = 1.15, 3.25
This shows that we are 95% confident that the average weight of animal models increases by between 1.15 and 3.25 grams per one gram increase of rainbow trout in the diet. Moreover, this confidence interval does not contain zero, thereby showing a significant relationship between the two variables at alpha level of 0.05.
Using Minitab
In order to work on Regression Analysis, we can use the following data:
Animal Model # | Gram of rainbow trout per day (i.e. feed of animal models for 5 days) | Weight of animals (gms) after providing the specific amount of rainbow trout per day
1 | 1 | 7
2 | 2 | 10
3 | 3 | 15
4 | 4 | 16
5 | 5 | 17
6 | 6 | 18
Enter the values in the columns. In this case, the values of “Gram of rainbow trout per day (i.e. feed of animal models for 5 days)” are entered in the first column, i.e. C1, and the values of “Weight of animals (gms) after providing the specific amount of rainbow trout per day” are entered in the second column, i.e. C2. Now go to “Stat”, go to “Regression”, go to “Regression”, and click “Fit Regression Model”. In the box under “Responses:” move C2, and in the box under “Continuous predictors:” move C1. Click “OK.” This gives the results.
Regression Analysis: C2 versus C1
Analysis of Variance
Source | DF | Adj SS | Adj MS | F-Value | P-Value
Regression | 1 | 84.70 | 84.700 | 33.43 | 0.004
C1 | 1 | 84.70 | 84.700 | 33.43 | 0.004
Error | 4 | 10.13 | 2.533 | |
Total | 5 | 94.83 | | |
Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF
Constant | 6.13 | 1.48 | 4.14 | 0.014 |
C1 | 2.200 | 0.380 | 5.78 | 0.004 | 1.00
In the results, F-Value is 33.43 and p-value is 0.004. Coefficient of Constant is 6.133 and Coefficient of C1, i.e. “Gram of rainbow trout per day (i.e. feed of animal models for 5 days)” is 2.2. With p-value less than 0.05, they are showing statistical significance.
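The hand calculation in this section and the Minitab output above can be cross-checked with a short Python sketch. It only repeats the corrected-sums-of-squares arithmetic; the variable names are chosen here, and the t value of 2.776 is taken from the text rather than computed.

```python
import math

# Data from the example: grams of rainbow trout per day (x) and weight (y)
x = [1, 2, 3, 4, 5, 6]
y = [7, 10, 15, 16, 17, 18]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# Corrected sums of squares
ss_t = sum_y2 - sum_y ** 2 / n
ss_x = sum_x2 - sum_x ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

# Slope and intercept of the best fit line
b = ss_xy / ss_x
a = sum_y / n - b * sum_x / n

# ANOVA quantities
ssr = b * ss_xy
sse = ss_t - ssr
mse = sse / (n - 2)
f_value = ssr / mse

# Standard errors and 95% confidence interval for the slope
se_b = math.sqrt(mse / ss_x)
se_a = math.sqrt(mse * sum_x2 / (n * ss_x))
t_critical = 2.776  # 2-tailed Student's t, 4 degrees of freedom (from the text)
slope_ci = (b - t_critical * se_b, b + t_critical * se_b)

print(f"b = {b:.3f}, a = {a:.3f}, F = {f_value:.2f}")
print(f"SEb = {se_b:.3f}, SEa = {se_a:.3f}")
print(f"95% CI for slope: {slope_ci[0]:.2f} to {slope_ci[1]:.2f}")
```

Since nothing is rounded along the way, the F value follows the Minitab figure rather than the hand-calculated one, which was obtained from rounded intermediate values.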
Logistic Regression
Using Minitab
In order to work on logistic regression, we can take the same example as noted above: In the first column C1, enter 0.50 in the first row, 0.75 in the second row, 1.00 in the third row, 1.25 in the fourth row, and so on, until 5.50 in the 20th row. This column is named as “Hours.” In the second column C2, enter 0 in the first row, 0 in the second row, 0 in the third row, and so on, until 1 in the 20th row. This column is named as “Pass.”
In the third column C3, enter 1 in all the rows up to the 20th row (as there are 20 participants and every participant has been checked once). This column is named as “Number of trials.” Now go to “Stat” tab above, go to “Regression”, go to “Binary Logistic Regression”, and select “Fit Binary Logistic Model”. In the window that appears, select “Response in event/trial format” from the drop-down menu. Write “Pass” in the box with “Event name:”, move “Pass” into the box with “Number of events:”, move “Number of trials” into the box with “Number of trials”, and “Hours” into the box under “Continuous predictors:”. You may also work on “Graphs”, “Options” and “Results”.
Click “OK.”
Binary Logistic Regression: Pass versus Hours
Method
Link function | Logit
Rows used | 20
Response Information
Variable | Value | Count | Event Name
Pass | Event | 10 | Pass
 | Non-event | 10 |
Number of trials | Total | 20 |
Analysis of Variance (Wald Test)
Source | DF | Chi-Square | P-Value
Regression | 1 | 5.73 | 0.017
Hours | 1 | 5.73 | 0.017
Model Summary
Deviance R-Sq | Deviance R-Sq(adj) | AIC
42.08% | 38.47% | 20.06
Coefficients
Term | Coef | SE Coef | VIF
Constant | -4.08 | 1.76 |
Hours | 1.505 | 0.629 | 1.00
Odds Ratios for Continuous Predictors
 | Odds Ratio | 95% CI
Hours | 4.5026 | (1.3131, 15.4392)
Regression Equation
P(Pass) = exp(Y')/(1 + exp(Y'))
Y' = -4.08 + 1.505 Hours
Goodness-of-Fit Tests
Test | DF | Chi-Square | P-Value
Deviance | 18 | 16.06 | 0.588
Pearson | 18 | 14.60 | 0.689
Hosmer-Lemeshow | 8 | 3.07 | 0.930
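The regression equation reported above can be used to estimate the probability of passing for a given number of study hours. The following Python sketch simply plugs the rounded coefficients from the output into that equation; the function name `pass_probability` is chosen here for illustration.

```python
import math

# Rounded coefficients taken from the Minitab output above
INTERCEPT = -4.08
SLOPE_HOURS = 1.505


def pass_probability(hours):
    """Return P(Pass) = exp(Y') / (1 + exp(Y')) with Y' = -4.08 + 1.505 * hours."""
    y_prime = INTERCEPT + SLOPE_HOURS * hours
    return math.exp(y_prime) / (1 + math.exp(y_prime))


# The odds ratio for one extra hour is the exponential of the slope
odds_ratio = math.exp(SLOPE_HOURS)

for hours in (1, 2, 3, 4, 5):
    print(f"{hours} hours: P(Pass) = {pass_probability(hours):.3f}")
print(f"Odds ratio per hour = {odds_ratio:.3f}")
```

The odds ratio printed here differs slightly from the tabulated one in the last decimals because the coefficient in the output is rounded to three decimal places.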
Black-Scholes model
It was developed by Fischer Black, Robert Merton and Myron Scholes in the year 1973. It is most commonly used for pricing European-style options in financial markets.
Using Minitab
The “Calculator” function could be used to work on the Black-Scholes model.
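The same kind of calculation can be sketched outside Minitab. The example below uses the standard textbook closed form for a European call option on a stock that pays no dividends, which is an assumption here because the formula itself is not reproduced in this text. All input values are hypothetical.

```python
import math
from statistics import NormalDist


def black_scholes_call(spot, strike, rate, volatility, years):
    """Price a European call (no dividends) with the Black-Scholes formula."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * volatility ** 2) * years) / (
        volatility * math.sqrt(years)
    )
    d2 = d1 - volatility * math.sqrt(years)
    cdf = NormalDist().cdf  # standard normal cumulative distribution
    return spot * cdf(d1) - strike * math.exp(-rate * years) * cdf(d2)


# Hypothetical inputs: stock price 100, strike 100, 5% risk-free rate,
# 20% annual volatility, 1 year to expiry
call_price = black_scholes_call(100, 100, 0.05, 0.20, 1)
print(f"Call price = {call_price:.2f}")
```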
Combination
A combination is a selection of items or set of objects without considering the order of selection of items or objects.
Using Minitab
In order to work on Minitab, we can use the same example of combination of 2 days at a time.
Go to “Calc” tab above, and click “Calculator”.
In the window “Calculator”, in the “Store result in variable:” in the right box enter a column, such as C3. In the box below “Expression:” enter “COMBINATIONS(7,2)” (where 7 shows the “number of items” and 2 shows the “number to choose”) and click “OK.” The result will appear in the spreadsheet. In this case, the result (combination) is 21.
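The same result can be checked with a few lines of Python, shown here only as an illustration. The sketch uses the usual counting formula n!/(k!(n-k)!) alongside the built-in `math.comb`, which is available in Python 3.8 and later.

```python
import math

n_items = 7    # number of items (days)
n_choose = 2   # number to choose

# Built-in function for the number of combinations
combination_count = math.comb(n_items, n_choose)

# The same count from the factorial formula n! / (k! (n - k)!)
by_formula = math.factorial(n_items) // (
    math.factorial(n_choose) * math.factorial(n_items - n_choose)
)

print(combination_count)  # 21
```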
Permutation
A permutation is an ordered arrangement of items or set of objects.
Using Minitab
In order to work on Minitab, we can use the same example of permutation of 2 days at a time.
Go to “Calc” tab above, and click “Calculator”.
In the window “Calculator”, in the “Store result in variable:” in the right box enter a column, such as C3. In the box below “Expression:” enter “PERMUTATIONS(7,2)” (where 7 shows the “number of items” and 2 shows the “number to choose”) and click “OK.” The result will appear in the spreadsheet. In this case, the result (permutation) is 42.
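As a quick check outside Minitab, the count of ordered arrangements can be obtained in Python. This sketch uses the counting formula n!/(n-k)! and the built-in `math.perm` (Python 3.8 and later).

```python
import math

n_items = 7    # number of items (days)
n_choose = 2   # number to choose, where order matters

# Built-in function for the number of permutations
permutation_count = math.perm(n_items, n_choose)

# The same count from the factorial formula n! / (n - k)!
perm_by_formula = math.factorial(n_items) // math.factorial(n_items - n_choose)

print(permutation_count)  # 42
```

The count is twice the number of combinations because each pair of days can be arranged in two orders.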
Even and Odd Permutations
An even permutation is obtained by composing zero or an even number of swaps of two elements; equivalently, it has an even number of inversions. An odd permutation is obtained by composing an odd number of swaps of two elements; equivalently, it has an odd number of inversions.
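One simple way to decide whether a permutation is even or odd is to count its inversions, i.e. the pairs of positions that appear in the wrong order. The following Python sketch is an illustration of that idea; the function name `permutation_parity` is chosen here.

```python
def permutation_parity(perm):
    """Return 'even' or 'odd' according to the number of inversions in perm."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]  # a pair that is out of order
    )
    return "even" if inversions % 2 == 0 else "odd"


print(permutation_parity([1, 2, 3]))  # no inversions
print(permutation_parity([2, 1, 3]))  # one swap of two elements
print(permutation_parity([2, 3, 1]))  # two swaps of two elements
```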
Circular Permutation
It refers to the total number of ways to arrange n distinct objects, such as w, x, y, and z, around a fixed circle.
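For n distinct objects the number of circular arrangements is (n – 1)!, because rotating the whole circle does not give a new arrangement. The sketch below lists the arrangements for w, x, y, and z by keeping the first object in a fixed seat; the function name is chosen here for illustration.

```python
import math
from itertools import permutations


def circular_arrangements(items):
    """List arrangements around a circle, treating rotations as identical."""
    first, rest = items[0], items[1:]
    # Fixing the first item removes the arrangements that are mere rotations
    return [(first,) + order for order in permutations(rest)]


arrangements = circular_arrangements(["w", "x", "y", "z"])
for arrangement in arrangements:
    print(arrangement)
print(len(arrangements), math.factorial(4 - 1))
```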
Survival Analysis
A model for time from the start of a study (baseline) to the happening of a certain event. For example, the study may start at birth and end when the event, i.e. death, occurs. Survival analysis gives survival data. This data consists of completely observed data and censored data, which is also known as missing data.
Survival analysis
A model for time from start of a study (baseline, i.e. first event) to the happening of a certain event (second event).
Examples of baseline and event: birth and death; manufacture of component and component failure; first heart attack and second heart attack.
Survival data consists of 2 types of data:
1. Completely observed data
2. Censored data (right censoring): missing data that may occur during the study, when (1) the subject (also known as sample) doesn’t have the event of interest (i.e. second event), and/or (2) the data is lost.
Survival analysis gives the probability of surviving (probability of survival), which can be shown in the form of a graph.
[Figure: survival probability (0 to 1) plotted against Time (years), from 0 to 120]
This graph shows that: At 30 years, the survival probability is about 0.95, i.e. 95%, At 60 years, the survival probability is about 0.4, i.e. 40%, At 100 years, the survival probability is about 0.05, i.e. 5%.
Kaplan-Meier method
One of the most commonly used nonparametric methods for survival analysis is the Kaplan-Meier method, which is also known as the Product Limit method.
Suppose we have the following data in which 12 participants (over the age of 50 years) were studied for 25 years until they died, were lost to follow-up, or the study ended (and remember this data is just for understanding and can’t take the place of original research).
Participant number | Year of death | Year of last contact
1 | - | 25
2 | - | 16
3 | 4 | -
4 | - | 8
5 | - | 17
6 | - | 25
7 | - | 10
8 | - | 10
9 | 14 | -
10 | 2 | -
11 | - | 14
12 | - | 18
This table shows that 3 participants died before the completion of the study, and 2 participants completed the study. The Kaplan-Meier method utilizes this formula, St(i) = St(i-1) × ((Ni – Di)/Ni), to compute survival probability. In this formula, Ni shows number at risk, Di shows number of deaths, St(i) shows survival probability, and St(i-1) shows the survival probability just previous to the present one. We can use the life table approach for the Kaplan-Meier method.
Time (years) | Number at risk (Ni) | Number of deaths (Di) | Number censored (C) | Survival probability (St(i) = St(i-1) × ((Ni – Di)/Ni))
2 | 12 | 1 | - | 1*((12-1)/12)=0.917
4 | 11 | 1 | - | 0.917*((11-1)/11)=0.834
8 | 10 | - | 1 | 0.834*((10-0)/10)=0.834
10 | 9 | - | 2 | 0.834*((9-0)/9)=0.834
14 | 7 | 1 | 1 | 0.834*((7-1)/7)=0.715
16 | 5 | - | 1 | 0.715*((5-0)/5)=0.715
17 | 4 | - | 1 | 0.715*((4-0)/4)=0.715
18 | 3 | - | 1 | 0.715*((3-0)/3)=0.715
25 | 2 | - | 2 | 0.715*((2-0)/2)=0.715
Roughly, it can also be represented by the following graph:
[Figure: Survival probability (0 to 1) plotted against Time (years), from 0 to 30]
This shows that at 4 years after the start of study, the survival probability is about 0.834, i.e. 83.4%, and the survival probability at 17 years and 18 years is the same, i.e. 0.715 or 71.5%.
Using Minitab
In order to work on Minitab, we can use the same example as noted above. However, the above table is written as:
Observation | Status
25 | 1
16 | 0
4 | 1
8 | 0
17 | 0
25 | 1
10 | 0
10 | 0
14 | 1
2 | 1
14 | 0
18 | 0
In this table, “0” shows no event (i.e. a censored observation), and “1” shows the occurrence of event, i.e. the time related to the participants who either died or completed the study. Enter these values in the first two columns, i.e. C1 and C2. Now go to “Stat”, go to “Reliability/Survival”, go to “Distribution Analysis (Right Censoring)”, and select “Nonparametric Distribution Analysis”. In the window, “Nonparametric Distribution Analysis-Right Censoring”, move C1 to the box below “Variables:”. Press the button “Censor”, move C2 to the box below “Use censoring columns:”, enter “0” in the box with “Censoring value:”, and click “OK”. Click “OK” again. This gives us the results and graph as follows:
Kaplan-Meier Estimates
Time | Number at Risk | Number Failed | Survival Probability | Standard Error | 95.0% Normal CI Lower | 95.0% Normal CI Upper
2 | 12 | 1 | 0.916667 | 0.079786 | 0.760290 | 1.00000
4 | 11 | 1 | 0.833333 | 0.107583 | 0.622475 | 1.00000
14 | 7 | 1 | 0.714286 | 0.143705 | 0.432629 | 0.99594
25 | 2 | 2 | 0.000000 | 0.000000 | 0.000000 | 0.00000
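This shows that at 4 years after the start of study, the survival probability is about 0.834, i.e. 83.4%, and the survival probability at 17 years and 18 years is the same, i.e. 0.715 or 71.5% (Minitab reports 0.833 and 0.714; the small differences come from rounding the intermediate values in the hand calculation). This graph assumes that after 25 years, there is no further survival probability. The estimate drops to 0 at 25 years because the two participants who completed the study were entered with status 1; if they were entered as censored (status 0), the estimate would stay at about 0.714 at 25 years, as in the life table.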
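The life table approach described in this section can be sketched in Python as follows. This is only an illustration, not the Minitab procedure. Unlike the Minitab worksheet, the two participants who completed the study at 25 years are coded here as censored (status 0), which is an assumption made so that the result follows the life table.

```python
# (time in years, status): status 1 = death, status 0 = censored.
# Assumption: the two participants who completed the study (25 years)
# are treated as censored, as in the life table.
observations = [
    (25, 0), (16, 0), (4, 1), (8, 0), (17, 0), (25, 0),
    (10, 0), (10, 0), (14, 1), (2, 1), (14, 0), (18, 0),
]


def kaplan_meier(data):
    """Return rows of (time, at risk, deaths, censored, survival probability)."""
    survival = 1.0
    at_risk = len(data)
    rows = []
    for time in sorted({t for t, _ in data}):
        deaths = sum(1 for t, s in data if t == time and s == 1)
        censored = sum(1 for t, s in data if t == time and s == 0)
        # St(i) = St(i-1) * ((Ni - Di) / Ni)
        survival *= (at_risk - deaths) / at_risk
        rows.append((time, at_risk, deaths, censored, survival))
        # Those who died or were censored leave the risk set
        at_risk -= deaths + censored
    return rows


life_table = kaplan_meier(observations)
for time, n_i, d_i, c_i, s_t in life_table:
    print(f"{time:>2} {n_i:>2} {d_i} {c_i} {s_t:.3f}")
```

Because the survival probability is not rounded between steps, the printed values follow the Minitab estimates more closely than the hand-calculated ones.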
Bonus Topics
Most commonly used non-normal distributions in health, education, and social sciences
The most commonly used non-normal distributions in health sciences, education, and social sciences have been obtained from the paper “Non-normal Distributions Commonly Used in Health, Education, and Social Sciences: A Systematic Review” by Roser Bono, María J. Blanca, Jaume Arnau, and Juana Gómez-Benito, published in the journal Frontiers in Psychology.
Circular permutation in Nature
For example, some proteins have a changed (circularly permuted) order of amino acids in their peptide sequence.
Time Series
A time series is a series of data points that are recorded at specific times or time intervals. It is represented by a timeplot in which values (variables) are represented on the y-axis and time is displayed on the x-axis. Time series help in showing variations in data with the passage of time. This is useful in finding the patterns related to the data and in making predictions regarding the values (variables) in relation to the time. The patterns of time series are often difficult to analyze for reasons such as a large number of data points at equal intervals. In these situations, the technique of “smoothing” is used, in which a smooth line is drawn through the data instead of the series of individual dots. This technique is also considered important in the prediction of future events, for example, predicting whether the stock market would go up or down. It is also helpful in spotting the outliers. One of the simple ways of smoothing the timeplots is by utilizing the moving average. In time series data, periodic fluctuations often appear, and these fluctuations are referred to as “seasonality”. It is important to note that seasonality in time series may occur over any time period in the data, for example, the daily air temperature may increase in the middle of the day and decrease during the night.
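The moving average mentioned above can be sketched in a few lines of Python. The eight values are the ones used in the Minitab example of this section; the window length of 3 and the function name `moving_average` are assumptions made for illustration.

```python
# Values recorded on days 1 to 8 (same data as the Minitab example)
values = [3, 2, 7, 5, 10, 9, 8, 5]


def moving_average(series, window):
    """Return the simple moving average of series for the given window length."""
    return [
        sum(series[start:start + window]) / window
        for start in range(len(series) - window + 1)
    ]


smoothed = moving_average(values, window=3)
print(smoothed)
```

Each smoothed value is the mean of three neighbouring observations, so short-term jumps are damped while the overall pattern remains visible.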
Using Minitab
Suppose we have the following data:
Days | Values
1 | 3
2 | 2
3 | 7
4 | 5
5 | 10
6 | 9
7 | 8
8 | 5
Put these values in the columns C1 and C2. Go to “Stat” tab above, go to “Time Series”, and select “Time Series Plot”.
Select “Simple” and click “OK.” Move the columns C1 and C2 to the box under “Series:” and click “Time/Scale”. Select “Calendar” under “Time Scale”, and click “OK.” Again click “OK.” The following graph is obtained:
[Figure: Time Series Plot of C2; x-axis: Day, y-axis: C2]
In order to get smoothing (line), go to “Stat” tab above, go to “Time Series”, and select “Time Series Plot”.
Select “Simple” and click “OK.” Move the columns C1 and C2 to the box under “Series:” and click “Time/Scale”. Select “Calendar” under “Time Scale”, and click “OK.” Now click “Data View”, select “Smoother”, select “Lowess”, and click “OK.”
Again click “OK.” The following graph is obtained:
[Figure: Time Series Plot of C2 with Lowess smoother; x-axis: Day, y-axis: C2]
In order to get information about the trend, go to “Stat” tab above, go to “Time Series”, and select “Trend Analysis”.
Move C2 to the box with “Variable:”. Click “Time”, select “Calendar:” (Day) under “Time Scale”, and click “OK.”
Again click “OK.” The following graph is obtained:
This graph shows an upward trend.
Monte Carlo Simulation
The Monte Carlo method is a numerical method of solving mathematical problems by generating a large number of random samples. It is used for obtaining numerical solutions to problems that are too complicated to solve analytically.
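A classic small illustration of the Monte Carlo idea is the estimation of π by random sampling. The sketch below is an example chosen here, not one taken from this text: random points are drawn in a unit square, and the fraction falling inside the quarter circle of radius 1 approximates π/4. The sample size and seed are arbitrary.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate pi from the share of random points inside a quarter circle."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4 * inside / samples


pi_estimate = estimate_pi(100_000)
print(pi_estimate)
```

Increasing the number of samples generally brings the estimate closer to the true value, at the cost of more computation.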
Density Estimation
Density estimation is the process of estimating a continuous density field from a discretely sampled set of data points obtained from that density field.
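One common way of doing this is kernel density estimation, which is named here as an example technique and is not described in this text. The sketch below places a Gaussian curve on every data point and averages them. The sample values and the bandwidth of 0.5 are hypothetical.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at x with Gaussian kernels."""
    scale = 1 / (len(data) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return scale * sum(
            math.exp(-0.5 * ((x - point) / bandwidth) ** 2) for point in data
        )

    return density


# Hypothetical sample of observations
sample = [2.1, 2.5, 3.0, 3.2, 4.8, 5.1]
density = gaussian_kde(sample, bandwidth=0.5)

# Numerical check that the estimated density integrates to about 1
step = 0.01
area = sum(density(-3 + step * k) for k in range(1300)) * step

print(f"density at 3.0 = {density(3.0):.3f}")
print(f"area under the curve = {area:.3f}")
```

A smaller bandwidth gives a more detailed but noisier curve, and a larger bandwidth gives a smoother one.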
Decision Tree
It is used for classification, prediction, estimation, clustering and visualization.
Decision tree for test statistics is as follows:
Minitab 18 also provides assistance in statistical procedures with the help of decision trees:
Meta-analysis
The following information has been obtained from “Basics of Meta-analysis with Basic Steps in R” written by Usman Zafar Paracha. It is available here: https://amzn.to/31ff3Kj Meta-analysis is a quantitative and systematic method to combine the results of previous studies to get conclusions.
The steps of meta-analysis are as follows:
1. Formulation of research question
2. Searching the required articles and information: working on inclusion and exclusion criteria, and utilizing Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) to note the flow of information
3. Collection of abstract data, such as age of participants, sample size, outcomes, etc.
4. Statistical analysis:
Calculation of effect size, i.e. the size of the difference of effect between two groups
Combined results – Overall Effect size / Pooled Statistical Results
Development of Forest Plot
Heterogeneity statistics (Q, Tau-squared, I-squared), including the ratio of observed variation to within-study variance, and the proportion of observed variation that can be caused by actual difference between studies
Plots, such as Funnel plot, Baujat plot, L’Abbé plot
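The statistical analysis step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the effect sizes and variances are hypothetical, the pooled result uses fixed-effect inverse-variance weights, and Tau-squared uses the DerSimonian-Laird estimator. None of these choices is taken from this text.

```python
# Hypothetical effect sizes and within-study variances of four studies
effects = [0.30, 0.45, 0.10, 0.60]
variances = [0.04, 0.09, 0.05, 0.16]

# Inverse-variance weights: more precise studies get more weight
weights = [1 / v for v in variances]
total_weight = sum(weights)

# Overall (pooled) effect size
pooled_effect = sum(w * e for w, e in zip(weights, effects)) / total_weight

# Q: weighted squared deviations of the study effects from the pooled effect
q_statistic = sum(w * (e - pooled_effect) ** 2 for w, e in zip(weights, effects))
degrees_of_freedom = len(effects) - 1

# I-squared: share of observed variation attributed to real differences (%)
if q_statistic > 0:
    i_squared = max(0.0, (q_statistic - degrees_of_freedom) / q_statistic) * 100
else:
    i_squared = 0.0

# Tau-squared (DerSimonian-Laird): estimated between-study variance
c_value = total_weight - sum(w * w for w in weights) / total_weight
tau_squared = max(0.0, (q_statistic - degrees_of_freedom) / c_value)

print(f"Pooled effect = {pooled_effect:.3f}")
print(f"Q = {q_statistic:.3f}, I-squared = {i_squared:.1f}%, Tau-squared = {tau_squared:.4f}")
```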
Important Statistical Techniques/Procedures used in Medical Research
The following table shows some important statistical techniques or procedures commonly used in medical research. This information has been taken from the paper titled, “Statistical trends in the Journal of the American Medical Association and implications for training across the continuum of medical education” by Arnold, Braganza, Salih, & Colditz published in PLoS ONE in 2013.
Table 5: Important Statistical techniques/procedures used in Medical Research
Statistical Techniques/Procedures | Comments (about the increased, decreased or static use in the past 3 decades)
Pearson correlation coefficient | Its use is declining at a faster rate
Mantel-Haenszel | Its use is declining
ANOVA | Its use is declining
Simple Linear regression | Its use is almost static
Fisher Exact | Its use is almost static
t-test | Its use is almost static
Logistic regression | Its use is almost static
Chi-square | Its use is almost static
Descriptive Statistics | Used most commonly but its use is declining
Morbidity and Mortality | Its use is almost static
Low-level Statistical measures | Used most commonly but its use is almost static
Transformation | Its use is almost static
Multiple comparison | Its use is increasing slowly
Epidemiologic statistics | Its use is increasing slowly
Poisson Regression | Its use is increasing
p-trend | Its use is increasing
Log-rank test | Its use is increasing at a faster rate
Wilcoxon Rank | Its use is increasing
Non-parametric test | Its use is increasing
Intention to treat | Its use is increasing
Kaplan Meier | Its use is increasing
Power | Its use is increasing at a faster rate
Cox models | Its use is increasing
Multi-level modeling | Its use is increasing at a faster rate
Survival analysis | Its use is increasing at a faster rate
Multiple regression | Its use is increasing
Sensitivity analysis | Its use is increasing