Colorful Statistics with Basic Steps in Minitab® 19


English Pages 359

Usman Zafar Paracha M. Phil. Pharmaceutics, Rawalpindi, Pakistan (2019)

Colorful Statistics and Minitab® 19 – Usman Zafar Paracha

This book will help students learn and apply some basic concepts of statistics while using Minitab® 19.

Any feedback will be highly appreciated.

Usman Zafar Paracha, Owner of SayPeople.com
https://www.facebook.com/usmanzparacha

Some words from the author

I tried to make this book on statistics as informative and illustrative as possible, especially for beginners in statistics.

"Portions of information contained in this publication/book are printed with permission of Minitab Inc. All such material remains the exclusive property and copyright of Minitab Inc. All rights reserved."

"MINITAB® and all other trademarks and logos for the Company's products and services are the exclusive property of Minitab Inc. All other marks referenced remain the property of their respective owners. See minitab.com for more information."

This book may also use some trademarked names without the trademark symbol. However, they are used only in an editorial context, with no intention of trademark infringement.

It is important to note that the calculations and examples used in this book cannot take the place of actual research. Statistics has to be used under the guidance of experts.

People with authentic comments and/or feedback (on Amazon) can ask me questions or send me a “Message” here: https://www.facebook.com/usmanzparacha, and I will try to answer them.

Contents

Hypothesis
Hypothesis testing
Frequency
    Using Minitab
Sample and Population
Type I Error and Type II Error
Measuring the Central Tendencies (mean, median, and mode), and Range
    Using Minitab
Geometric Mean
    Using Minitab
Grand Mean
    Using Minitab
Harmonic Mean
    Using Minitab
Mean Deviation
    Using Minitab
Mean Difference
    Using Minitab
Root mean square
    Using Minitab
Sample mean and population mean
Types of data in statistics
Data Collection and its types
Sampling plan
Sampling Methods
Required Sample Size
    Using Minitab
Simple Random Sampling
Cluster Sampling
Stratified Sampling
Graphs and Plots
    Bar Graph
        Using Minitab
    Pie Chart
        Using Minitab
    Scatter plot or Bubble chart
    Different patterns of data in bubble charts
        Using Minitab
    Dot plot
        Using Minitab
    Matrix plot
        Using Minitab
    Pareto Chart
        Using Minitab
    Histogram
        Using Minitab
    Stem and Leaf plot
        Using Minitab
    Box plot
        Using Minitab
Outlier
    Using Minitab
Quantile
    Using Minitab
Standard deviation and Variance
    Using Minitab
Range Rule of Thumb
Probability
Bayesian statistics
    Using Minitab
Reliability Coefficient
Cohen's Kappa Coefficient
Fleiss’ Kappa Coefficient
    Using Minitab
Cronbach's alpha
    Using Minitab
Coefficient of variation
    Using Minitab
Chebyshev's Theorem
    Using Minitab
Factorial
    Using Minitab
Distribution, and Standardization
    Using Minitab
Prediction interval
    Using Minitab
Tolerance interval
    Using Minitab
Parameters to describe the form of a distribution
    Skewness
    Kurtosis
        Using Minitab
Different functions of distributions
    Probability Density Function
        Using Minitab
    Cumulative Distribution Function
        Using Minitab
Types of Distribution
    Binomial Distribution
        Using Minitab
    Chi-squared Distribution
        Using Minitab
    Continuous Uniform Distribution
        Using Minitab
    Cumulative Poisson Distribution
        Using Minitab
    Exponential Distribution
        Using Minitab
    Normal Distribution
        Using Minitab
    Poisson Distribution
        Using Minitab
    Beta Distribution
        Using Minitab
    F Distribution
        Using Minitab
    Gamma Distribution
        Using Minitab
    Negative Binomial Distribution
        Using Minitab
    Gumbel Distribution
        Using Minitab
    Hypergeometric Distribution
        Using Minitab
    Inverse Gamma Distribution
        Using Minitab
    Log Gamma Distribution
        Using Minitab
    Laplace Distribution
        Using Minitab
    Geometric Distribution
        Using Minitab
Level of Significance and confidence level
Statistical Estimation
Interval Estimation
    Using Minitab
Best Point estimation
    Using Minitab
Correlation
Central Limit Theorem
Standard Error of the Mean
    Using Minitab
Statistical Significance
Tests for Non-normally distributed data
One tail and two tail tests
    Mood’s Median Test
        Using Minitab
    Goodness of Fit
    Chi-square test
        Using Minitab
    McNemar Test
        Using Minitab
    Kolmogorov-Smirnov Test (KS-Test)
        Using Minitab
    One-tailed test, two-tailed test, and Wilcoxon Rank Sum Test / Mann Whitney U Test
        Using Minitab
    The Sign Test
        Using Minitab
    Wilcoxon Signed Rank Test
        Using Minitab
    The Kruskal-Wallis Test
        Using Minitab
        Degrees of Freedom
    The Friedman Test
        Using Minitab
Tests for Normally distributed data
    Unpaired “t” test
        Using Minitab
    Paired “t” test
        Using Minitab
    Analysis of Variance (ANOVA)
        Sum of Squares
        Residual Sum of Squares
    One way ANOVA
        Using Minitab
    Two way ANOVA
        Using Minitab
    Different types of ANOVAs
        Using Minitab – General MANOVA
Factor Analysis
Path Analysis
Structural Equation Modeling
Effect size
    Using Minitab
Odds Ratio and Mantel-Haenszel Odds Ratio
    Using Minitab
Correlation Coefficient
    Using Minitab
R-squared and Adjusted R-squared
Regression Analysis
    Using Minitab
Logistic Regression
    Using Minitab
Black-Scholes model
    Using Minitab
Combination
    Using Minitab
Permutation
    Using Minitab
Even and Odd Permutations
Circular Permutation
Survival Analysis
    Kaplan-Meier method
        Using Minitab
Bonus Topics
    Most commonly used non-normal distributions in health, education, and social sciences
    Circular permutation in Nature
    Time Series
        Using Minitab
    Monte Carlo Simulation
    Density Estimation
    Decision Tree
    Meta-analysis
    Important Statistical Techniques/Procedures used in Medical Research

Hypothesis Statistics is an art as well as science, and needs help of both fields for explanation. You need stronger imagination as an artist and well-designed research as a scientist. Let’s start with a concept of hypotheses. According to an ancient Chinese myth, there is an entirely different world behind mirrors. That world has its own creatures, and is known as Fauna of mirrors. So, if a person is of opinion that fauna of mirrors actually exist, and he wants to perform a research on fauna of mirrors, his hypothesis will be that fauna of mirrors actually exist. This is known as “research hypothesis” or “alternative hypothesis” as the person is doing research on this hypothesis. It is represented by Ha or H1.

On the other hand, there is another opinion that there is nothing like the fauna of mirrors. This opinion can be considered the “null hypothesis”, as it negates the statement of the research hypothesis and states that the research hypothesis is not a commonly observed phenomenon. The null hypothesis is represented by Ho or H0. Research usually supports the alternative hypothesis by rejecting the null hypothesis. So, in order to support the existence of the fauna of mirrors after conducting research, the null hypothesis has to be rejected.


In short, according to the null hypothesis, nothing has changed and no significant new thing can be found (anywhere or in any group), whereas according to the alternative hypothesis, some significant change has occurred or some significant new thing can be found (somewhere or in some group).

Hypothesis testing


Frequency

Absolute frequency: the number of times an event or incidence occurred, usually denoted $n_i$.

Relative frequency: the number of times a specific event occurred divided by the total number of events, $f_i = n_i / N$.

Cumulative frequency: the sum of all previously presented frequencies, $n_1 + n_2 + \dots + n_i$.

Using Minitab

In order to work on the absolute frequency, relative frequency, and cumulative frequency, we can use these numbers: 1,1,1,2,2,3,3,3,3,3,4,5,5,5,5. Enter the values in the first column, i.e. from row 1 to row 15 in this case. Go to the “Stat” tab above, go to “Tables”, and click “Tally Individual Variables”.


In the window “Tally Individual Variables”, move column 1, i.e. C1 in this case, from the left box to the “Variables:” in the right box by clicking on C1 and clicking the “Select” button below. You may also double-click C1, and this automatically takes C1 to the box. Check “Counts”, “Percents”, and “Cumulative counts” under “Display.” Click “OK.” Results appear in the screen above the spreadsheet. The results show “Count” for absolute frequency; “Percent” for percentage of relative frequency (i.e. 0.2 x 100 = 20 in the first place; 0.1333 x 100 = 13.33 in the second place, and so on), and “CumCnt” for cumulative frequency.
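As a cross-check outside Minitab, the same tally can be sketched in Python. This is only an illustrative sketch using the standard library; the variable names are hypothetical and not part of Minitab.

```python
from collections import Counter
from itertools import accumulate

# The 15 example values from the text
values = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5]

counts = Counter(values)
categories = sorted(counts)
total = len(values)

absolute = [counts[c] for c in categories]      # absolute frequency n_i
percent = [100 * n / total for n in absolute]   # relative frequency as a percentage
cumulative = list(accumulate(absolute))         # cumulative frequency

print("Value  Count  Percent  CumCnt")
for c, n, p, cum in zip(categories, absolute, percent, cumulative):
    print(f"{c:>5}  {n:>5}  {p:>7.2f}  {cum:>6}")
```

The three lists correspond to the “Count”, “Percent”, and “CumCnt” columns described above.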

Sample and Population

A sample is a small part of something that is used to represent the whole (the population).


[Figure: a population (N) and two samples (n), one drawn by random selection and one drawn by selecting only blue colored items. Samples help in making inferences about the population.]

Type I Error and Type II Error

Consider the hypotheses mentioned earlier. Suppose initial findings on the hypotheses show that the fauna of mirrors actually exist. So, it can be said that the null hypothesis is rejected. However, it is important to note that the results from initial findings could be false (or they could be true) due to the presence of errors in the research. Those errors are known as α error and β error.


α error is also known as Type I error or false positive. Here we incorrectly reject the null hypothesis, i.e. the statement “there is nothing like fauna of mirrors” seems wrong after performing an experiment, when in reality it is right. It is considered a false positive because we think that the alternative hypothesis is right (positive). This type of error can be detected by performing further tests. Moreover, choosing a smaller level of significance (α) reduces the chance of a type I error.

On the other hand, β error is also known as Type II error or false negative. Here we incorrectly accept (that is, fail to reject) the null hypothesis when in reality it is wrong.


Type II error is considered more serious than type I error because, after this error, nobody would do further research on the alternative hypothesis. Errors of a higher kind could also be present. For example, a type III error occurs when a researcher or an investigator gives the right answer to the wrong question. It is also said to occur when an investigator or researcher correctly rejects the null hypothesis for the wrong reason/s.


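Measuring the Central Tendencies (mean, median, and mode), and Range

Example data (13 values): 1, 2, 2, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11

Mean: the sum of all data values divided by the number of data values.

Sample mean: $\bar{x} = \frac{\sum x}{n}$, where $n$ is the number of data values in the sample. For the example data, $\bar{x} = (1+2+2+3+4+5+6+7+8+8+9+10+11)/13$.

Population mean: $\mu = \frac{\sum x}{N}$, where $N$ is the number of data values in the population.

Median: the middle value (or the mean of the two middle values); 50% of the values lie above it and 50% below.

Mode: the value/s which appear most often.

Range = highest value – lowest value = 11 – 1 = 10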

Using Minitab

In order to find the mean, median, mode, and range, we can use the numbers (as noted above). So, the numbers are 1,2,2,3,4,5,6,7,8,8,9,10, and 11. Enter the values in the first column, i.e. from row 1 to row 13 in this case. Go to the “Stat” tab above, go to “Basic Statistics”, and click “Display Descriptive Statistics”. In the window “Display Descriptive Statistics”, move column 1, i.e. C1 in this case, from the left box to the “Variables:” in the right box by clicking on C1 and clicking the “Select” button below.


Click “Statistics”. A window “Display Descriptive Statistics: Statistics” appears. Check “Mean”, “Median”, “Mode”, and “Range”, and any other descriptive statistics you want. Click “OK.” Again click “OK.” Results appear in the screen above the spreadsheet. In this case, the results are mean = 5.846; median = 6; mode = 2, 8; and range = 10.
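The same four statistics can be checked with a short Python sketch. It uses the standard `statistics` module; the variable names are illustrative only.

```python
import statistics

# The 13 example values from the text
data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11]

mean = statistics.mean(data)          # sum of values divided by their count
median = statistics.median(data)      # middle value of the sorted data
modes = statistics.multimode(data)    # every value tied for most frequent
data_range = max(data) - min(data)    # highest value minus lowest value

print(round(mean, 3), median, modes, data_range)
```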

Geometric Mean

Geometric mean refers to the nth root of the product of a set of n numbers: for example, the square root of the product of 2 numbers, the cube root of the product of 3 numbers, and so on.


Using Minitab


In order to find the geometric mean, we can use the example of the percentage increase in the price of land (as noted above). So, the increases are 10% in the first instance, i.e. (making 110% or) 1.1; then 20%, i.e. (making 120% or) 1.2; and finally 50%, i.e. (making 150% or) 1.5. Enter the values, 1.1, 1.2, and 1.5, in the first column, i.e. from row 1 to row 3 in this case. Go to the “Calc” tab above, and click “Calculator”.

In the window “Calculator”, in the “Store result in variable:” in the right box enter a column, such as C5. In the box below “Expression:” enter “GMEAN(C1)” and click “OK.” The result will appear in the spreadsheet. In this case, the result (geometric mean) is 1.25571. So, the average growth factor per year in our example is about 1.256, i.e. an average increase of about 25.6% per year.
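The calculation can be reproduced by hand as the cube root of the product of the three growth factors. A minimal Python sketch, with hypothetical variable names:

```python
import math

# Growth factors for increases of 10%, 20%, and 50%
factors = [1.1, 1.2, 1.5]

# nth root of the product of n numbers
geometric_mean = math.prod(factors) ** (1 / len(factors))

print(round(geometric_mean, 5))
```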

Grand Mean

The grand mean of a set of samples is the sum of all the values in the data divided by the total number of values. When all samples have the same size, this equals the mean of the sample means.


Using Minitab

Although Minitab has no direct function for the calculation of the grand mean, it can easily be calculated by using the “Calculator”. In order to calculate the grand mean, we can take the example as noted above. So,

City      Person #1   Person #2   Person #3   Person #4   Person #5
City A    10          60          70          80          40
City B    50          20          80          120         60
City C    80          30          90          100         70

Enter the values in the columns. In this case, it is C1, C2, C3, C4, and C5. Go to the “Calc” tab above, and click “Calculator”.

In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C10. In the box below “Expression:” enter “(Mean(C1) + Mean(C2) + Mean(C3) + Mean(C4) + Mean(C5))/5” and click “OK.” The result will appear in the spreadsheet in the column C10. In this case, the result (grand mean) is 64.
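The grand mean can be checked in two ways: directly from all 15 values, and as the mean of the five column means used in the Calculator expression. A short Python sketch, with illustrative names only:

```python
# Rows are cities, columns are Person #1 to Person #5 (C1 to C5 in the worksheet)
cities = {
    "City A": [10, 60, 70, 80, 40],
    "City B": [50, 20, 80, 120, 60],
    "City C": [80, 30, 90, 100, 70],
}

# Direct definition: sum of all values divided by the number of values
all_values = [value for row in cities.values() for value in row]
grand_mean = sum(all_values) / len(all_values)

# Calculator-style route: average the five column means
columns = list(zip(*cities.values()))
mean_of_column_means = sum(sum(col) / len(col) for col in columns) / len(columns)

print(grand_mean, round(mean_of_column_means, 6))
```

Both routes agree here because every column holds the same number of values.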

Harmonic Mean

Harmonic mean is the number of observations in a data set divided by the sum of the reciprocals of the observations, i.e. of the numbers in the series.


Using Minitab

In order to calculate the harmonic mean, we can take the example as noted above. So, the numbers are 5, 8, and 13. Enter the values in the column C1 from row 1 to row 3. Go to the “Calc” tab above, and click “Calculator”.

In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C10. In the box below “Expression:” enter “N(C1)/(sum(1/C1))” and click “OK.”

The result will appear in the spreadsheet in the column C10. In this case, the result (harmonic mean) is 7.46411. In case of weighted harmonic mean, where the value 5 carries a weight of 2, we enter the values 5, 5, 8, and 13 in C1 in the rows from 1 to 4, and again use “N(C1)/(sum(1/C1))” in the box below “Expression:” in the window “Calculator.” Click “OK.” The result will appear in the spreadsheet in the specified column, i.e. C10. In this case, the result (weighted harmonic mean) is 6.64537.
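The expression above translates directly into a few lines of Python. This is a sketch with a hypothetical helper name; repeating the value 5 stands in for giving it a weight of 2.

```python
def harmonic_mean(values):
    """Number of observations divided by the sum of their reciprocals."""
    return len(values) / sum(1 / v for v in values)

plain = harmonic_mean([5, 8, 13])        # three observations
weighted = harmonic_mean([5, 5, 8, 13])  # value 5 entered twice (weight 2)

print(round(plain, 5), round(weighted, 5))
```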

Mean Deviation

Mean of absolute deviations of observed or variable values from the mean value.

$\text{Mean Deviation} = \frac{\sum |x - \mu|}{N}$

where $x$ is a variable value, $\mu$ is the mean value, and $N$ is the number of observations or values. The vertical bars (||) represent absolute value, i.e. ignoring minus signs.

Example

Suppose we have three values: 3, 4, 6. Their mean is

$\mu = \frac{3 + 4 + 6}{3} = 4.33$

The sum of the absolute differences is

$\sum |x - \mu| = |3 - 4.33| + |4 - 4.33| + |6 - 4.33| = 3.33$

Mean deviation is

$\text{Mean Deviation} = \frac{3.33}{3} = 1.11$

Using Minitab

In order to work on Minitab, we can use the same example as noted above. So, the example numbers are 3, 4, and 6. Enter the values in the column C1 in the rows from 1 to 3.

Go to the “Calc” tab above, and click “Calculator”.

In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C10. In the box below “Expression:” enter “(ABS(3-Mean(C1))+ABS(4-Mean(C1))+ABS(6-Mean(C1)))/3”. Here, ABS is the absolute value function that changes all negative values into positive values. Click “OK.” The result will appear in the spreadsheet in the column C10. In this case, the result (mean deviation) is 1.111.
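For comparison, the same mean deviation can be computed for any list of values without typing each number into the expression. A minimal Python sketch, with illustrative names:

```python
values = [3, 4, 6]

mean = sum(values) / len(values)

# Average of the absolute distances from the mean
mean_deviation = sum(abs(x - mean) for x in values) / len(values)

print(round(mean, 2), round(mean_deviation, 3))
```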

Mean Difference

It is also known as the difference in means. It determines the absolute difference between the two groups’ mean values, thereby helping in knowing the change on average. So,

Mean difference = Mean of one group – Mean of the other group

Using Minitab

Mean difference can easily be calculated in Minitab. Suppose in one group, we have the numbers 3, 4, and 6, and in the other group, we have the numbers 2, 3, and 4. Enter the numbers of one group in C1 and of the other group in C2. Go to the “Calc” tab above, and click “Calculator”.

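In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C10. In the box below “Expression:” enter “C1-C2”. Click “OK.” The result will appear in the spreadsheet in the column C10. In this case, the results are 1, 1, and 2. These are the row-by-row differences; their average, 1.33333, is the mean difference. To obtain the mean difference directly, as defined above, enter “Mean(C1)-Mean(C2)” as the expression instead.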
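The distinction between row-by-row differences and the difference of the two group means can be seen in a short Python sketch. The names are illustrative only.

```python
group_1 = [3, 4, 6]
group_2 = [2, 3, 4]

# Row-by-row differences, as produced by subtracting one column from the other
row_differences = [a - b for a, b in zip(group_1, group_2)]

# Mean of one group minus mean of the other group
mean_difference = sum(group_1) / len(group_1) - sum(group_2) / len(group_2)

print(row_differences, round(mean_difference, 5))
```

With equal group sizes, the average of the row differences equals the difference of the means.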

Root mean square

Root mean square is the square root of the mean of the squares of the observations or values, i.e. the square root of (the sum of the squares divided by the total number of values).

Using Minitab

In order to work on Minitab, we can use the numbers 3, 4, and 6 as an example. Enter the values in the column C1 in the rows from 1 to 3. Go to the “Calc” tab above, and click “Calculator”.

In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C10. In the box below “Expression:” enter “sqrt(Mean(C1^2))”. Click “OK.”

The result will appear in the spreadsheet in the column C10. In this case, the result (root mean square) is 4.50925.
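The root mean square of the same three numbers can be verified with a few lines of Python. This is a sketch with illustrative names.

```python
import math

values = [3, 4, 6]

# Square each value, average the squares, then take the square root
root_mean_square = math.sqrt(sum(x * x for x in values) / len(values))

print(round(root_mean_square, 5))
```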

Sample mean and population mean

Sample mean and population mean are two different things. The sample mean is represented by x̄, and the population mean is represented by μ. A population consists of all the elements from a collection of data (N), and a sample consists of some observations from that data (n).


Types of data in statistics

Data Collection and its types

Data collection is the process of collecting information in relation to selected variables. There are two main types: primary data and secondary data.

Sampling plan

A sampling plan is the detailed outline of study measurements.

Sampling Methods

Sampling methods are of two main types: (1) probability sampling and (2) non-probability sampling.

Probability sampling: known probability or chance of samples to be selected.
   Simple random sampling: each sample has an equal chance of being selected.
   Stratified sampling: proportionate stratified sampling or disproportionate stratified sampling.
   Systematic sampling: random selection of the nth sample and then selecting every nth sample in succession.
   Cluster sampling: one-stage, two-stage, or multi-stage cluster sampling.
   Multistage sampling: combination of two or more sampling methods.

Non-probability sampling: no known probability or chance of samples to be selected; considered as “sampling disasters”.
   Volunteer samples
   Convenience (haphazard) samples

Required Sample Size

Sample size can be determined either through a subjective approach or through a mathematical approach.


Using Minitab

Cochran’s sample size formula

In order to work on Cochran’s sample size formula in Minitab, we can develop an example. A researcher wants to know, with at least 5% precision, how many people in a city of 10,000 people have cars. Suppose we assume, at a 95% confidence level, that 60% of all the people have cars. Then the z-value at the 95% confidence level is 1.96 per the normal table. Go to the “Calc” tab above, and click “Calculator”.

In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C10. In the box below “Expression:” enter “(((1.96)^2)*0.6*0.4)/0.05^2”. Click “OK.” The result will appear in the spreadsheet in the column C10. In this case, the result (sample size) is 368.794 or simply 369. This result is according to Cochran’s sample size formula.

Cochran’s sample size formula when the population size is small

In order to work on Cochran’s sample size formula when the population size is small, for example 1000 people, go to the “Calc” tab above, and click “Calculator”.

In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C11. In the box below “Expression:” enter “369/(1+(368/1000))”. Click “OK.” The result will appear in the spreadsheet in the column C11. In this case, the result (sample size) is 269.737 or simply 270. This result is according to the modification of Cochran’s sample size formula when the population size is small.

Slovin’s formula

In order to work on Slovin’s formula in Minitab, go to the “Calc” tab above, and click “Calculator”.

In the window “Calculator”, in the “Store result in variable:” in the right box, enter a column, such as C12. In the box below “Expression:” enter “10000/(1+10000*0.05^2)”. Click “OK.” The result will appear in the spreadsheet in the column C12. In this case, the result (sample size) is 384.615 or simply 385. This result is according to Slovin’s formula.
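The three Calculator expressions can be written as small reusable functions. The sketch below is in Python; the function names are hypothetical, and the inputs (z = 1.96, p = 0.6, e = 0.05, N = 1000 or 10000) are the ones used in the examples above.

```python
import math

def cochran(z, p, e):
    """Cochran's sample size: z^2 * p * (1 - p) / e^2."""
    return z ** 2 * p * (1 - p) / e ** 2

def cochran_small_population(n0, population):
    """Cochran's modification for a small population: n0 / (1 + (n0 - 1) / N)."""
    return n0 / (1 + (n0 - 1) / population)

def slovin(population, e):
    """Slovin's formula: N / (1 + N * e^2)."""
    return population / (1 + population * e ** 2)

n0 = cochran(z=1.96, p=0.6, e=0.05)
n0_rounded = math.ceil(n0)                     # round up to whole people
n_small = cochran_small_population(n0_rounded, population=1000)
n_slovin = slovin(population=10000, e=0.05)

print(round(n0, 3), n0_rounded, round(n_small, 3), round(n_slovin, 3))
```

Rounding up with `math.ceil` is a design choice made here so that the sample is never smaller than the formula requires.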

Simple Random Sampling

It has a population with N items or objects, and a sample with n items or objects. It can be done with replacement or without replacement.


Cluster Sampling

In cluster sampling, a population is divided into separate groups known as clusters, and then simple random sampling of clusters (i.e. sampled clusters) is done.


Stratified Sampling

In stratified sampling, a population is divided into separate groups known as strata, and then simple random sampling of units/items from each stratum is done.
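The difference between simple random, cluster, and stratified sampling can be sketched in Python with the standard `random` module. The population, group names, and sample sizes below are made up purely for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 20 numbered units
population = list(range(1, 21))

# Simple random sampling without and with replacement
without_replacement = rng.sample(population, 5)
with_replacement = rng.choices(population, k=5)

# Hypothetical grouping of the population into four groups of five units
groups = {
    "G1": population[0:5],
    "G2": population[5:10],
    "G3": population[10:15],
    "G4": population[15:20],
}

# Cluster sampling: randomly pick whole groups and keep all of their units
chosen_clusters = rng.sample(sorted(groups), 2)
cluster_sample = [unit for name in chosen_clusters for unit in groups[name]]

# Stratified sampling: randomly pick units from every group
stratified_sample = {name: rng.sample(units, 2) for name, units in groups.items()}

print(without_replacement, with_replacement)
print(chosen_clusters, cluster_sample)
print(stratified_sample)
```

In the cluster sample only two groups are represented, but completely; in the stratified sample every group is represented, but only partly.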


Graphs and Plots

Bar Graph

A bar graph represents different categories with bars on a graph.

Using Minitab

In order to develop Bar Graph, we can use the example (as noted above).

Reason of travelling           Percentage of people
Visiting family                25
Spending time with friends     15
Work-related                   18
Personal issues                10
Escaping the colder climate    6
Discovering a new culture      11
Leisure                        15

Enter the values in the columns. In this case, we have “Reason of travelling” in the first column and “Percentage of people” in the second column. Now go to the “Graph” tab above and go to “Bar Chart” in the 4th section. A window “Bar Charts” appears. Under “Bars represent:”, select “A function of a variable” from the drop down menu. Under “One Y”, select “Simple”, and click “OK.” A window “Bar Chart: A function of a variable, One Y, Simple” appears. Under “Function”, choose the function of the graph variable. In this case, “Mean” is selected as a function. Click in the box below “Graph variables:” and move the variables having numerical data. In this case, C2 is moved to this box. Click in the box below “Categorical variable:” and move the categorical data. In this case, C1 is moved to this box by clicking on the “Select” button below. You can also work on “Labels”, “Chart Options”, etc. Now click “OK.”


A chart appears as follows:

[Bar chart: “Chart of Mean( Percentage of people )”; y-axis: Mean of Percentage of people (0 to 25); x-axis: Reason of travelling, with one bar per reason.]

Pie Chart

Pie chart, also known as pie graph, is a statistical graphical chart that is presented in a circular form in which numerical proportions are divided into slices.


Using Minitab

In order to develop Pie Chart, we can use the example (as noted above).

Cost of construction of house             Supposed amount spent in USD   Percentages
Labor                                     60000                          20
Timber                                    18000                          6
Electrical works                          18000                          6
Supervision                               90000                          30
Steel                                     30000                          10
Bricks                                    30000                          10
Cement                                    27000                          9
Design and fee for Engineer/Architect     15000                          5
Other                                     12000                          4

Enter the values in the columns. In this case, we have “Cost of construction of house” (representing the categorical variable) in the first column, “Supposed amount spent in USD” in the second column, and “Percentages” in the third column. Now go to the “Graph” tab above and go to “Pie Chart” in the 4th section. A window “Pie Chart” appears. Select “Chart values from a table”, and in the box below “Categorical variable:” move the categorical data. In this case, C1 is moved to this box by clicking on the “Select” button below. In the box below “Summary variables:”, move the variables having numerical data. In this case, C2 is moved to this box. You can also work on “Labels”, “Data Options”, etc. Now click “OK.” A chart appears as follows:


[Pie chart: “Pie Chart of Cost of construction of house”, with one slice per category: Labor, Timber, Electrical works, Supervision, Steel, Bricks, Cement, Design and fee for Engineer/Architect, Other.]

In Minitab®, the user can move the cursor over a slice to see its percentage.

Scatter plot or Bubble chart

A scatter plot or scatter chart graphically displays the relationship between two variables (two sets of data) in the form of dots.


A scatter chart is called a bubble chart if the data points (dots) in the graph are replaced with circles (bubbles) of variable size and characteristics, for example circles with only an outline and no color inside.

Different patterns of data in bubble charts

Bubble charts could be of different types depending on linearity, slope, and strength.


Using Minitab

In order to develop Scatter Plot, we can use the example (as noted above).

Year    Number of tourists
2011    1167000
2012    966000
2013    565212
2014    530000
2015    563400
2016    965498
2017    1750000

Enter the values in the columns. In this case, we have “Year” in the first column (i.e. C1), and “Number of tourists” in the second column (i.e. C2). Now go to the “Graph” tab above and go to “Scatterplot” in the 1st section. A window “Scatterplots” appears. Select “Simple” and click “OK.” Now, click on the box below “X variables” in front of “1”; click on “C1 Year”, and press “Select.” Similarly, click on the box below “Y variables” in front of “1”; click on “C2 Number of tourists”, and press “Select.” You can also work on “Labels”, “Data Options”, etc. Now click “OK.” A simple chart appears as follows:


[Scatter plot: “Scatterplot of Number of tourists vs Year”; x-axis: Year (2011 to 2017); y-axis: Number of tourists (500000 to 1750000).]

In order to draw a bubble chart, go to the “Graph” tab above and go to “Bubble plot” in the 1st section. A window “Bubble plots” appears. Select “Simple” and click “OK.” Now, click on the box below “Y variable:”, click on “C2 Number of tourists”, and press “Select.” Similarly, click on the box below “X variable:”, click on “C1 Year”, and press “Select.” Now, click on the box below “Bubble size variable:”, click on “C2 Number of tourists”, and press “Select.” You can also work on “Labels”, “Data Options”, etc. Now click “OK.” A simple chart appears.


Dot plot

Dot plot displays graphical data in the form of dots.


Using Minitab

In order to develop Dot Plot, we can use the example (as noted above).

House number    Number of siblings
1               4
2               3
3               1
4               0
5               5
6               2
7               2

We have to slightly modify this as follows:

The no. of siblings in houses
1 1 1 1 2 2 2 3 5 5 5 5 5 6 6 7 7

In this table, the house numbers are mentioned according to the number of siblings. For example, there are four siblings in the first house, so the house number, i.e. 1, is mentioned four times; there are three siblings in the second house, so the house number, i.e. 2, is mentioned three times; there is no sibling in the fourth house, so the house number, i.e. 4, is not mentioned; and there are five siblings in the fifth house, so the house number, i.e. 5, is mentioned five times.

Enter the values in the column. In this case, we have “House number” in the first column. Now go to the “Graph” tab above and go to “Dotplot” in the 2nd section. A window “Dotplots” appears. Select “Simple” under “One Y,” and click “OK.” (Though other options are also available.) Now, click on “C1 House number”, and press “Select” to move it to the box under “Graph variables:”. You can also work on “Labels”, “Data Options”, etc. Now click “OK.” A simple chart appears.
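The modification of the table, i.e. repeating each house number once per sibling, can be done mechanically. A small Python sketch with illustrative names, including a rough text version of the dot plot:

```python
# House number -> number of siblings, from the original table
siblings_per_house = {1: 4, 2: 3, 3: 1, 4: 0, 5: 5, 6: 2, 7: 2}

# Repeat each house number once for every sibling in that house
expanded = [
    house
    for house, siblings in siblings_per_house.items()
    for _ in range(siblings)
]
print(expanded)

# Rough text dot plot: one dot per repeated house number
for house, siblings in siblings_per_house.items():
    print(f"{house} | " + "." * siblings)
```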


Matrix plot

Matrix plot is used to study the association of several pairs of variables at a time.

Suppose we have data with four variables (A, B, C, and D).

[Figure: matrix plot. Each panel is a graphical representation of one variable against another (A vs. B, B vs. C, and so on) in the form of dots. The position of a dot refers to its value on the x-axis and y-axis of the two specific variables.]

Using Minitab

In order to develop Matrix Plot, we can use the same example (as noted above).

A    B      C      D
1    3.6    4.9    4.1
1    5.0    3.5    4.1
1    4.7    4.1    4.4
1    6.1    7.3    5.3
2    7.6    5.2    2.4
2    6.9    4.7    4.0
2    5.6    5.0    5.4
2    5.0    5.7    5.8

Enter the values in the columns. In this case, we have “A” in the first column, “B” in the second column, “C” in the third column, and “D” in the fourth column. Now go to the “Graph” tab above and go to “Matrix Plot”. A window “Matrix Plots” appears. Select “Simple” under “Matrix of plots,” and click “OK.” (Though other options are also available.) Now, select the columns “A”, “B”, “C”, and “D” and press “Select” to move them to the box under “Graph variables:”. You can also work on “Labels”, “Data Options”, etc. Now click “OK.” A simple chart appears.


Pareto Chart

Pareto chart, also known as Pareto distribution diagram, is used for improving the most advantageous or beneficial areas with the help of the identification of commonly experienced defects or causes of customer complaints.

It is based on the Pareto principle, which states that about 80% of the output is associated with 20% of the input.

Example

Suppose a hotel gets the following complaints along with their counts:

[Figure: Pareto chart of the hotel complaints and their counts.]

This graph shows that improvements in “Difficult parking”, “Cleaning issues”, and “Too noisy” could solve about 80% of the complaints.


Using Minitab

In order to develop Pareto Chart, we can use the same example (as noted above).

Complaint                        Counts
Difficult parking                175
Bad behavior of receptionists    25
Poor lighting                    11
Room service is not good         21
Cleaning issues                  145
Overpriced                       12
Too noisy                        32
Confusing layout                 16

Enter the values in the columns. In this case, we have “Complaint” in the first column, and “Counts” in the second column. Now go to the “Stat” tab above, go to “Quality Tools”, and go to “Pareto Chart”. A window “Pareto Chart” appears. Move “Complaint” to the box with “Defects or attribute data in:” by selecting it and pressing the “Select” button, and move “Counts” to the box with “Frequencies in:” by selecting it and pressing the “Select” button. You can also work on labels and title by pressing the “Options” button. Now click “OK.” A simple chart appears.
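The numbers behind a Pareto chart are the counts sorted from largest to smallest together with their cumulative percentages. A Python sketch of that calculation for the hotel complaints (variable names are illustrative):

```python
complaints = {
    "Difficult parking": 175,
    "Bad behavior of receptionists": 25,
    "Poor lighting": 11,
    "Room service is not good": 21,
    "Cleaning issues": 145,
    "Overpriced": 12,
    "Too noisy": 32,
    "Confusing layout": 16,
}

# Sort complaints from most to least frequent
ordered = sorted(complaints.items(), key=lambda item: item[1], reverse=True)
total = sum(complaints.values())

# Running total expressed as a percentage of all complaints
running = 0
cumulative_percent = []
for name, count in ordered:
    running += count
    cumulative_percent.append((name, 100 * running / total))

for name, percent in cumulative_percent:
    print(f"{name:<32}{percent:6.1f}")
```

The first three rows reach roughly 80%, which is the basis of the statement that fixing those three complaints could solve about 80% of the problems.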


Histogram

A histogram is a graph that groups numbers into ranges.


Using Minitab

In order to develop Histogram, we can use the example (as noted above).

House number    Number of siblings
1               4
2               3
3               1
4               0
5               5
6               2
7               2

Enter the values in the columns. In this case, we have “House number” in the first column, and “Number of siblings” in the second column. Now go to the “Graph” tab above and go to “Histogram” in the second section. A window “Histograms” appears. Select “Simple” from the window, and click “OK.” In the box of “Graph variables:” move the column having numerical data. In this case, “C2 Number of siblings” has the numerical data, so it is moved to the box by clicking on the button “Select.” You can also work on “Labels”, “Data Options”, etc. Click “OK” again in the window of “Histogram: Simple.” A simple graph appears as shown below:


[Histogram: “Histogram of Number of siblings”; x-axis: Number of siblings (0 to 5); y-axis: Frequency (0.0 to 2.0).]

Stem and Leaf plot

In a stem and leaf plot, the data values are split into the first digit or digits showing the “stem” and the last digit showing the “leaf”.


Suppose we have the following numbers:

15,14,17,12,23,24,29,35,47,41,59,62,67,69

For example, 15 splits into 1 (stem) and 5 (leaf), and 69 splits into 6 (stem) and 9 (leaf). In this way, large amounts of data can quickly be organized.

Using Minitab

In order to develop Stem and Leaf plot, we can use the example (as noted above): 15,14,17,12,23,24,29,35,47,41,59,62,67,69 Enter the values in the first column, i.e. C1 in the rows from 1 to 14.


Now go to the “Graph” tab above and go to “Stem-and-Leaf” in the 2nd section. A window “Stem-and-Leaf” appears. In the box of “Graph variables:” move the column C1 by clicking on the button “Select.” Click “OK”. Results appear in the screen above the spreadsheet.

Stem-and-leaf of C1  N = 14

4    1    2457
7    2    349
7    3    5
6    4    17
4    5    9
3    6    279

Leaf Unit = 1
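The splitting of each value into a stem (tens digit) and a leaf (units digit) can be sketched in Python. This sketch prints only the stem and leaf columns, not the count column shown at the left of the Minitab output; the names are illustrative.

```python
from collections import defaultdict

data = [15, 14, 17, 12, 23, 24, 29, 35, 47, 41, 59, 62, 67, 69]

# Group the units digits (leaves) under their tens digits (stems)
stems = defaultdict(list)
for value in sorted(data):
    stems[value // 10].append(value % 10)

lines = [
    f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
    for stem, leaves in sorted(stems.items())
]
print("\n".join(lines))
```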

Box plot

The box plot (also known as box and whisker diagram) is a graphical way of representing the distribution of data on the basis of the five number summary: (1) minimum, (2) first quartile, (3) median, (4) third quartile, and (5) maximum.


Using Minitab

In order to generate boxplot, we can use the following example:

                               School 1    School 2    School 3
Number of desks in class 1     54          42          56
Number of desks in class 2     60          48          68
Number of desks in class 3     65          53          78
Number of desks in class 4     66          54          80
Number of desks in class 5     67          55          82
Number of desks in class 6     69          57          86
Number of desks in class 7     70          58          88
Number of desks in class 8     72          60          92
Number of desks in class 9     73          61          94
Number of desks in class 10    75          63          98

Enter the values in the first four columns, i.e. from C1 to C4 (the class labels in C1 and the three schools in C2 to C4). Now go to the “Graph” tab above and go to “Boxplot” in the 3rd section. A window “Boxplots” appears. Select “Simple” under “Multiple Y’s” and click “OK.” (Though other options are also available.) Now, click on “C2 School 1”, and press “Select” to move it to the box under “Graph variables:”. Do the same for “C3 School 2” and “C4 School 3”. You can also work on “Labels”, “Data Options”, etc. For example, click “Data View”. In the window “Boxplot: Data View”, select “Interquartile range box”, “Outlier symbols”, and “Median symbol.” Now click “OK.” Again click “OK.” A simple chart appears as follows:


[Box plot: “Boxplot of School 1, School 2, School 3”; y-axis: Data (40 to 100); one box each for School 1, School 2, and School 3.]
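The five number summary behind each box can be computed directly. The Python sketch below uses the School 1 column; note that statistical packages use slightly different quartile conventions, so the `method="exclusive"` setting here is an assumption and the quartiles may differ a little from those drawn by a given program.

```python
import statistics

# Number of desks in classes 1 to 10 of School 1
school_1 = [54, 60, 65, 66, 67, 69, 70, 72, 73, 75]

# Quartiles using the "exclusive" convention, i.e. positions based on (n + 1)
q1, median, q3 = statistics.quantiles(school_1, n=4, method="exclusive")

five_number_summary = (min(school_1), q1, median, q3, max(school_1))
print(five_number_summary)
```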

Outlier

A value that lies at an abnormal distance from (outside of the) other values.

An outlier may show some sort of problem in the data, such as:

A case (associated with that value at abnormal distance) that is not according to the model under study. For example, a supercomputer in a lab of personal computers would be considered an outlier.

An error that occurred during measurement.

Box plots are commonly used to find or display outliers.

[Figures: a scatter plot with one point far from the others, and a chart of time worked (hours) from Monday to Sunday.]

Using Minitab:

Suppose we have the following values:

342.962, 346.033, 345.917, 344.717, 343.048, 345.855, 344.717, 395.548, 345.464, 345.703

Put these values in the first column, i.e. C1. Now go to the “Stat” tab above, go to “Basic Statistics”, and go to “Outlier Test” in the 6th section. Move the column, i.e. C1, to the box under “Variables:” by selecting it and pressing the button “Select”. Now click “OK”. The following results and graph appear.

Grubbs' Test

| Variable | N | Mean | StDev | Min | Max | G | P |
|---|---|---|---|---|---|---|---|
| C1 | 10 | 350.00 | 16.04 | 342.96 | 395.55 | 2.84 | 0.000 |

Outlier

| Variable | Row | Outlier |
|---|---|---|
| C1 | 8 | 395.548 |
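The Grubbs statistic G in this output is the largest absolute deviation from the mean divided by the sample standard deviation. The short Python sketch below recomputes only G and the most extreme value. It does not reproduce the p-value, and the function name `grubbs_statistic` is hypothetical.

```python
import statistics

values = [342.962, 346.033, 345.917, 344.717, 343.048,
          345.855, 344.717, 395.548, 345.464, 345.703]


def grubbs_statistic(data):
    """Return (G, suspect), where suspect is the value farthest from the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1)
    suspect = max(data, key=lambda x: abs(x - mean))
    return abs(suspect - mean) / sd, suspect


g, suspect = grubbs_statistic(values)
print(f"G = {g:.2f}, most extreme value = {suspect}")
```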


These results show that 395.548 is an outlier. Another way is to use the box plot. Now go to the “Graph” tab above and go to “Boxplot” in the 3rd section. A window “Boxplots” appears. Select “Simple” under “One Y” and click “OK” (though other options are also available). Move C1 (the column containing all the values) to the box under “Graph variables:”. You can also work on “Labels”, “Data Options”, etc. For example, click “Data View”. In the window “Boxplot: Data View”, select “Interquartile range box” and “Outlier symbols”. Now click “OK”. Again click “OK”. A simple chart appears as follows:

[Figure: boxplot of the values in C1.]


This graph shows that one value, which is close to 395, is far away from other values, i.e. an outlier.

Quantile

This word is from “quantity”. Usually, a quantile is obtained by dividing a sample into adjacent and equal-sized subgroups, or by dividing a probability distribution into areas or intervals of equal probability.


Example for Quantiles

Suppose we have ten numbers: 3, 2, 5, 6, 7, 8, 1, 10, 9, 4. Arranged in ascending order, these numbers are as follows:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10

[Figure: the ordered numbers marked with the quartiles (lower quartile Q1, median Q2, upper quartile Q3), the deciles (1st to 9th), and the percentiles (5th, 10th, 20th, …, 90th, 95th).]

There are
• Three quartiles, i.e. Q1, Q2, and Q3
• Nine deciles, i.e. D1, D2, D3, …, D9
• Ninety-nine percentiles, i.e. P1, P2, …, P99

Each shows a specific percentage of the data, e.g. 75% of the data falls below Q3, 30% of the data falls below D3, 80% of the data falls below P80, and so on.
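As a rough illustration of these cut points, the sketch below computes the quartiles and deciles of the ten numbers in Python. It assumes one particular convention, the "exclusive" method of `statistics.quantiles`, which places the p-th quantile at position p(n + 1) in the sorted data. Other software or conventions may give slightly different values.

```python
import statistics

numbers = [3, 2, 5, 6, 7, 8, 1, 10, 9, 4]

# statistics.quantiles sorts the data internally.
quartiles = statistics.quantiles(numbers, n=4, method="exclusive")   # Q1, Q2, Q3
deciles = statistics.quantiles(numbers, n=10, method="exclusive")    # D1 ... D9

print("Quartiles:", quartiles)
print("Deciles:", deciles)
```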


Using Minitab

Quartiles

In order to work on quartiles, we can use the example noted above, i.e. 3,2,5,6,7,8,1,10,9,4. Enter the values in the first column, i.e. in rows 1 to 10 of C1. Go to the “Stat” tab above, go to “Basic Statistics”, and click “Display Descriptive Statistics”. In the window “Display Descriptive Statistics”, move column 1, i.e. C1 in this case, from the left box to “Variables:” in the right box by clicking on C1 and clicking the “Select” button below. Click “Statistics”. A window “Display Descriptive Statistics: Statistics” appears. Check “First quartile”, “Median”, “Third quartile”, and any other descriptive statistics you want. Click “OK”. Again click “OK”. Results appear in the screen above the spreadsheet. In this case, the results are Q1 = 2.750, median = 5.5, and Q3 = 8.250.

Deciles

In order to work on deciles, go to the “Calc” tab above and click “Calculator”. In the window “Calculator”, in “Store result in variable:” in the right box, enter a column, such as C7. In the box below “Expression:”, enter “PERCENTILE(C1,0.10)”. Click “OK”. The result will appear in the spreadsheet in column C7. In this case, the result is 1.1. For a specific decile, for example the 7th decile, enter “PERCENTILE(C1,0.7)”. Click “OK”. The result will appear in the spreadsheet in column C7. In this case, the result is 7.7.

Percentiles

In order to work on percentiles, go to the “Calc” tab above and click “Calculator”. In the window “Calculator”, in “Store result in variable:” in the right box, enter a column, such as C7. In the box below “Expression:”, enter “PERCENTILE(C1,0.01)”. Click “OK”. The result will appear in the spreadsheet in column C7. In this case, the result is 1. For a specific percentile, for example the 70th percentile, enter “PERCENTILE(C1,0.7)”. Click “OK”. The result will appear in the spreadsheet in column C7. In this case, the result is 7.7.

Standard deviation and Variance

Standard deviation helps in knowing the variability of the observations, i.e. the spread of the numbers about the mean. It is represented by the Greek letter sigma (σ). A low standard deviation shows that the values are close to the mean, i.e. close to the normal or required range. σ can be obtained by taking the square root of the variance, which is represented by σ². So, we have to calculate the variance in order to calculate the standard deviation. Variance is the average of the squared deviations about the mean. It is also important to note that standard deviation and variance are of two types, i.e. population standard deviation and population variance, and sample standard deviation and sample variance, respectively.


Using Minitab

In order to calculate sample variance and standard deviation, we can use the following example:

| Number of flowering plant | Number of flowers on the plant |
|---|---|
| 1 | 5 |
| 2 | 7 |
| 3 | 6 |
| 4 | 4 |
| 5 | 7 |
| 6 | 9 |
| 7 | 11 |

Enter the data in the columns. In this case, we enter the data in the first two columns, i.e. “Number of flowering plant” in the first column and “Number of flowers on the plant” in the second column. Now go to the “Stat” tab above, go to “Basic Statistics”, and click “Display Descriptive Statistics”. Move the column for which variance and standard deviation have to be calculated, in this case “C2 Number of flowers on the plant”, to the box “Variables:” and click “Statistics”. Select “Standard deviation” and “Variance” in the window that appears. You can also select several other options for descriptive statistics. Click “OK”. Click “OK” again in the “Display Descriptive Statistics” window. This shows the results for StDev (standard deviation), i.e. 2.380, and Variance, i.e. 5.667. However, it is important to note that these results are the sample variance and the sample standard deviation.
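The difference between the sample and population versions can be seen with a few lines of Python. This is only an illustrative sketch using the standard `statistics` module: the sample versions divide the sum of squared deviations by n − 1, and the population versions divide by n.

```python
import statistics

flowers = [5, 7, 6, 4, 7, 9, 11]  # number of flowers on each plant

sample_variance = statistics.variance(flowers)        # divides by n - 1
sample_sd = statistics.stdev(flowers)
population_variance = statistics.pvariance(flowers)   # divides by n
population_sd = statistics.pstdev(flowers)

print(f"sample:     variance = {sample_variance:.3f}, sd = {sample_sd:.3f}")
print(f"population: variance = {population_variance:.3f}, sd = {population_sd:.3f}")
```

The sample values agree with the StDev and Variance reported above, while the population values are somewhat smaller.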


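Range Rule of Thumb

It gives a rough estimate of the standard deviation from the range of data obtained from known samples or a population; a common form of the rule takes the standard deviation as roughly one quarter of the range.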

Probability

Probability is a measure of how likely it is that some event will occur. Some of the basic rules of probability are as follows:

• Probability Rule # 1 (Probability): For any event A, the probability P(A) satisfies 0 ≤ P(A) ≤ 1. Probability can be as low as zero or as high as 1. Here P(A) = Number of favorable outcomes / Total number of outcomes = m/n.

• Probability Rule # 2 (Probability model): The probability model for a sample space S satisfies P(S) = 1. The sum of the probabilities of all possible outcomes is 1.

• Probability Rule # 3 (Complement Rule): For an event A, P(not A) = 1 – P(A). The probability of the happening of an event that is the complement of event A (i.e. not A) is equal to 1 minus the probability of the happening of event A.

• Probability Rule # 4 (The Addition Rule for Disjoint Events): If A and B are two disjoint or mutually exclusive events, P(A or B) = P(A) + P(B).

• Probability Rule # 5 (The General Addition Rule): P(A or B) = P(A∪B) = P(A) + P(B) – P(A and B). The probability that event A or event B will occur is equal to the probability that event A occurs plus the probability that event B occurs minus the probability of event A and event B occurring simultaneously, i.e. P(A and B) = P(A∩B). This rule also applies when A and B are not disjoint.

• Probability Rule # 6 (The Multiplication Rule for Independent Events): If A and B are two independent events, then P(A and B) = P(A∩B) = P(A) × P(B).

• Probability Rule # 7 (Conditional Probability Rule): The conditional probability of event B, given event A, is P(B | A) = P(A and B) / P(A). In P(B | A), the vertical line stands for “given”; A is the given or already occurred event, and B is the event that may or may not occur.

• Probability Rule # 8 (General Multiplication Rule): For any two events A and B, P(A and B) = P(A∩B) = P(A) × P(B | A).
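The rules can be checked on a small example. The sketch below assumes one roll of a fair six-sided die, which is not an example from the text; the events A, B, C, and D and the helper `p` are hypothetical names chosen only for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (assumed example)


def p(event):
    """Probability = favorable outcomes / total outcomes (Rule 1)."""
    return Fraction(len(event & sample_space), len(sample_space))


A = {2, 4, 6}        # even number
B = {1, 3}           # disjoint from A
C = {4, 5, 6}        # greater than 3, overlaps A
D = {1, 2, 3, 4}     # independent of A

rule1 = 0 <= p(A) <= 1
rule2 = p(sample_space) == 1
rule3 = p(sample_space - A) == 1 - p(A)            # complement rule
rule4 = p(A | B) == p(A) + p(B)                    # disjoint events
rule5 = p(A | C) == p(A) + p(C) - p(A & C)         # general addition rule
rule6 = p(A & D) == p(A) * p(D)                    # independent events
p_c_given_a = p(A & C) / p(A)                      # Rule 7: P(C | A)
rule8 = p(A & C) == p(A) * p_c_given_a             # general multiplication rule

print(rule1, rule2, rule3, rule4, rule5, rule6, p_c_given_a, rule8)
```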

Bayesian statistics

Bayesian statistics is named after Thomas Bayes (1702-1761), an English priest. Bayes arrived at a theorem with the help of which a hypothesis can be evaluated on the basis of the observation of its consequences.


An easy example for Bayes Theorem

Suppose about 40% of restaurants serve burgers:
P(Burger|Restaurant) = 0.4 (probability of burger given that there is a restaurant)

But burgers are commonly found (about 80% of all markets contain burgers):
P(Burger) = 0.8 (probability of burgers)

And restaurants are fairly common (suppose 70% of all markets have a restaurant):
P(Restaurant) = 0.7 (probability of restaurant)

What is the probability that we see a restaurant when we see burgers, i.e. P(Restaurant|Burger), the probability of restaurant given burgers?

P(Restaurant|Burger) = [P(Restaurant) × P(Burger|Restaurant)] / P(Burger) = (0.7 × 0.4) / 0.8 = 0.35

So there is a 35% chance of finding restaurants when there are burgers.
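The same calculation can be written as a small function. This is a sketch only: the function name `bayes` and its parameter names are hypothetical, and the consistency check is an added safeguard, based on the fact that the joint probability P(H) × P(E|H) can never exceed P(E).

```python
def bayes(prior, likelihood, evidence):
    """Return P(H|E) = P(H) * P(E|H) / P(E).

    prior      = P(H), e.g. P(Restaurant)
    likelihood = P(E|H), e.g. P(Burger|Restaurant)
    evidence   = P(E), e.g. P(Burger)
    """
    for value in (prior, likelihood, evidence):
        if not 0 <= value <= 1:
            raise ValueError("probabilities must lie between 0 and 1")
    if evidence == 0:
        raise ValueError("P(E) must be greater than 0")
    joint = prior * likelihood  # P(H and E)
    if joint > evidence:
        raise ValueError("inconsistent inputs: P(H and E) exceeds P(E)")
    return joint / evidence


p_restaurant_given_burger = bayes(prior=0.7, likelihood=0.4, evidence=0.8)
print(f"P(Restaurant|Burger) = {p_restaurant_given_burger:.2f}")
```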


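Another example: Suppose seeds are common (15%) but plants are scarce (5%) due to the small area of land, and about 85% of all seeds produce plants. Then

P(seed|plant) = [P(seed) × P(plant|seed)] / P(plant) = [15% × 85%] / 5% = 255%

A probability cannot exceed 100%, so these three figures cannot all hold at once: P(seed and plant) = 15% × 85% = 12.75% would be larger than P(plant) = 5%. The inputs to Bayes' theorem must satisfy P(seed) × P(plant|seed) ≤ P(plant). With a consistent figure, for example only 25% of all seeds producing plants, we get P(seed|plant) = [15% × 25%] / 5% = 75%, i.e. the probability of seed when there is a plant is 75%.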


Using Minitab

The “Calculator” function could be used for Bayesian statistics, i.e. enter the formula associated with Bayesian statistics in the box below “Expression:” in the “Calculator”.

Reliability Coefficient

Reliability refers to the consistency of a measure, or the trust in or dependence on a measure. Reliability coefficient refers to the degree of consistency or accuracy of a measuring instrument or a test. Usually, the reliability coefficient is determined by the correlation of two different sets of measures. Different ways are used to assess reliability coefficients, for example, inter-rater reliability, which is usually measured by Cohen’s kappa coefficient; internal consistency reliability, which is usually measured by Cronbach's alpha; and test-retest reliability, which can be measured by Pearson's correlation coefficient.

Cohen's Kappa Coefficient

It is used to determine inter-rater or inter-observer agreement for qualitative/categorical items.


Fleiss’ Kappa Coefficient

It is used to determine inter-rater or inter-observer agreement for Likert scale data, nominal scale data, or ordinal scale data. It also considers the chance agreement, and it is determined when a trial on each sample is rated by three or more different observers/raters.

The coefficient ranges from 0 to 1:
• 1 shows perfect agreement,
• a coefficient > 0.75 is considered good, and
• 0 shows agreement similar to chance.

The acceptable level of agreement depends on the field of research.


Using Minitab

In order to work on Minitab, we can use the same example as noted above.

| | B: Yes | B: No |
|---|---|---|
| A: Yes | 40 | 10 |
| A: No | 20 | 30 |

We set the codes for observers A and B (assessors or appraisers) and children 1 to 100 (samples or attributes). So, in this example, observer A has the code 1, and observer B has the code 2. Children have the codes from 1 to 100. The response or assessment is noted as “Yes” or “No”.

In the first column, i.e. C1, enter “Assessor” and enter the code 1 in the first 100 rows, and 2 in the rows from 101 to 200. In order to repeat the number 1 in the first 100 rows, enter 1 in the first cell, take the pointer to the lower right corner of the cell, where the pointer changes to the Autofill handle, click and drag the Autofill handle down the column to the 100th row, and then release the mouse button. Do the same thing after writing the number 2 in the 101st row and dragging it to the 200th row.

In the second column, i.e. C2, enter “Children” and enter the numbers from 1 to 100 in the rows from 1 to 100, and again enter the numbers from 1 to 100 in the rows from 101 to 200. In order to add the series of numbers from 1 to 100 in the first 100 rows, enter 1 in the first cell, take the pointer to the lower right corner of the cell, where the pointer changes to the Autofill handle, then click, press the “Ctrl” key on the keyboard, and drag the Autofill handle down the column to the 100th row, and then release the mouse button. Do the same thing after writing the number 1 in the 101st row and dragging it to the 200th row. Press the “Home” button on the keyboard. It will take you back to the top.

In the third column, i.e. C3, enter “Response”, and enter “Yes” in the first 40 rows, “No” in the rows from 41 to 70 (i.e. 30 rows), “Yes” in the rows from 71 to 80 (i.e. 10 rows), and “No” in the rows from 81 to 100 (i.e. 20 rows). These responses are for observer A. Now, enter “Yes” in the rows from 101 to 140 (i.e. 40 rows), “No” in the rows from 141 to 180 (i.e. 30+10 rows), and “Yes” in the rows from 181 to 200 (i.e. 20 rows). These responses are for observer B. You can enter “Yes” or “No” in a cell, and drag the Autofill handle in the lower right corner of the cell over the required number of cells (rows) in the column.

Now go to the “Stat” tab above, go to “Quality Tools”, and click “Attribute Agreement Analysis” in the 5th section. In the window that appears (i.e. Attribute Agreement Analysis), move “C3 Response” to the box “Attribute Column:”, move “C2 Children” to the box “Samples:”, and move “C1 Assessor” to the box “Appraisers:” by clicking the “Select” button below. Now click the button “Options”, select “Calculate Cohen’s kappa if appropriate” by checking the box, click “OK”, and again click “OK” in the “Attribute Agreement Analysis” window. This gives the results for Fleiss’ Kappa Statistics (0.39) and Cohen’s Kappa Statistics (0.4) as follows:

Fleiss’ Kappa Statistics

| Response | Kappa | SE Kappa | Z | P(vs > 0) |
|---|---|---|---|---|
| No | 0.393939 | 0.1 | 3.93939 | 0.0000 |
| Yes | 0.393939 | 0.1 | 3.93939 | 0.0000 |

Cohen’s Kappa Statistics

| Response | Kappa | SE Kappa | Z | P(vs > 0) |
|---|---|---|---|---|
| No | 0.4 | 0.0979796 | 4.08248 | 0.0000 |
| Yes | 0.4 | 0.0979796 | 4.08248 | 0.0000 |
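Both kappa values can be reproduced by hand from the 2 × 2 table of counts. The sketch below is a plain Python illustration under the usual textbook definitions: kappa = (observed agreement − chance agreement) / (1 − chance agreement), where Cohen's version multiplies each observer's own proportions and Fleiss' version uses proportions pooled over both observers. The variable names are hypothetical, and standard errors and p-values are not computed.

```python
# counts[(a, b)] = number of children rated a by observer A and b by observer B
counts = {("Yes", "Yes"): 40, ("Yes", "No"): 10,
          ("No", "Yes"): 20, ("No", "No"): 30}
categories = ["Yes", "No"]
n = sum(counts.values())

# Proportion of children on which both observers agree.
observed = sum(counts[(c, c)] for c in categories) / n

# Proportion of each response for observer A and for observer B.
a_prop = {c: sum(counts[(c, b)] for b in categories) / n for c in categories}
b_prop = {c: sum(counts[(a, c)] for a in categories) / n for c in categories}

# Cohen: chance agreement from each observer's own proportions.
chance_cohen = sum(a_prop[c] * b_prop[c] for c in categories)
cohen_kappa = (observed - chance_cohen) / (1 - chance_cohen)

# Fleiss: chance agreement from proportions pooled over both observers.
pooled = {c: (a_prop[c] + b_prop[c]) / 2 for c in categories}
chance_fleiss = sum(pooled[c] ** 2 for c in categories)
fleiss_kappa = (observed - chance_fleiss) / (1 - chance_fleiss)

print(f"Cohen's kappa  = {cohen_kappa:.4f}")
print(f"Fleiss' kappa = {fleiss_kappa:.4f}")
```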

Cronbach's alpha

It is used to measure the internal consistency reliability.

Using Minitab

Suppose we have 7 questions (or 7 items) in a questionnaire. A total of 18 participants completed the questionnaire. Each question is measured utilizing a 5-point Likert item ranging from “strongly agree” (5) to “strongly disagree” (1). Suppose we got the following results:

| | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 |
|---|---|---|---|---|---|---|---|
| Individual1 | 5 | 1 | 3 | 2 | 1 | 4 | 2 |
| Individual2 | 1 | 5 | 4 | 3 | 5 | 5 | 4 |
| Individual3 | 4 | 2 | 1 | 3 | 5 | 3 | 4 |
| Individual4 | 4 | 4 | 3 | 5 | 3 | 4 | 5 |
| Individual5 | 2 | 3 | 3 | 5 | 2 | 5 | 4 |
| Individual6 | 5 | 1 | 1 | 2 | 4 | 5 | 1 |
| Individual7 | 2 | 1 | 1 | 3 | 1 | 5 | 2 |
| Individual8 | 3 | 2 | 2 | 5 | 2 | 5 | 4 |
| Individual9 | 2 | 1 | 5 | 3 | 1 | 3 | 4 |
| Individual10 | 4 | 4 | 1 | 4 | 2 | 5 | 4 |
| Individual11 | 4 | 3 | 4 | 3 | 4 | 4 | 5 |
| Individual12 | 5 | 1 | 1 | 3 | 3 | 5 | 3 |
| Individual13 | 5 | 4 | 3 | 2 | 5 | 3 | 4 |
| Individual14 | 2 | 4 | 4 | 2 | 4 | 4 | 5 |
| Individual15 | 2 | 4 | 3 | 2 | 2 | 3 | 2 |
| Individual16 | 4 | 5 | 2 | 3 | 4 | 2 | 3 |
| Individual17 | 3 | 4 | 4 | 1 | 4 | 2 | 4 |
| Individual18 | 2 | 4 | 3 | 5 | 1 | 3 | 5 |

Put these numerical values below Q1 to Q7 in Minitab in the columns C1 to C7. Now go to the “Stat” tab above, go to “Multivariate”, and click “Item Analysis” in the 2nd section. In the window that appears, move all the items (C1 to C7) to the box under “Variables:”, so that the items “Q1-Q7” appear in the box. Click the button “Results” and select “Cronbach’s Alpha”. Some other items may also be selected. Click “OK”. Click the button “Graphs” and deselect “Matrix plot of data with smoother”. Click “OK”, and click “OK” again in the “Item Analysis” window. This gives the results for Cronbach’s Alpha as follows:

Cronbach’s Alpha

Alpha: 0.09139


Interpretation of result: This value of Cronbach’s Alpha (0.091) is unacceptable as it is lower than 0.5, showing a low level of internal consistency for our scale.
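Cronbach's alpha can also be computed directly from the responses. The sketch below uses the common formula alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), with sample variances; the function name `cronbach_alpha` is hypothetical, and this is an illustration rather than a description of Minitab's internal procedure.

```python
import statistics

# Rows = Individual1 ... Individual18, columns = Q1 ... Q7.
responses = [
    [5, 1, 3, 2, 1, 4, 2], [1, 5, 4, 3, 5, 5, 4], [4, 2, 1, 3, 5, 3, 4],
    [4, 4, 3, 5, 3, 4, 5], [2, 3, 3, 5, 2, 5, 4], [5, 1, 1, 2, 4, 5, 1],
    [2, 1, 1, 3, 1, 5, 2], [3, 2, 2, 5, 2, 5, 4], [2, 1, 5, 3, 1, 3, 4],
    [4, 4, 1, 4, 2, 5, 4], [4, 3, 4, 3, 4, 4, 5], [5, 1, 1, 3, 3, 5, 3],
    [5, 4, 3, 2, 5, 3, 4], [2, 4, 4, 2, 4, 4, 5], [2, 4, 3, 2, 2, 3, 2],
    [4, 5, 2, 3, 4, 2, 3], [3, 4, 4, 1, 4, 2, 4], [2, 4, 3, 5, 1, 3, 5],
]


def cronbach_alpha(rows):
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))."""
    k = len(rows[0])                                   # number of items
    item_variances = [statistics.variance(col) for col in zip(*rows)]
    total_variance = statistics.variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.5f}")
```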

Coefficient of variation

It is a measure of dispersion or relative variability.

Using Minitab

In order to work on Minitab, we can use the following example:

| | Weight loss program X | Weight loss program Y |
|---|---|---|
| Individual1 | 1 | 1 |
| Individual2 | 2 | 1 |
| Individual3 | 1 | 1 |
| Individual4 | 6 | 2 |
| Individual5 | 7 | 8 |
| Individual6 | 4 | 6 |
| Individual7 | 8 | 2 |
| Individual8 | 7 | 11 |
| Individual9 | 9 | 22 |

Put these values of “Weight loss program X” and “Weight loss program Y” in the columns C1 and C2, respectively. Now go to the “Stat” tab above, go to “Basic Statistics”, and click “Display Descriptive Statistics” in the 1st section. In the window “Display Descriptive Statistics”, move the columns C1 and C2 to the box under “Variables:” by selecting them and clicking the button “Select”. Click “Statistics”. A window “Display Descriptive Statistics: Statistics” appears. Check “Coefficient of variation” and any other descriptive statistics you want. Click “OK”. Again click “OK” in the “Display Descriptive Statistics” window. Results appear in the screen above the spreadsheet, and the results for the “Mean”, “Standard deviation” (StDev), and “Coefficient of variation” (CoefVar) are as follows:

| Variable | Mean | StDev | CoefVar |
|---|---|---|---|
| Weight loss program X | 5.00 | 3.08 | 61.64 |
| Weight loss program Y | 6.00 | 7.00 | 116.67 |

Interpretation of results: These results show that the program Y has more variability relative to its mean as compared to program X.
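The CoefVar column is the standard deviation expressed as a percentage of the mean. A minimal Python sketch of that calculation follows, assuming the sample standard deviation is used; the function name is hypothetical.

```python
import statistics

program_x = [1, 2, 1, 6, 7, 4, 8, 7, 9]
program_y = [1, 1, 1, 2, 8, 6, 2, 11, 22]


def coefficient_of_variation(data):
    """Return the sample standard deviation as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100


cv_x = coefficient_of_variation(program_x)
cv_y = coefficient_of_variation(program_y)
print(f"CV of program X = {cv_x:.2f}%, CV of program Y = {cv_y:.2f}%")
```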


Chebyshev's Theorem

This theorem states that for any set of observations or numbers, the minimum proportion of the values that lie within k standard deviations of the mean is 1 − 1/k², where k is a positive number greater than 1. This theorem applies to all types of distributions of data.

Using Minitab

Suppose k=2 as noted in the above example. Go to the “Calc” tab above and click “Calculator”. In the window “Calculator”, in “Store result in variable:” in the right box, enter a column, such as C5. In the box below “Expression:”, enter “1-(1/2^2)” and click “OK”. The result will appear in the spreadsheet. In this case, the result is 0.75 (or 75%).
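The same bound can be wrapped in a small function so that other values of k can be tried. This is an illustrative sketch; the function name is hypothetical.

```python
def chebyshev_min_proportion(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_min_proportion(k):.2%} of the values")
```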

Factorial

The factorial of a given positive integer n, written n!, is the product of all the positive integers less than or equal to n.

Using Minitab

Take the example as noted above, i.e. 4!. Go to the “Calc” tab above and click “Calculator”. In the window “Calculator”, in “Store result in variable:” in the right box, enter a column, such as C5. In the box below “Expression:”, enter “FACTORIAL(4)” and click “OK”. The result will appear in the spreadsheet. In this case, the result is 24.
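For comparison, the product can be written out as a short loop. The sketch below defines a hypothetical `factorial` function for positive integers and checks it against Python's built-in `math.factorial`.

```python
import math


def factorial(n):
    """Product of all positive integers less than or equal to n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    product = 1
    for i in range(1, n + 1):
        product *= i
    return product


print(factorial(4), math.factorial(4))  # 4 x 3 x 2 x 1
```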

Distribution, and Standardization

Suppose there is a place known as “Virana” where beneficial viruses are living and fighting against some harmful bacteria. Those harmful bacteria are assigned a “value-of-danger” from 1 to 9 according to their harmfulness to viruses, i.e. the higher the number, the more harmful the bacteria are. Suppose we have different numbers of bacteria (frequency) according to their value-of-danger, as shown in the following table:

| Frequency (number of bacteria) | Value-of-danger |
|---|---|
| 200,000 | 1 |
| 300,000 | 2 |
| 500,000 | 3 |
| 900,000 | 4 |
| 1,000,000 | 5 |
| 900,000 | 6 |
| 500,000 | 7 |
| 300,000 | 8 |
| 200,000 | 9 |

Then a frequency distribution plot could be developed as follows:

[Figure: Frequency distribution plot. The y-axis shows Frequency (number of bacteria) from 0 to 1,200,000; the x-axis shows Value-of-danger from 0 to 10.]

This frequency distribution plot shows a “bell curve”, as it looks like a bell. Technically, this type of distribution is known as the normal or Gaussian distribution, named after the mathematician Carl Friedrich Gauss. This type of distribution is commonly found in nature; for example, blood sugar levels and heart rates approximately follow this type of distribution.

The normal distribution has three important characteristics:
1. In a normal distribution, the mean, median, and mode are equal to each other.
2. In this type of distribution, there is symmetry about the central point.
3. Half of the values, i.e. 50%, are less than the mean value, and half of the values, i.e. 50%, are more than the mean value.

In the above illustration, the value-of-danger “5” is the mean and the median, and as it is also the most commonly found value (the one with the largest number of bacteria), it is also the mode. Disturbance in the values would disturb the normal distribution, leading to a non-normal or non-Gaussian distribution in which there is no proper bell-shaped curve. The normal frequency distribution has a close relationship with standard deviations:

• About 68.27% of all values lie within one standard deviation of the mean on both sides, i.e. a total of two standard deviations,
• About 95% of all values lie within two standard deviations of the mean on both sides, i.e. a total of four standard deviations, and
• About 99.7% of all values lie within three standard deviations of the mean on both sides, i.e. a total of six standard deviations.
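These percentages come from the area under the standard normal curve. As a quick check, the sketch below uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8); the function name `proportion_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {proportion_within(k):.2%}")
```

The value for two standard deviations is slightly above 95%, which is why the text says “about 95%”.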

The number of standard deviations from the mean is known as the “Standard Score”, “z-score”, or “sigma”. In order to convert a value to a z-score or Standard Score, subtract the mean from the value and then divide the result by the Standard Deviation. It is represented as

$$z = \frac{x - \mu}{\sigma}$$

in which z is the z-score, x is the value to be standardized, μ is the mean value, and σ is the Standard Deviation. This process of getting a z-score is known as “Standardizing.” Let’s consider the frequency distribution and frequency distribution plot again:

[Figure: Frequency distribution plot. The y-axis shows Frequency (number of bacteria) from 0 to 1,200,000; the x-axis shows Value-of-danger from 0 to 10.]

And its values:

| Frequency (number of bacteria) | Value-of-danger |
|---|---|
| 200,000 | 1 |
| 300,000 | 2 |
| 500,000 | 3 |
| 900,000 | 4 |
| 1,000,000 | 5 |
| 900,000 | 6 |
| 500,000 | 7 |
| 300,000 | 8 |
| 200,000 | 9 |

In this table, the mean value is 5, and its Standard Deviation can be calculated from the variance. Taking the nine values-of-danger as the data, the variance is calculated as

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{(1-5)^2 + (2-5)^2 + \cdots + (9-5)^2}{9} = \frac{60}{9} \approx 6.67$$

So, Standard Deviation = σ = √6.67 ≈ 2.58.

In order to get the z-score of the first value, i.e. 1, first subtract the mean. So, it will be 1-5 = -4. Then the result is divided by the Standard Deviation. So, it will be -4/2.58 = -1.55. So, the z-score will be -1.55, i.e. the value-of-danger 1 is 1.55 Standard Deviations below the mean. If we calculate the z-scores of all the values and place them in the table, we get:

| Frequency (number of bacteria) | Value-of-danger | z-score |
|---|---|---|
| 200,000 | 1 | (1-5)/2.58 = -1.55 |
| 300,000 | 2 | (2-5)/2.58 = -1.16 |
| 500,000 | 3 | (3-5)/2.58 = -0.78 |
| 900,000 | 4 | (4-5)/2.58 = -0.39 |
| 1,000,000 | 5 | (5-5)/2.58 = 0 |
| 900,000 | 6 | (6-5)/2.58 = 0.39 |
| 500,000 | 7 | (7-5)/2.58 = 0.78 |
| 300,000 | 8 | (8-5)/2.58 = 1.16 |
| 200,000 | 9 | (9-5)/2.58 = 1.55 |
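The standardization step can be sketched in Python as follows. The sketch computes z-scores twice: once with the population standard deviation (dividing by N, as in the manual calculation) and once with the sample standard deviation (dividing by N − 1), which is a convention that statistical software often uses. Because the code does not round σ to 2.58 before dividing, the second decimal of a population z-score can differ by one unit from the table.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # value-of-danger

mean = statistics.mean(values)
sd_population = statistics.pstdev(values)  # divides by N
sd_sample = statistics.stdev(values)       # divides by N - 1

z_population = [(x - mean) / sd_population for x in values]
z_sample = [(x - mean) / sd_sample for x in values]

for x, zp, zs in zip(values, z_population, z_sample):
    print(f"value {x}: z (population sd) = {zp:.3f}, z (sample sd) = {zs:.5f}")
```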

From this table, Standard Normal Distribution Graph could also be obtained in which z-scores are along x-axis and frequency (number of bacteria) is along y-axis.

The graph shows that nearly 68.27% of the values in the “value-of-danger” are present within one standard deviation of the mean on both sides, i.e. from -1 to 1. In that case, 68.27% is also considered as a confidence interval between the upper limit of z-score = 1 and the lower limit of z-score = -1. On a further note, nearly 95% of all values are present within two standard deviations of the mean on both sides, i.e. from -2 to 2.


In case of disturbance in the values, the normal graph could change into a non-normal one and start showing, for example, a binomial or Poisson distribution.

Using Minitab

In order to calculate z-scores, enter the values in the columns. We can take the example of “Frequency (number of bacteria)” and their “Value-of-danger”.

| Frequency (number of bacteria) | Value-of-danger |
|---|---|
| 200,000 | 1 |
| 300,000 | 2 |
| 500,000 | 3 |
| 900,000 | 4 |
| 1,000,000 | 5 |
| 900,000 | 6 |
| 500,000 | 7 |
| 300,000 | 8 |
| 200,000 | 9 |

In this case, we can enter “Frequency (number of bacteria)” in the first column (C1), in rows 1 to 9, and “Value-of-danger” in the second column (C2), in rows 1 to 9. Now go to the “Calc” tab above and click “Standardize” in the 1st section. In the window that appears, in “Input column(s):”, enter the name of the column for which z-scores have to be calculated. In this case, we enter C2 into this box. In the box below “Store results in:”, enter the name of the column in which you want to show your results. In this case, we can enter C5. Click “OK”. Results appear in column C5. Results are as follows:

-1.46059, -1.09545, -0.73030, -0.36515, 0.00000, 0.36515, 0.73030, 1.09545, 1.46059

These are the values of the z-scores, so you can give C5 the name “z-scores”. It is important to note that these z-scores are calculated with the sample standard deviation, not with the population standard deviation used in the manual calculation above.

A graph of z-scores on the x-axis and frequency (number of bacteria) on the y-axis can also be plotted. Go to “Graph”, click “Scatterplot”, select “With Connect Line”, and click “OK”. A window “Scatterplot: With Connect Line” appears. Move C1 to the cell under “Y variables” and C5 to the cell under “X variables”. Click “OK”. The following graph is obtained:

[Figure: Scatterplot of Frequency (number of bacteria) vs z-scores. The y-axis shows Frequency (number of bacteria) from 100,000 to 1,000,000; the x-axis shows z-scores from -1.5 to 1.5.]

Prediction interval

It is a modified form of confidence interval that covers a range of values likely containing a future value. It works on the basis of pre-existing values and some sort of regression analysis.

The general formula is: sample estimate (predicted value or fitted value) ± t-multiplier × standard error of the prediction. When the predictor is $x_h$, the prediction interval is

$$\hat{y}_h \pm t_{(\alpha/2,\,n-2)} \times \sqrt{MSE\left(1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right)}$$

The standard error of the prediction is similar to the standard error of the fit, which appears in the confidence interval formula:

$$\hat{y}_h \pm t_{(\alpha/2,\,n-2)} \times \sqrt{MSE\left(\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right)}$$

The difference from the confidence interval is the extra “1 +” inside the bracket. Prediction intervals are therefore wider as compared to confidence intervals.


Using Minitab

In order to work on prediction interval, we can use the following example of weights and heights of 8 males in the age range of 18 years to 28 years:

| Serial no. | Weight of person (kg) | Height of person (cm) |
|---|---|---|
| 1 | 84.2 | 189 |
| 2 | 98.8 | 178 |
| 3 | 62.6 | 176 |
| 4 | 73.6 | 172 |
| 5 | 67.4 | 168 |
| 6 | 77.9 | 179 |
| 7 | 72.9 | 186 |
| 8 | 68.6 | 164 |

In order to find the 95% prediction interval for the weight of a randomly selected individual who is 168 cm tall, and the 95% confidence interval for the average weight of all individuals who are 168 cm tall, we enter the values in the columns. In this case, the values of “Serial no.” are entered in the first column, i.e. C1; the values of “Weight of person (kg)” are entered in the second column, i.e. C2, and the values of “Height of person (cm)” are entered in the third column, i.e. C3. Now go to “Stat”, go to “Regression”, go to “Regression”, and click “Fit Regression Model”. In the box under “Responses:” move C2, and in the box under “Continuous predictors:” move C3. Click “OK”. This gives the following results:

Regression Equation

Weight of person (kg) = -27.6 + 0.586 Height of person (cm)

Now again go to “Stat”, go to “Regression”, go to “Regression”, and click “Predict”. In the box “Response:”, select “Weight of person (kg)” (if it is not already selected). In the drop down menu, select “Enter individual values”, and in the top box under “Height of person (cm)”, enter 168. Click “OK”. This gives the following results:

Prediction

| Fit | SE Fit | 95% CI | 95% PI |
|---|---|---|---|
| 70.7731 | 5.75709 | (56.6860, 84.8602) | (40.1473, 101.399) |

Here, 95% CI is the 95% Confidence Interval, and 95% PI is the 95% Prediction Interval. So, the 95% prediction interval for the weight of a randomly selected individual who is 168 cm tall is 40.1473-101.399 kg, and the 95% confidence interval for the average weight of all individuals who are 168 cm tall is 56.6860-84.8602 kg. In other words, we can be 95% confident that the weight of a single future individual who is 168 cm tall will fall within the range of 40.1473-101.399 kg.
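The intervals can be recomputed step by step from the standard simple linear regression formulas. The following Python sketch is an illustration under one stated assumption: the t-multiplier `T_CRIT` for 95% confidence with n − 2 = 6 degrees of freedom is typed in as a constant taken from t tables, because the Python standard library has no t-distribution function. The function name `intervals` is hypothetical.

```python
import math

heights = [189, 178, 176, 172, 168, 179, 186, 164]          # predictor (cm)
weights = [84.2, 98.8, 62.6, 73.6, 67.4, 77.9, 72.9, 68.6]  # response (kg)

T_CRIT = 2.446912  # assumed table value of t(0.025, 6)

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(weights) / n
sxx = sum((x - x_bar) ** 2 for x in heights)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(heights, weights))
mse = sse / (n - 2)


def intervals(x_h):
    """Return (fit, SE of fit, confidence interval, prediction interval) at x_h."""
    fit = intercept + slope * x_h
    leverage = 1 / n + (x_h - x_bar) ** 2 / sxx
    se_fit = math.sqrt(mse * leverage)
    se_prediction = math.sqrt(mse * (1 + leverage))
    ci = (fit - T_CRIT * se_fit, fit + T_CRIT * se_fit)
    pi = (fit - T_CRIT * se_prediction, fit + T_CRIT * se_prediction)
    return fit, se_fit, ci, pi


fit, se_fit, ci, pi = intervals(168)
print(f"Weight = {intercept:.1f} + {slope:.3f} Height")
print(f"Fit = {fit:.4f}, SE Fit = {se_fit:.5f}")
print(f"95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"95% PI = ({pi[0]:.4f}, {pi[1]:.3f})")
```

The only difference between the two intervals is the extra 1 added inside the square root for the prediction interval, which is what makes it wider.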

Tolerance interval

A tolerance interval, also known as an enclosure interval, is an interval covering a fixed proportion (or percent) of the population with a particular confidence. The endpoints, i.e. a maximum value and a minimum value, are known as tolerance limits.

For a normally distributed population, the tolerance limits are the sample mean ± k × the sample standard deviation, where the factor k (due to Howe) is

$$k = z_{(p+1)/2}\sqrt{\frac{df\,(1 + 1/n)}{\chi^2_{1-\alpha,\,df}}}$$

Here $z_{(p+1)/2}$ is the critical value of the normal distribution related to the cumulative probability (p+1)/2, and $\chi^2_{1-\alpha,\,df}$ is the critical value of the chi-square distribution with df degrees of freedom exceeded with probability α.

Tolerance limits are different from confidence limits:
• Tolerance limit: a limit within which a specific proportion (or percentage) of the population may lie with stated confidence. For example, with 95% probability, 95% of the population lies between the lower tolerance limit and the upper tolerance limit.
• Confidence limit: a limit within which a given population parameter, such as the mean, may lie with stated confidence. For example, with 95% probability, the mean lies between the lower confidence limit and the upper confidence limit.

Using Minitab

In order to work on tolerance interval, we can use the following example:

Serial no.   Weight of person (kg)
1            84.2
2            98.8
3            62.6
4            73.6
5            67.4
6            77.9
7            72.9
8            68.6

Suppose we want to know the tolerance interval, i.e. the weight range within which the weight of at least 95% of all men in a community lies. We enter the values in the columns. In this case, the values of “Serial no.” are entered in the first column, i.e. C1, and the values of “Weight of person (kg)” are entered in the second column, i.e. C2. Now go to “Stat”, go to “Quality Tools”, and click “Tolerance Intervals (Normal Distribution)”. From the drop-down menu, select “One or more samples, each in a column”, and in the box, enter C2, i.e. “Weight of person (kg)”. Click “OK.” This gives the following results:

95% Tolerance Interval

Variable                Normal Method       Nonparametric Method   Achieved Confidence
Weight of person (kg)   (32.955, 118.545)   (62.600, 98.800)       5.7%

Achieved confidence level applies only to nonparametric method.

In the accompanying graph, the Normal Probability Plot shows that the plotted points follow an approximately straight line. Furthermore, the p-value for the Normality Test is 0.364, which is >0.05. Therefore, there is no evidence that the data depart from a normal distribution, and the Normal Method results (obtained in the table above) can be used. According to these results, it can be said with 95% confidence that the weight of at least 95% of all men lies in the range of 32.955 kg to 118.545 kg.
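As a rough cross-check of the Normal Method, Howe's approximation for the factor k can be sketched in Python. This is only an illustration under stated assumptions: the chi-square critical value (2.16735 for 7 degrees of freedom) is hard-coded from a standard table, and the variable names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

weights = [84.2, 98.8, 62.6, 73.6, 67.4, 77.9, 72.9, 68.6]
n = len(weights)
df = n - 1
p = 0.95  # proportion of the population to be covered

# Assumption: table value of the chi-square distribution with 7 df that is
# exceeded with probability 0.95 (the confidence level).
chi2_crit = 2.16735

z = NormalDist().inv_cdf((p + 1) / 2)            # normal critical value
k = z * math.sqrt(df * (1 + 1 / n) / chi2_crit)  # Howe's approximation

x_bar, s = mean(weights), stdev(weights)
lower, upper = x_bar - k * s, x_bar + k * s
print(f"k = {k:.3f}, interval = ({lower:.2f}, {upper:.2f})")
```

The approximate interval, roughly 33.1 kg to 118.4 kg, is close to but not identical with the Minitab result (32.955, 118.545); the small difference presumably comes from Minitab computing k by a different method than this approximation.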

Parameters to describe the form of a distribution

Different parameters to describe the form of a distribution are (1) location parameters, (2) scale parameters, and (3) shape parameters.

Location parameter (x0): to determine the shift or location of a distribution.

$$f_{x_0}(x) = f(x - x_0)$$

Examples:
• Mean
• Median
• Mode

Scale parameter (s): to describe the width of a probability distribution.

$$f_s(x) = f(x/s)/s$$

Examples:
• Large s: more spread-out distribution
• Small s: more concentrated distribution

Shape parameters: to describe all other parameters beyond location parameter and scale parameter.

Examples:
• Skewness
• Kurtosis

Skewness

Skewness helps in knowing about the symmetry (or the lack of symmetry) of a distribution or data set. A distribution is said to be symmetric if it appears the same to the right and left of the center point.

Kurtosis

Kurtosis is a measure of the degree of peakedness or flatness of a frequency-distribution curve, i.e. the extent to which a distribution shows a peak or flatness.

Using Minitab

Suppose 14 people work in a company. They have different salaries as noted in the table below:

Salary (USD) per month   Number of people
500                      1
750                      2
1000                     3
1250                     5
1500                     2
1750                     1

So, the table means that the salary USD 500 is repeated 1 time; the salary USD 750 is repeated 2 times; the salary USD 1000 is repeated 3 times, and so on. So, we can make the following table:

Salary (USD) per month
500
750
750
1000
1000
1000
1250
1250
1250
1250
1250
1500
1500
1750

Enter the values in this way in the first column, i.e. from row 1 to row 14 in this case. Go to the “Stat” tab above, go to “Basic Statistics”, and click “Display Descriptive Statistics”. In the window “Display Descriptive Statistics”, move column 1, i.e. C1 in this case, from the left box to the “Variables:” in the right box by clicking on C1 and clicking the “Select” button below. Click “Statistics”. A window “Display Descriptive Statistics: Statistics” appears. Check “Skewness” and “Kurtosis”, and any other descriptive statistics you may want. Click “OK.” Click “Graphs” and select “Histogram of data, with normal curve”. Click “OK.” Again click “OK” in the “Display Descriptive Statistics” window. Results appear in the screen above the spreadsheet. In this case, the results are as follows:

Variable                 Skewness   Kurtosis
Salary (USD) per month   -0.18      -0.09

And the following graph appears:

[Figure: Histogram (with Normal Curve) of Salary (USD) per month; Mean 1143, StDev 335.6, N 14; x-axis: Salary (USD) per month; y-axis: Frequency]

So, the results show slightly negative skewness (a slightly longer tail on the left side) and slightly negative kurtosis (a slightly flatter shape than the normal curve).
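The same two statistics can be reproduced outside Minitab with a short Python sketch. The bias-adjusted sample formulas used here are an assumption about how the values are computed (they are the adjusted formulas commonly used by statistical packages), and the variable names are illustrative only.

```python
from statistics import mean, stdev

# The 14 salaries, expanded from the frequency table
salaries = [500] + [750] * 2 + [1000] * 3 + [1250] * 5 + [1500] * 2 + [1750]
n = len(salaries)
m, s = mean(salaries), stdev(salaries)

# Standardized values using the sample standard deviation
z = [(x - m) / s for x in salaries]

# Assumed formulas: bias-adjusted sample skewness and excess kurtosis
skewness = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
kurtosis = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

print(round(skewness, 2), round(kurtosis, 2))
```

With these formulas, the rounded values agree with the skewness and kurtosis reported above.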

Different functions of distributions

Different functions of probability distributions represent different aspects of those distributions. These functions may include (1) probability density function (PDF), (2) cumulative distribution function (CDF), (3) survival function (SF), (4) percentile point function (PPF), and (5) inverse survival function (ISF).

• Probability density function (PDF): to get the probability of a value in a certain interval. Example: What is the probability that the age of a man is between 40 and 45 years?
• Cumulative distribution function (CDF): to obtain the probability of a value less than a given value. Example: What is the probability that the age of a man is less than 45 years?
• Survival function (SF), i.e. 1 − CDF: to obtain the probability of a value larger than a given value. Example: What is the probability that the age of a man is more than 45 years?
• Percentile point function (PPF), i.e. the inverse of CDF: to obtain the value below which a given probability lies. Example: Below what age do 95% of men fall?
• Inverse survival function (ISF), i.e. the inverse of SF: to obtain the value above which a given probability lies. Example: Above what age do 95% of men fall?

[Figure: example plots for the PDF, CDF, SF, PPF, and ISF]
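The five functions can be tried out with Python's standard library. This is a minimal sketch that assumes, purely for illustration, that men's ages follow a normal distribution with mean 40 and standard deviation 10; these numbers are hypothetical and do not come from the book.

```python
from statistics import NormalDist

# Hypothetical distribution of men's ages (assumed values)
ages = NormalDist(mu=40, sigma=10)

density_at_45 = ages.pdf(45)              # PDF: height of the density curve at 45
p_between = ages.cdf(45) - ages.cdf(40)   # probability of an age between 40 and 45
p_less = ages.cdf(45)                     # CDF: probability of an age below 45
p_more = 1 - ages.cdf(45)                 # SF = 1 - CDF: probability of an age above 45
age_ppf = ages.inv_cdf(0.95)              # PPF: age below which 95% of men fall
age_isf = ages.inv_cdf(1 - 0.95)          # ISF: age above which 95% of men fall

print(p_between, p_less, p_more, age_ppf, age_isf)
```

Note that the probability for an interval is obtained from the area under the density curve, which is why it is computed here as a difference of two CDF values.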

Probability Density Function

It shows the probability distribution for a continuous random variable.

Using Minitab

In order to work on probability density function, we can use the same example as noted above. In the first column, i.e. C1, enter the values 0 in the first row, 1 in the second row, 2 in the third row, and so on, until 24 in the 25th row. Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Uniform” in the first section. Select “Probability density”. In the “Lower endpoint:” enter 0, in the “Upper endpoint:” enter 24, and in the “Input column:” enter C1. Click “OK.” This gives the following result:

Continuous uniform on 0 to 24

x    f( x )
0    0.0416667
1    0.0416667
2    0.0416667
3    0.0416667
4    0.0416667
5    0.0416667
6    0.0416667
7    0.0416667
8    0.0416667
9    0.0416667
10   0.0416667
11   0.0416667
12   0.0416667
13   0.0416667
14   0.0416667
15   0.0416667
16   0.0416667
17   0.0416667
18   0.0416667
19   0.0416667
20   0.0416667
21   0.0416667
22   0.0416667
23   0.0416667
24   0.0416667

As we have to check for the hours between 4 pm and 6 pm, there are 2 hours. So, the probability is equal to 2×0.0416667=0.083333.

Cumulative Distribution Function

Using Minitab

In order to work on cumulative distribution function, we can use the same example as noted above. In the first column, i.e. C1, enter the values 0 in the first row, 1 in the second row, 2 in the third row, and so on, until 24 in the 25th row. Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Uniform”. Select “Cumulative probability”. In the “Lower endpoint:” enter 0, in the “Upper endpoint:” enter 24, and in the “Input column:” enter C1. Click “OK.” This gives the following result:

Cumulative Distribution Function

Continuous uniform on 0 to 24

x    P( X ≤ x )
0    0.00000
1    0.04167
2    0.08333
3    0.12500
4    0.16667
5    0.20833
6    0.25000
7    0.29167
8    0.33333
9    0.37500
10   0.41667
11   0.45833
12   0.50000
13   0.54167
14   0.58333
15   0.62500
16   0.66667
17   0.70833
18   0.75000
19   0.79167
20   0.83333
21   0.87500
22   0.91667
23   0.95833
24   1.00000
25   1.00000

So, the cumulative probability for 2 hours is 0.08333. For the hours between 4 pm and 6 pm, the same value is obtained as the difference P( X ≤ 18 ) − P( X ≤ 16 ) = 0.75000 − 0.66667 = 0.08333. With probability density function, it can be found that every hour has equal probability, and with cumulative distribution function, it can be found that with the addition of every hour, the probability increases.
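The two uniform tables can be sketched in a few lines of Python. The function names below are hypothetical helpers, written only to illustrate the density and the cumulative probability of a continuous uniform distribution on 0 to 24.

```python
def uniform_pdf(x, a, b):
    """Density of the continuous uniform distribution on [a, b]."""
    return 1 / (b - a) if a <= x <= b else 0.0


def uniform_cdf(x, a, b):
    """Cumulative probability P(X <= x) for the uniform distribution on [a, b]."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


# Probability of the 2 hours between 4 pm (16) and 6 pm (18)
p_4pm_to_6pm = uniform_cdf(18, 0, 24) - uniform_cdf(16, 0, 24)
print(round(uniform_pdf(5, 0, 24), 7), round(p_4pm_to_6pm, 5))
```

The difference of two cumulative probabilities gives the same answer as multiplying the constant density by the width of the interval.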

Types of Distribution

Binomial Distribution

Using Minitab

In order to work on Binomial Probability Distribution, we can use the same example as above. In the first column, i.e. C1, enter the values 1 in the first row, 2 in the second row, 3 in the third row, and so on, until 8 in the 8th row. Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Binomial” in the 2nd section. Select “Probability”. In the “Number of trials:” enter 8, in the “Event probability:” enter 0.5, and in the “Input column:” enter C1. Click “OK.” This gives the following result:

Binomial with n = 8 and p = 0.5

x   P( X = x )
1   0.031250
2   0.109375
3   0.218750
4   0.273438
5   0.218750
6   0.109375
7   0.031250
8   0.003906

As we need the likelihood of no less than 6, we consider the values for 6, 7, and 8. These are 0.109375, 0.031250, and 0.003906, respectively. These values on addition give 0.144531, which is equal to 37/256 (as noted in the example above). In order to make a graph, go to the “Graph” tab above, go to “Probability Distribution Plot” in the 2nd section, select “View Single”, and click “OK.” Now select “Binomial” under “Distribution:”. In the “Number of trials:” enter 8, and in the “Event probability:” enter 0.5. This gives the following graph:

[Figure: Distribution Plot; Binomial, n=8, p=0.5; x-axis: X; y-axis: Probability]
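The binomial probabilities in the table, and the sum for no less than 6, can be checked with a short Python sketch. The helper name `binomial_pmf` is hypothetical.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial distribution with n trials and event probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 8, 0.5
table = {x: binomial_pmf(x, n, p) for x in range(1, n + 1)}

# Likelihood of no less than 6, i.e. P(X = 6) + P(X = 7) + P(X = 8)
p_at_least_6 = sum(table[x] for x in (6, 7, 8))
print(round(p_at_least_6, 6))
```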

Chi-squared Distribution

Using Minitab

In order to work on the Chi-squared distribution in Minitab, we can take the example of a classroom having a teacher and students on their seats. It is reasonable to suppose that the students having the seats in the first row will get more information from the teacher as compared to the students with seats in the 7th row. In this case, it can be said that obtaining the information is a function of the distance of the row from the teacher. It can be considered as a Chi-squared distribution with 1 degree of freedom, i.e. parameter k=1. On the basis of this information, suppose there are 7 rows in the classroom. In the first column, i.e. C1, enter the values 1 in the first row, 2 in the second row, 3 in the third row, and so on, until 7 in the 7th row. Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Chi-square”. A window “Chi-Square Distribution” appears. Select “Probability density”. In the “Degrees of freedom:” enter 1, and in the “Input column:” enter C1. Click “OK.” This gives the following result:

Chi-Square with 1 DF

x   f( x )
1   0.241971
2   0.103777
3   0.051393
4   0.026995
5   0.014645
6   0.008109
7   0.004553

This result shows that the density, and so the likelihood of learning in this example, is highest in the first row. In order to make a graph, go to the “Graph” tab above, go to “Probability Distribution Plot”, select “View Single”, and click “OK.” Now select “Chi-Square” under “Distribution:”. In the “Degrees of freedom:” enter 1, and click “OK”. This gives the following graph:

[Figure: Distribution Plot; Chi-Square with 1 degree of freedom]
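The density values for the seven rows can be reproduced from the standard formula of the chi-squared density with k degrees of freedom. This is a minimal sketch; the function name `chi2_pdf` is a hypothetical helper.

```python
import math


def chi2_pdf(x, k):
    """Density of the chi-squared distribution with k degrees of freedom (x > 0)."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))


# Density for rows 1 to 7 with 1 degree of freedom
densities = {row: chi2_pdf(row, 1) for row in range(1, 8)}
for row, value in densities.items():
    print(row, round(value, 6))
```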

Continuous Uniform Distribution

Using Minitab

In order to work on Continuous Uniform Distribution, we can use the same example as noted above. In the first column, i.e. C1, enter the values 0 in the first row, 1 in the second row, 2 in the third row, and so on, until 30 in the 31st row. Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Uniform”. Select “Probability density”. In the “Lower endpoint:” enter 0, in the “Upper endpoint:” enter 30, and in the “Input column:” enter C1. Click “OK.” This gives the following result:

Continuous uniform on 0 to 30

x    f( x )
0    0.0333333
1    0.0333333
2    0.0333333
3    0.0333333
4    0.0333333
5    0.0333333
6    0.0333333
7    0.0333333
8    0.0333333
9    0.0333333
10   0.0333333
11   0.0333333
12   0.0333333
13   0.0333333
14   0.0333333
15   0.0333333
16   0.0333333
17   0.0333333
18   0.0333333
19   0.0333333
20   0.0333333
21   0.0333333
22   0.0333333
23   0.0333333
24   0.0333333
25   0.0333333
26   0.0333333
27   0.0333333
28   0.0333333
29   0.0333333
30   0.0333333

As we have to check for a reaction within 5 seconds, the values below 5 seconds are added. This is equal to 5×0.033333=0.166665 (=1/6). For 20 contenders, 20×0.166665=3.3333 ≈ 3, i.e. about 3 contenders are expected to react within 5 seconds.

Cumulative Poisson Distribution

Using Minitab

In order to work on Cumulative Poisson Distribution, we can use the same example as noted above. In the first column, i.e. C1, enter the values 0 in the first row, 1 in the second row, 2 in the third row, and so on, until 7 in the 8th row. Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Poisson”. Select “Cumulative probability”. In the box with “Mean:” enter 7, and in the “Input column:” enter C1. Click “OK.” This gives the following result:

Poisson with mean = 7

x   P( X ≤ x )
0   0.000912
1   0.007295
2   0.029636
3   0.081765
4   0.172992
5   0.300708
6   0.449711
7   0.598714

Here, P(X = 2) can be calculated by subtracting P(X ≤ 1), i.e. 0.007295, from P(X ≤ 2), i.e. 0.029636. So, the value is 0.022341. Based on this result, it can be said that there is about a 2.23% chance that exactly 2 errors occur in 5000 randomly selected lines of code.
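The subtraction of two cumulative probabilities can be verified with a short Python sketch. The helper names are hypothetical; the formula is the standard Poisson probability with mean 7.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def poisson_cdf(k, mean):
    """P(X <= k), obtained by adding the probabilities for 0, 1, ..., k."""
    return sum(poisson_pmf(i, mean) for i in range(k + 1))


# P(X = 2) as the difference of two cumulative probabilities
p_two = poisson_cdf(2, 7) - poisson_cdf(1, 7)
print(round(poisson_cdf(2, 7), 6), round(p_two, 6))
```

The difference equals the single probability `poisson_pmf(2, 7)`, which is why the cumulative table is enough to answer the question.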

Exponential Distribution

Using Minitab

Suppose a particular type of light bulb has a mean lifetime of about 1200 hours. Suppose we have 8 bulbs of that type with the following determined lifetimes:

Bulb number   Lifetime
1             3783.74
2             2243.01
3             1027.74
4             1202.26
5             778.17
6             7471.94
7             1834.41
8             528.42

In order to work on Exponential Distribution, we can use this example. In the first column, i.e. C1, enter the values 3783.74 in the first row, 2243.01 in the second row, 1027.74 in the third row, and so on, until 528.42 in the 8th row. Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Exponential”. Select “Probability density”. In the “Scale:” enter 1200, and in the “Threshold:” enter 0.0. In the “Input column:” enter C1. Click “OK.” This gives the following result:

Exponential with mean = 1200

x         f( x )
3783.74   0.0000356
2243.01   0.0001285
1027.74   0.0003539
1202.26   0.0003060
778.17    0.0004357
7471.94   0.0000016
1834.41   0.0001807
528.42    0.0005365

This shows that the density is higher for the lifetimes below 1200 hours. In order to make a graph, go to the “Graph” tab above, go to “Probability Distribution Plot”, select “View Single”, and click “OK.” Now select “Exponential” under “Distribution:”. In the “Scale:” enter 1200, and in the “Threshold:” enter 0.0. Click “OK.” This gives the following graph:

[Figure: Distribution Plot; Exponential, Scale=1200, Thresh=0; x-axis: X; y-axis: Density]

You can see the X and Y values of the graph by double clicking on the graph, selecting the third option that looks like a + sign, i.e. “Pinpoint X and Y Coordinates”, and taking the cursor to any point on the graph.
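The density column can be reproduced from the exponential density with scale 1200 and threshold 0. In this sketch, the helper names are hypothetical, and the cumulative probability up to 1200 hours is added only to illustrate how much of the distribution lies below the mean.

```python
import math

lifetimes = [3783.74, 2243.01, 1027.74, 1202.26, 778.17, 7471.94, 1834.41, 528.42]
scale = 1200  # mean lifetime in hours


def exponential_pdf(x, scale):
    """Density of the exponential distribution with threshold 0."""
    return math.exp(-x / scale) / scale if x >= 0 else 0.0


def exponential_cdf(x, scale):
    """P(X <= x) for the exponential distribution with threshold 0."""
    return 1 - math.exp(-x / scale) if x >= 0 else 0.0


for x in lifetimes:
    print(x, round(exponential_pdf(x, scale), 7))

p_below_mean = exponential_cdf(1200, scale)  # about 0.632
```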

Normal Distribution


Using Minitab

In order to work on Normal Distribution, we can use the same example as noted above. In the first column, i.e. C1, enter the values 26 in the first row, 33 in the second row, 65 in the third row, and so on, until 34 in the 20th row. Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Normal”. Select “Probability density”. In the “Mean:” enter 38.8, and in the “Standard deviation:” enter 11.4. In the “Input column:” enter C1. Click “OK.” This gives the following result:

Normal with mean = 38.8 and standard deviation = 11.4

x    f( x )
26   0.0186315
33   0.0307466
65   0.0024949
28   0.0223416
34   0.0320264
55   0.0127497
25   0.0168191
44   0.0315373
50   0.0215978
36   0.0339551
26   0.0186315
37   0.0345614
43   0.0326987
62   0.0044124
35   0.0331038
38   0.0349089
45   0.0301840
32   0.0292917
28   0.0223416
34   0.0320264

This shows that, among the entered values, the highest density is at 38 minutes, i.e. the value closest to the mean. In order to make a graph, go to the “Graph” tab above, go to “Probability Distribution Plot”, select “View Single”, and click “OK.” Now select “Normal” under “Distribution:”. In the “Mean:” enter 38.8, and in the “Standard deviation:” enter 11.4. Click “OK.” This gives the following graph:

[Figure: Distribution Plot; Normal, Mean=38.8, StDev=11.4; x-axis: X; y-axis: Density]
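The density column can be reproduced with Python's standard library, which also makes it easy to see which entered value has the highest density. This is a minimal sketch using the 20 values from the table.

```python
from statistics import NormalDist

times = [26, 33, 65, 28, 34, 55, 25, 44, 50, 36,
         26, 37, 43, 62, 35, 38, 45, 32, 28, 34]

dist = NormalDist(mu=38.8, sigma=11.4)

# Density for each entered value
for x in times:
    print(x, round(dist.pdf(x), 7))

# Entered value with the highest density (the one closest to the mean)
peak = max(times, key=dist.pdf)
print("highest density at", peak)
```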

Poisson Distribution

Using Minitab

In order to work on Poisson Distribution, we can use the same example as noted above. In the first column, i.e. C1, enter the values 0 in the first row, 1 in the second row, 2 in the third row, and so on, until 11 in the 12th row. Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Poisson”. Select “Probability”. In the “Mean:” enter 8, and in the “Input column:” enter C1. Click “OK.” This gives the following result:

Poisson with mean = 8

x    P( X = x )
0    0.000335
1    0.002684
2    0.010735
3    0.028626
4    0.057252
5    0.091604
6    0.122138
7    0.139587
8    0.139587
9    0.124077
10   0.099262
11   0.072190

This shows that on a given weekday, the probability of 11 cars visiting the park is 0.072. In order to make a graph, go to the “Graph” tab above, go to “Probability Distribution Plot”, select “View Single”, and click “OK.” Now select “Poisson” under “Distribution:”. In the “Mean:” enter 8. Click “OK.” This gives the following graph:

[Figure: Distribution Plot; Poisson, Mean=8; x-axis: X; y-axis: Probability]

Beta Distribution

Some example plots for beta probability density function and beta cumulative distribution function are presented below:

Using Minitab

Suppose the proportion of defective bulbs in a shipment follows a beta distribution with α = 3 and β = 6, and we want to know the probability that the shipment could have 30% to 40% of defective bulbs. In the first column, i.e. C1, enter the values 0 in the first row, 0.1 in the second row, 0.2 in the third row, and so on, until 1.0 in the 11th row. Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Beta”. Select “Cumulative probability”. In the “First shape parameter:” enter 3, and in the “Second shape parameter:” enter 6. In the “Input column:”, enter C1. Click “OK.” This gives the following result:

Beta with first shape parameter = 3 and second = 6

x     P( X ≤ x )
0.0   0.00000
0.1   0.03809
0.2   0.20308
0.3   0.44823
0.4   0.68461
0.5   0.85547
0.6   0.95019
0.7   0.98871
0.8   0.99877
0.9   0.99998
1.0   1.00000

The probability that the shipment could have 30% to 40% of defective bulbs could be calculated by subtracting the value 0.44823 (for 0.3) from 0.68461 (for 0.4). So, the answer (probability) is 0.23638.
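The subtraction can be checked in Python without any extra library. The sketch below relies on the fact that, for whole-number shape parameters, the beta cumulative probability can be written as a sum of binomial terms; the function name `beta_cdf_int` is a hypothetical helper and it only works for integer α and β.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """P(X <= x) for a beta distribution with integer shape parameters a and b.

    Uses the binomial-sum form of the cumulative probability,
    which is valid only when a and b are positive whole numbers.
    """
    n = a + b - 1
    return sum(comb(n, j) * x ** j * (1 - x) ** (n - j) for j in range(a, n + 1))


p_03 = beta_cdf_int(0.3, 3, 6)
p_04 = beta_cdf_int(0.4, 3, 6)
p_30_to_40 = p_04 - p_03  # larger cumulative value minus smaller one
print(round(p_03, 5), round(p_04, 5), round(p_30_to_40, 5))
```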

F Distribution

F Distribution is most commonly used in the Analysis of Variance and the F test.

Some example plots for probability density function of F distribution are presented below:

Using Minitab

Suppose we want the cumulative probability up to the value 2 for a statistic that follows the F-distribution, where the independent random samples have sizes n1=27 and n2=99999. This 99999 represents an infinite sample size. So, we take the degrees of freedom as n1-1=26 for the numerator and 99999 for the denominator. Now go to the “Calc” tab above, go to “Probability Distributions”, and click “F”. Select “Cumulative probability”. In the “Numerator degrees of freedom:” enter 26, and in the “Denominator degrees of freedom:” enter 99999. Select “Input constant:” and enter 2 in the box with it. Click “OK.” This gives the following result:

Cumulative Distribution Function

F distribution with 26 DF in numerator and 99999 DF in denominator

x   P( X ≤ x )
2   0.998196

This value shows the area under the curve (i.e. area under the curved line in the graph) up to 2.

Gamma Distribution

Gamma distribution is used for queuing models, and for financial and climatology services.

Some example plots for probability density function of gamma distribution are presented below:

Using Minitab

Suppose we don’t know the survival time of an experimental model (animal) that was exposed to a high dose of radiation. Again suppose the survival time is in weeks, and follows the gamma distribution with α=9 and β=14. In the first column, i.e. C1, enter the values 10 in the first row, 20 in the second row, 30 in the third row, and so on, until 140 in the 14th row. Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Gamma”. Select “Cumulative probability”. In the “Shape parameter:” enter 9, and in the “Scale parameter:” enter 14. In the “Input column:”, enter C1. Click “OK.” This gives the following result:

Cumulative Distribution Function

Gamma with shape = 9 and scale = 14

x     P( X ≤ x )
10    0.000000
20    0.000019
30    0.000390
40    0.002776
50    0.011135
60    0.031140
70    0.068094
80    0.124706
90    0.200011
100   0.289715
110   0.387520
120   0.486695
130   0.581352
140   0.667180

In order to find the probability that the experimental model (animal) will survive a minimum of 40 weeks, we will subtract the value for 40 from 1. Therefore, 1-0.002776=0.997224 is the probability of survival for a minimum of 40 weeks. In order to find the probability that the experimental model (animal) will survive between 70 weeks and 130 weeks, we will subtract their respective values. Therefore, 0.581352-0.068094=0.513258 is the probability of survival between 70 weeks and 130 weeks.
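Both survival probabilities can be checked with a short Python sketch. It relies on the closed form of the gamma cumulative probability that exists when the shape parameter is a whole number; the function name `gamma_cdf_int_shape` is a hypothetical helper and it is not valid for non-integer shapes.

```python
import math


def gamma_cdf_int_shape(x, shape, scale):
    """P(X <= x) for a gamma distribution whose shape is a positive whole number."""
    lam = x / scale
    tail = sum(math.exp(-lam) * lam ** k / math.factorial(k) for k in range(shape))
    return 1 - tail


# Survival for a minimum of 40 weeks
p_min_40 = 1 - gamma_cdf_int_shape(40, 9, 14)

# Survival between 70 weeks and 130 weeks
p_70_to_130 = gamma_cdf_int_shape(130, 9, 14) - gamma_cdf_int_shape(70, 9, 14)

print(round(p_min_40, 6), round(p_70_to_130, 6))
```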

Negative Binomial Distribution

Some example plots for probability density function of negative binomial distribution are presented below:

Using Minitab

Suppose we want to know the probability that the second six will be scored in the fifth throw in the consecutive and independent throws of one die. The event probability (probability of success, p) is 1/6, the number of trials (x) is 5, and the number of successes (r) is 2. Now go to the “Calc” tab above, go to “Probability Distributions”, and click “Negative Binomial” in the 2nd section. Select “Probability”. In the “Event probability:” enter 0.167 (i.e. ≈1/6), and in the “Number of events needed:” enter 2. Select “Input constant:”, and enter 5 in the box with it. Click “OK.” This gives the following result:

Probability Density Function

Negative binomial with p = 0.167 and r = 2

x   P( X = x )
5   0.0644804

* NOTE * X = total number of trials.

Now, suppose that Peter is a high school footballer and he scores 65% of his penalties. What is the probability that Peter scores his second penalty in the fifth attempt? In this example: p=0.65, x=5, r=2. By entering these values, we get the following results:

Probability Density Function

Negative binomial with p = 0.65 and r = 2

x   P( X = x )
5   0.0724587

* NOTE * X = total number of trials.
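Both negative binomial results can be reproduced with a short Python sketch, where X is the total number of trials. The helper name `negative_binomial_pmf` is hypothetical.

```python
from math import comb


def negative_binomial_pmf(x, r, p):
    """Probability that the r-th success occurs on trial x (X = total number of trials)."""
    return comb(x - 1, r - 1) * p ** r * (1 - p) ** (x - r)


p_die_rounded = negative_binomial_pmf(5, 2, 0.167)  # event probability as entered in Minitab
p_die_exact = negative_binomial_pmf(5, 2, 1 / 6)    # event probability without rounding
p_peter = negative_binomial_pmf(5, 2, 0.65)

print(round(p_die_rounded, 7), round(p_die_exact, 7), round(p_peter, 7))
```

The second value shows the effect of rounding 1/6 to 0.167: with the exact event probability, the result is 500/7776, which is slightly lower than the value obtained with 0.167.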

Gumbel Distribution

It is used to model the distribution of extreme values, i.e. maximum or minimum values such as, for example, peak temperatures of the year or maximum streamflow.

Using Minitab

The “Calculator” function could be used for the Gumbel Distribution, i.e. enter the formula associated with Gumbel Distribution in the box below “Expression:”.
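As an illustration of the kind of formula that could be entered, the following Python sketch evaluates the Gumbel distribution for maximum values. The location (mu) and scale (beta) values are hypothetical and chosen only for demonstration, and the helper names are not taken from Minitab.

```python
import math


def gumbel_pdf(x, mu, beta):
    """Density of the Gumbel (maximum) distribution with location mu and scale beta."""
    z = (x - mu) / beta
    return math.exp(-(z + math.exp(-z))) / beta


def gumbel_cdf(x, mu, beta):
    """P(X <= x) for the Gumbel (maximum) distribution."""
    return math.exp(-math.exp(-(x - mu) / beta))


# Hypothetical example: yearly peak temperature with location 30 and scale 2
mu, beta = 30.0, 2.0
p_below_35 = gumbel_cdf(35, mu, beta)  # probability that the peak stays below 35
print(round(gumbel_pdf(30, mu, beta), 6), round(p_below_35, 6))
```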

Hypergeometric Distribution

It is similar to the binomial distribution, except that the sampling is done without replacement.

Using Minitab

Suppose a deck has 30 (N) cards, of which 12 (M) are red cards and 18 are black cards. Suppose 7 (n) cards are taken without replacement. What would be the probability that exactly 6 (x) red cards are taken? Go to the “Calc” tab above, go to “Probability Distributions”, and click “Hypergeometric”. Select “Probability”. In the “Population size (N):” enter 30, in the “Event count in population (M):” enter 12, and in the “Sample size (n):” enter 7. Select “Input constant:”, and enter 6 in the box with it. Click “OK.” This gives the following result:

Hypergeometric with N = 30, M = 12, and n = 7

x   P( X = x )
6   0.0081698

So, the probability is 0.0081698.
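The same probability can be obtained by counting combinations, as in this short Python sketch. The helper name `hypergeometric_pmf` is hypothetical.

```python
from math import comb


def hypergeometric_pmf(x, N, M, n):
    """Probability of exactly x events in a sample of n taken without replacement
    from a population of N items that contains M events."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)


# 6 red cards among 7 cards taken from 30 cards with 12 red cards
p_six_red = hypergeometric_pmf(6, N=30, M=12, n=7)
print(round(p_six_red, 7))
```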

Inverse Gamma Distribution

Inverse Gamma Distribution is a Pearson Type 5 Distribution that is most commonly used in Bayesian statistics.

Using Minitab

The “Calculator” function could be used to work on the Inverse Gamma Distribution.

Log Gamma Distribution

Using Minitab

The “Calculator” function could be used to work on the Log Gamma Distribution.

Laplace Distribution

Laplace distribution is a double exponential distribution.

Using Minitab

Suppose the return of a stock follows Laplace distribution with location (μ) = 7 and scale (λ) = 4. What would be the probability that the stock will return a value between 7 and 12? In the first column, i.e. C1, enter the value 1 in the first row, 2 in the second row, 3 in the third row, and so on, until 12 in the 12th row. Go to the “Calc” tab above, go to “Probability Distributions”, and click “Laplace”. Select “Cumulative probability”. In the “Location:” enter 7, and in the “Scale:” enter 4. Select “Input column:”, and enter C1. Click “OK.” This gives the following result:

Cumulative Distribution Function

Laplace with location = 7 and scale = 4

x    P( X ≤ x )
1    0.111565
2    0.143252
3    0.183940
4    0.236183
5    0.303265
6    0.389400
7    0.500000
8    0.610600
9    0.696735
10   0.763817
11   0.816060
12   0.856748

The probability that the stock will return a value between 7 and 12 could be calculated by subtracting the value 0.500000 (for 7) from 0.856748 (for 12). So, the answer is 0.356748.
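The subtraction can be verified from the closed form of the Laplace cumulative probability. This is a minimal Python sketch; the helper name `laplace_cdf` is hypothetical.

```python
import math


def laplace_cdf(x, location, scale):
    """P(X <= x) for the Laplace (double exponential) distribution."""
    if x < location:
        return 0.5 * math.exp((x - location) / scale)
    return 1 - 0.5 * math.exp(-(x - location) / scale)


# Probability of a return between 7 and 12
p_7_to_12 = laplace_cdf(12, 7, 4) - laplace_cdf(7, 7, 4)
print(round(laplace_cdf(12, 7, 4), 6), round(p_7_to_12, 6))
```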

Geometric Distribution

Geometric distribution is related to two probable outcomes in a trial, and describes the number of trials needed to get the first success.

Using Minitab

Suppose a person’s success probability is 0.3. What is the probability that the person has to select 5 other people until he finds his success, i.e. that the first success occurs on the 5th trial? Go to the “Calc” tab above, go to “Probability Distributions”, and click “Geometric”. Select “Probability”. In the “Event probability:” enter 0.3. Select “Input constant:”, and enter 5. Click “OK.” This gives the following result:

Probability Density Function

Geometric with p = 0.3

x   P( X = x )
5   0.07203

* NOTE * X = total number of trials.

So, the answer is 0.07203.
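The result follows from four failures followed by one success, which can be sketched in Python as follows. The helper name `geometric_pmf` is hypothetical.

```python
def geometric_pmf(x, p):
    """Probability that the first success occurs on trial x (X = total number of trials)."""
    return (1 - p) ** (x - 1) * p


# Four failures (probability 0.7 each) followed by one success (probability 0.3)
p_fifth = geometric_pmf(5, 0.3)
print(round(p_fifth, 5))
```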

Level of Significance and confidence level

Some events occur commonly, and they may have decreased significance; whereas, some events or outcomes occur rarely and they are of significant nature. The same thing applies in biostatistics. We usually say that the probability, which is denoted by capital ‘P’, of an event is low if it is a rare event. The level of significance, which can be considered in percentages, represents the value at which a rare event can be separated from other normal events. For example, researchers usually consider 5% or 0.05 as the level of significance, against which the P-value is compared. A significant result can be shown by the sign of ‘less than’, i.e. P < 0.05. So, an event is of significant nature in biostatistics if it has less than 5% probability. Significance level also refers to the probability of incorrectly rejecting a null hypothesis, i.e. the probability of committing a Type I error. In other words, the chances, which can be represented in percentages, of rejecting a true null hypothesis refer to the level of significance. Again take a look at the Type I error.

Type I error (α error, False Positive): suppose, in reality, H0 is true but we reject it incorrectly, because we think that the alternative hypothesis is right (positive).

As the level of significance is related to α-error, it is also known as α level. So, by changing the level of significance, the chances of Type I error can also be changed. The remaining level, which is obtained after removing the level of significance, is considered as the level of confidence, which is represented by γ, and is obtained by the equation γ = 100% - α level or γ = 1 - α level. Usually, 5% or 0.05 is considered as the level of significance, so 95% or 0.95 is considered as the level of confidence.

Statistical Estimation


Interval Estimation It refers to the use of sample data for the calculation of an interval of most probable values (two values, i.e. range of values) of an unknown population parameter.


Using Minitab

The “Calculator” function could be used to work on the interval estimation.


Best Point estimation Point estimation is the utilization of sample data for the calculation of single value, which is also referred to as statistic. This single value is considered as a “best estimate” or “best guess” of population parameter that is not known. For example, 1. sample variance is a point estimate of population variance, 2. sample mean is the point estimate of population mean.


Using Minitab

Sample mean and variance can be calculated using descriptive statistics.


Correlation

Correlation refers to systematic changes in the amount of one variable in relation to systematic changes in the amount of another variable. The correlation coefficient is represented by ‘r’ and it ranges from -1 through 0 to +1. A value of -1 shows a perfect negative relationship, i.e. with increase in one variable, the second variable decreases; a value of 0 shows no linear relationship; and a value of +1, which is the highest value, shows a perfect positive relationship, i.e. with increase in one variable, the second variable also increases. The two variables can be plotted on the x-axis and y-axis of a graph, and their correlation can be checked with the help of the line in the plot. (Correlation coefficient is further explained later in the book.)

Central Limit Theorem

According to Central Limit Theorem, the shape of the sampling distribution of the mean becomes almost normal as the sample size increases (n ≥ 30 is a commonly used rule of thumb), irrespective of the distribution in the population.


Standard Error of the Mean

Standard Error of Mean is represented by the symbol for standard deviation, i.e. σ, along with a subscript M representing mean. So, σM is the symbol for Standard Error of Mean. It could be calculated by the formula:

σM = σ / √n

Where σ is the standard deviation and n is the number of values in the sample.


It is helpful in knowing how well the sample represents the whole population, i.e. the smaller the standard error of mean, the more accurately the sample mean estimates the population mean. Standard error of mean also helps in working on confidence intervals.

Using Minitab

In order to calculate standard error of mean, we can use the same example as noted above. So, in the first column, i.e. C1, enter the value 14 in the first row; 36 in the second row; 45 in the third row; 70 in the fourth row, and 105 in the fifth row. Now go to “Stat”, go to “Basic Statistics”, and click “Display Descriptive Statistics ” Move the column name having variables for which standard error of mean has to be calculated. In this case, we have C1. Click “Statistics ” and check “SE of mean” and/or any other Descriptive statistics, such as “Mean” and “Standard deviation”, you want to do. Click “OK.” Now Click “OK.” This gives us the following results:


Statistics

Variable    Mean    SE Mean    StDev
C1          54.0    15.6       34.9
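The Minitab figures above can be reproduced by hand with the formula for the standard error of mean. The following is a minimal Python sketch using only the standard library; it assumes the sample standard deviation (divisor n - 1) is used for σ, which is consistent with the StDev that Minitab reports here.

```python
import math
import statistics

# The five values entered in column C1
values = [14, 36, 45, 70, 105]

mean = statistics.mean(values)
stdev = statistics.stdev(values)            # sample standard deviation (divisor n - 1)
se_mean = stdev / math.sqrt(len(values))    # standard error of mean = SD / sqrt(n)

print(f"Mean = {mean:.1f}, SE Mean = {se_mean:.1f}, StDev = {stdev:.1f}")
```

Rounded to one decimal place, the three numbers should match the Mean, SE Mean, and StDev columns of the output.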

Statistical Significance

Tests for Non-normally distributed data Non-parametric testing can be applied to ranked, ordinal, or continuous outcome variables, and for those variables that do not follow normal distribution, i.e. non-normally distributed data. For example, severity of pain is an ordinal outcome and can be ordered from no pain to severe pain to agonizing pain on a scale from 1 to 12.

One tail and two tail tests


Mood’s Median Test

It is a nonparametric test to determine whether the medians of two or more samples or populations or groups are equal or not. Suppose we are working on a disease and the pain associated with the disease, and we have worked on two different medicines, i.e. A and B. After receiving the treatments, we assessed the quality-of-life and pain through the Quality-of-Life Scale and overall ranking of pain respectively. After getting the scores for quality-of-life and pain, those scores are joined and an overall score out of 10 is developed. Higher scores show improvement, while lower scores show no effect of treatment.

Overall score out of 10    Medicines
7                          A
2                          A
5                          A
4                          A
6                          B
8                          B
6                          B
9                          B

Calculate overall median of 8 values, and in this case it is “6”. Now, for each medicine, count the number of observations that are equal to or less than the overall median (≤6) and greater than the overall median (>6). These values are placed in a 2×2 contingency table.

            A    B
>median     1    2
≤median     3    2


Now calculate the chi-square by using the equation χ² = Σ((O - E)² / E), where O shows observed values, E shows expected values, and the sum is taken over all four cells of the table. The expected value of each cell is (row total × column total) / overall total. Here, both medicines have 4 observations each, so the expected values are simply the row averages, as follows:

            A      B
>median     1.5    1.5
≤median     2.5    2.5

Now the Chi-square values of the four cells are as follows:

A, >median: (O - E)² / E = (1 – 1.5)² / 1.5 = 0.17
A, ≤median: (O - E)² / E = (3 – 2.5)² / 2.5 = 0.1
B, >median: (O - E)² / E = (2 – 1.5)² / 1.5 = 0.17
B, ≤median: (O - E)² / E = (2 – 2.5)² / 2.5 = 0.1

Total χ² = 0.167 + 0.1 + 0.167 + 0.1 = 0.534

Now we will check the Degrees of freedom (df). For χ², we will calculate df by the following formula, df = k - 1 = 2 - 1 = 1, where k is the number of groups. So, we have df = 1. Now we will compare our calculated χ² value, i.e. 0.534, with the χ² value in a table of χ² with 1 df. If the calculated value of the Chi-square test is more than the value in the Chi-square table, null hypothesis will be rejected. The χ² value at α = 0.05 and df = 1 is 3.841. Our observed χ² value is 0.534, which is less than 3.841. This is showing that the median of A medicine has a nonsignificant difference from the B medicine.
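The hand calculation above can be sketched in a few lines of Python. This is only an illustrative sketch of the steps described in the text (overall median, 2×k counts, expected values, chi-square), not Minitab's own routine; the function name `moods_median_chi_square` is hypothetical.

```python
import statistics

# Overall scores out of 10 for the two medicines
scores = {"A": [7, 2, 5, 4], "B": [6, 8, 6, 9]}


def moods_median_chi_square(groups):
    """Return (overall median, chi-square, df) for Mood's median test."""
    pooled = [v for values in groups.values() for v in values]
    overall_median = statistics.median(pooled)
    n = len(pooled)

    # Observed counts per group: greater than, and equal to or less than, the overall median
    above = {g: sum(v > overall_median for v in vals) for g, vals in groups.items()}
    at_or_below = {g: sum(v <= overall_median for v in vals) for g, vals in groups.items()}
    total_above = sum(above.values())
    total_at_or_below = sum(at_or_below.values())

    chi_square = 0.0
    for g, vals in groups.items():
        column_total = len(vals)
        for observed, row_total in ((above[g], total_above),
                                    (at_or_below[g], total_at_or_below)):
            expected = row_total * column_total / n   # row total x column total / overall total
            chi_square += (observed - expected) ** 2 / expected

    df = len(groups) - 1   # k - 1
    return overall_median, chi_square, df


median, chi_square, df = moods_median_chi_square(scores)
print(median, round(chi_square, 3), df)
```

Because the code does not round the individual cell contributions, its total (about 0.533) differs in the third decimal place from the sum of rounded terms shown above.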

Mood’s median test (summary of steps)

A nonparametric test. It is used to determine whether the medians of two or more samples or populations are equal or not.

1. Determination of hypothesis. Null Hypothesis – H0: Median of all groups is same. Alternative Hypothesis – H1: Median of at least one of the groups is different.
2. The overall median of the data is calculated.
3. For each group, count the number of observations that are equal to or less than the overall median (≤median) and greater than the overall median (>median).
4. The counts from step 3 are placed into a 2×k contingency table.
5. Calculate a Chi-square statistic by the formula χ² = Σ (Oij − Eij)² / Eij, where Oij is the observed cases in row i at column j, and Eij is the expected cases in row i at column j.
6. Determination of degree of freedom (DF): DF = k - 1, where k is the number of levels of categorical variables, or (simply) categories, or samples, or populations, etc.
7. Comparison of values: compare the χ² test statistic (i.e. calculated χ² value) with the χ² value in a table of χ² at the determined DF.
8. Interpretation of results: if the calculated value of the χ² test is more than the value in the χ² table, null hypothesis will be rejected.

Using Minitab

In order to work on Mood’s Median test, we can use the same example as noted above:

Overall score out of 10    Medicines
7                          A
2                          A
5                          A
4                          A
6                          B
8                          B
6                          B
9                          B

Enter the values in columns. In this case, we have “Overall score out of 10” in first column, i.e. C1, and “Medicines” in C2. Now go to “Stat”, go to “Nonparametrics”, and click “Mood’s Median Test” in the 3rd section. In the appeared window (Mood’s Median Test), move C1, i.e. “Overall score out of 10” in the box with “Response:”, and move C2, i.e. “Medicines” in the box with “Factor:”. Click “OK.” Following results appear:

Descriptive Statistics

Medicines    Median    N <= Overall Median    N > Overall Median    Q3 – Q1    CI
A            4.5       3                      1                     4.00       (2, 7)
B            7.0       2                      2                     2.75       (6, 9)
Overall      6.0

Levels with < 6 observations have confidence < 95.0%
95.0% CI for median(A) - median(B): (-7,1)

Test

Null hypothesis           H₀: The population medians are all equal
Alternative hypothesis    H₁: The population medians are not all equal

DF    Chi-Square    P-Value
1     0.53          0.465

Our observed χ2 value is 0.53 and p-value is 0.465. This p-value>0.05, thereby showing that the median of A medicine has a nonsignificant difference from the B medicine.

Goodness of Fit A goodness of fit test is used to check how well an observed frequency distribution (or observed set of data) fits a claimed distribution of a population (or an expected outcome). A goodness of fit test is among the most commonly used non-parametric tests. Chi-square test is commonly utilized for goodness of fit tests.

Chi-square test

Chi-square (χ²) test is a very simple non-parametric test. If we consider the pain as a variable, this test can be used to show an association between the efficacies of treatments in reducing the pain.

Testing Assumptions:
• Two variables have to be measured as categories, usually at a nominal or ordinal level.
• The study groups have to be independent of each other.


Using Minitab

In order to work on chi-square test, we can use the following data as an example:

Groups    Outcomes                                        Total
          Like Mathematics    Don’t like Mathematics
Boys      40                  10                          50
Girls     20                  30                          50
Total     60                  40                          100
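Before running the test in Minitab, the Pearson chi-square for this two-way table can be worked out directly, which also shows where the degrees of freedom come from. The following is a minimal Python sketch; the function name `pearson_chi_square` is only illustrative. It is run twice: once on the 2×2 table of Boys and Girls, and once with the “Total” row included as a third row, which is an assumption made here only to show how the degrees of freedom change.

```python
def pearson_chi_square(table):
    """Return (chi-square, df) for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)

    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)   # (rows - 1) x (columns - 1)
    return chi_square, df


# Boys and Girls only: a 2x2 table
chi_2x2, df_2x2 = pearson_chi_square([[40, 10], [20, 30]])
print(round(chi_2x2, 3), df_2x2)

# Same data with the "Total" row treated as a third row of the table
chi_3x2, df_3x2 = pearson_chi_square([[40, 10], [20, 30], [60, 40]])
print(round(chi_3x2, 3), df_3x2)
```

In both cases the chi-square value is the same, because the “Total” row matches its expected values exactly and contributes nothing, but the degrees of freedom are 1 for the 2×2 table and 2 when the “Total” row is included.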

Enter the values in columns. In this case, we have “Like Mathematics” in first column, i.e. C1, “Don’t like Mathematics” in C2, and “Total” in C3. Now go to “Stat”, go to “Tables”, and click “Chi-Square Test for Association” in the 2nd section. In the appeared window (Chi-Square Test for Association), select “Summarized data in a two-way table” from the drop down menu. Move C1 and C2 to the box under “Columns containing the table:”, and click “OK.” Results appear, and you may find that the expected values have also been generated below observed values. In the results, Pearson Chi-Square is 16.667 with 2 degrees of freedom (DF) and p-value is 0.000. Note that a DF of 2 is what results when the “Total” row is entered as a third row of the table; with only the “Boys” and “Girls” rows, the table is 2×2, the DF is (2 − 1) × (2 − 1) = 1, and the Pearson Chi-Square is still 16.667. This p-value is less than 0.05, i.e. p < 0.05.

Sample                             Number < 5    Number = 5    Number > 5    P-Value
Ranking of pain before treatmen    2             0             6             0.289
Ranking of pain after 1 week of    3             3             2             1.000
C3                                 0             1             0             1.000

According to the results, p-value for C1 (Ranking of pain before treatment) is 0.289. This calculated p-value is more than the alpha level of 0.05, so we cannot reject the null hypothesis and say that the results are not statistically significant. Even if we calculate the median value for “Ranking of pain before treatment”, we get 7, and with this median value, the p-value for C2 is 0.1250 that is again more than 0.05, so we cannot reject the null hypothesis and say that the results are not statistically significant.

Wilcoxon Signed Rank Test

It is a nonparametric test.

Testing Assumptions:
• Two variables in which one is dependent variable that is measured on an ordinal or continuous level, and the other is independent variable that has two groups, or pairs (as you can see in the example of treatment of pain).


Wilcoxon Signed Rank Test can also be performed on the same data (as mentioned in The Sign Test). We will rank the difference scores from 1 through 8 after ordering the absolute values of the difference scores. In case of the same absolute values of the difference scores, we will assign the mean rank. Then we will attach the positive or negative signs of the observed differences to the ranks. So, we can get the following table:

Difference, i.e. Before treatment – After treatment    Ordered absolute values of differences    Ranks or Mean Ranks    Signed ranks
1                                                      1                                         (1+2+3)/3 = 2          2
2                                                      -1                                        (1+2+3)/3 = 2          -2
-1                                                     1                                         (1+2+3)/3 = 2          2
3                                                      2                                         (4+5+6)/3 = 5          5
6                                                      -2                                        (4+5+6)/3 = 5          -5
1                                                      2                                         (4+5+6)/3 = 5          5
2                                                      3                                         7                      7
-2                                                     6                                         8                      8

We will work with the same hypotheses as in The Sign Test. However, capital W is the test statistic for the Wilcoxon Signed Rank Test. W is the smaller of the sum of the positive ranks, which could be represented by W+, and the sum of the negative ranks, which could be represented by W-. The null hypothesis is not rejected if W+ is similar to W-, whereas the null hypothesis is rejected if one of them is much larger in value than the other. From the data, we have found that W+ = 29 and W- = 7. As a check, the sum of the ranks must always be equal to n(n+1)/2. So, 8(8+1)/2 = 72/2 = 36, which is equal to 29 + 7 = 36. Our test statistic is 7, i.e. W = 7, which is the smaller value in 29 and 7. Now we will check the table of critical values of W by considering the sample size, i.e. n = 8, and level of significance of 5%, i.e. α = 0.05. If the observed value, i.e. 7, is less than or equal to the critical value, null hypothesis is rejected, whereas if the observed value is greater than the critical value, null hypothesis is not rejected. In the table, the critical value of W in two-tailed test is 4, which is smaller than 7. This is showing that the null hypothesis cannot be rejected.

In short, in Wilcoxon Signed Rank Test, the test statistic is W. The null hypothesis is that the median difference is zero or W+ is equal to W-. W is equal to the smaller value in W+ and W-. In this case, calculated value of W is 7, which is more than the critical value of 4. So, the null hypothesis cannot be rejected; thereby showing that the data do not provide sufficient evidence that the new treatment is working.

Using Minitab

In order to work on Wilcoxon Signed Rank Test, we can use the following data as an example:

Participants    Ranking of pain before treatment    Ranking of pain after 1 week of treatment
1               9                                   8
2               7                                   5
3               4                                   5
4               7                                   4
5               8                                   2
6               8                                   7
7               6                                   4
8               3                                   5


Enter the values in the columns. In this case, we enter the values of “Ranking of pain before treatment” in the first column, i.e. C1, and “Ranking of pain after 1 week of treatment” in the second column, i.e. C2. Now go to “Calc” and click “Calculator”. In the box with “Store result in variable:” enter C3 (i.e. third column). In the box under “Expression:” enter C1-C2, and click “OK.” This generates another column, i.e. C3 as shown below:

Ranking of pain before treatment    Ranking of pain after 1 week of treatment    C3
9                                   8                                            1
7                                   5                                            2
4                                   5                                            -1
7                                   4                                            3
8                                   2                                            6
8                                   7                                            1
6                                   4                                            2
3                                   5                                            -2

Now go to “Stat”, go to “Nonparametrics”, and click “1-Sample Wilcoxon ” Move the third column, i.e. C3, from the left box to the box under “Variables:” Select “Test Median:” and enter “0.0” in the box. Select “not equal” from the drop down menu with “Alternative:” Click “OK.”


This gives the results in which p-value is 0.141.

Test

Null hypothesis           H₀: η = 0
Alternative hypothesis    H₁: η ≠ 0

Sample    N for Test    Wilcoxon Statistic    P-Value
C3        8             29.00                 0.141

This p-value is more than 0.05, so the results are statistically not significant, and null hypothesis could not be rejected.
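The Wilcoxon Statistic of 29.00 in the Minitab output corresponds to the sum of the positive ranks (W+) from the hand calculation, while the hand calculation uses the smaller sum (W- = 7) as W. The ranking step can be sketched in Python as follows. This is a minimal illustration of the procedure described in the text (drop zero differences, rank the absolute differences, give tied values their mean rank, re-attach the signs); the function name `signed_rank_sums` is hypothetical, and the sketch does not compute a p-value.

```python
before = [9, 7, 4, 7, 8, 8, 6, 3]   # ranking of pain before treatment
after = [8, 5, 5, 4, 2, 7, 4, 5]    # ranking of pain after 1 week of treatment


def signed_rank_sums(x, y):
    """Return (W+, W-) for paired samples x and y."""
    # Differences (before - after); zero differences carry no sign and are dropped
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ordered = sorted(diffs, key=abs)

    # Assign ranks 1..n to the ordered absolute differences, using mean ranks for ties
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        mean_rank = (i + 1 + j) / 2   # mean of the ranks i+1, ..., j
        for k in range(i, j):
            ranks[k] = mean_rank
        i = j

    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


w_plus, w_minus = signed_rank_sums(before, after)
n = len(before)
print(w_plus, w_minus, n * (n + 1) / 2)
```

The last printed number is the check n(n+1)/2, which should equal W+ plus W- when no difference is zero.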

The Kruskal-Wallis Test

It is a nonparametric test.

Testing Assumptions:
• Two variables in which one is dependent variable that is measured on an ordinal or continuous level, and the other is independent variable that has two or more categorical groups.
• Observations are independent of each other, i.e. they are not related to each other.


Suppose, some people are feeling pain due to some disease and we want to know the effect of different concentrations of a new drug on the treatment of pain. We will get 12 participants (samples) from those people and check the efficacy of the new drug and its different concentrations in reducing the pain. The concentrations of the new drug include 5%, 10% and 15% of drug in the solution, and the response of the people will be taken in the form of ranking of pain on a scale from 1 to 12 in which 1 shows least pain and 12 shows highest level of pain. Suppose, we give 5% solution to 3 people (i.e. n1), 10% solution to 5 people (i.e. n2), and 15% solution to 4 people (i.e. n3). In this situation, we will perform Kruskal-Wallis Test as the sample sizes are small and they are not equal, i.e. n1 = 3, n2 = 5, and n3 = 4. Moreover, they are not normally distributed. Kruskal-Wallis Test is useful as it helps in comparing the outcomes of more than two independent groups. Our null hypothesis is that the population medians of all three groups are equal, whereas alternative hypothesis is that the population medians of the groups are not equal at 5% level of significance. Suppose, following responses are obtained:

5% solution    10% solution    15% solution
7              6               5
6              5               2
9              7               3
               8               4
               5

This table is showing that 15% solution of the drug is more helpful in reducing pain as compared to 5% solution. However, it is important to check whether this observed difference is statistically significant or not. For Kruskal Wallis Test, we have to order the data of 12 subjects from smallest value to largest value while keeping the track of group assignments in the total sample.


First we will check whether the total value of ranks, i.e. 29+38+11=78, is equal to n(n+1)/2 or not. So, n(n+1)/2 = 12(13)/2 = 78. So, these are equal. The test statistic for the Kruskal Wallis test is represented by H and can be calculated by using the equation,

H = [12 / (n(n+1))] × (R1²/n1 + R2²/n2 + R3²/n3) − 3(n+1)

Where n is the total number of subjects or samples, i.e. 12; R1 is the sum of the ranks in the first group, i.e. 29; R2 is the sum of the ranks in the second group, i.e. 38; R3 is the sum of the ranks in the third group, i.e. 11; n1 is the sample size of the first group, i.e. 3; n2 is the sample size of the second group, i.e. 5, and n3 is the sample size of the third group, i.e. 4.

Substituting these values gives H = [12 / (12 × 13)] × (29²/3 + 38²/5 + 11²/4) − 3(13) ≈ 7.11, i.e. approximately 7.

Now we will check whether this test statistic, i.e. H=7 is in favor of null hypothesis or it rejects the null hypothesis. So, we will check it by considering the critical value of H. If this value is more than or equal to the critical value, null hypothesis will be rejected, whereas if this value is less than the critical value null hypothesis will not be rejected. From the table, the critical value with sample sizes of 3, 5, and 4, and α=0.05 is 5.656. Our observed H value is 7 and it is greater than the critical value, so we can reject null hypothesis. It can also be said that the test is statistically significant as the null hypothesis is rejected. It is quite good news as we can use increased concentration of the drug to reduce the pain.
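The ranking and the calculation of H described above can be sketched in Python. This is a minimal illustration, assuming tied values receive the mean of the ranks they occupy and that no correction for ties is applied to H; the function names `mean_ranks` and `kruskal_wallis_h` are only illustrative.

```python
# Ranking of pain for the three concentrations of the drug
groups = {
    "5% solution": [7, 6, 9],
    "10% solution": [6, 5, 7, 8, 5],
    "15% solution": [5, 2, 3, 4],
}


def mean_ranks(values):
    """Map each distinct value to its rank, using the mean rank for tied values."""
    positions = {}
    for rank, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(rank)
    return {value: sum(ranks) / len(ranks) for value, ranks in positions.items()}


def kruskal_wallis_h(groups):
    """Return (rank sums per group, H not adjusted for ties)."""
    pooled = [v for values in groups.values() for v in values]
    n = len(pooled)
    rank_of = mean_ranks(pooled)

    rank_sums = {g: sum(rank_of[v] for v in values) for g, values in groups.items()}
    h = (12 / (n * (n + 1))
         * sum(r ** 2 / len(groups[g]) for g, r in rank_sums.items())
         - 3 * (n + 1))
    return rank_sums, h


rank_sums, h = kruskal_wallis_h(groups)
print(rank_sums)
print(round(h, 2))
```

The rank sums should come out as 29, 38, and 11, and the unrounded H is about 7.11, which the text rounds to 7 before comparing it with the critical value of 5.656.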


In short, in Kruskal Wallis Test, the test statistic is H. The null hypothesis is that the medians of all the populations are equal. Our calculated value of H is 7, which is more than the critical value of 5.656. In this case, the null hypothesis is rejected. It cannot be rejected only when calculated H value is less than the critical value. Using Minitab

In order to work on Kruskal Wallis Test, we can use the following data as an example:

5% solution    10% solution    15% solution
7              6               5
6              5               2
9              7               3
               8               4
               5

These are three groups. So, we have to assign them numbers, i.e. the group of 5% solution is considered as “1”, the group of 10% solution is considered as “2”, and the group of 15% solution is considered as “3”. Then we place these assigned numbers in the first column according to the number of results. In this case, we place number “1” in the first column in first three rows; number “2” in the next five rows, i.e. from 4th to 8th row, and number “3” in the next four rows, i.e. from 9th to 12th row. In the next column, we place the results obtained according to the assigned numbers (to groups). So, in the second column in the first three rows, we put 7, 6, and 9; in the next five rows, we put 6, 5, 7, 8, and 5, and in the next four rows, we put 5, 2, 3, and 4. We get the following table:

1    7
1    6
1    9
2    6
2    5
2    7
2    8
2    5
3    5
3    2
3    3
3    4

Now go to “Stat”, go to “Nonparametrics”, and click “Kruskal-Wallis”. In the box with “Response:” move the column containing responses, i.e. C2 in this case, and in the box with “Factor:” move the column containing groups, i.e. C1 in this case. Click “OK.” This gives the Kruskal-Wallis H, which is equal to 7.26 in this case, and p-value, which is 0.027 in this case.

Test

Null hypothesis           H₀: All medians are equal
Alternative hypothesis    H₁: At least one median is different

Method                   DF    H-Value    P-Value
Not adjusted for ties    2     7.11       0.029
Adjusted for ties        2     7.26       0.027

The chi-square approximation may not be accurate when some sample sizes are less than 5. This p-value is less than 0.05, so the results are statistically significant.

Degrees of Freedom

The number of values or scores in a data set (or distribution) that are free to vary while considering the final calculation.


The Friedman Test

It is a nonparametric test that is considered as an alternative to Two Way ANOVA.

Testing Assumptions:
• Data, especially related to dependent variable, has to be measured at the continuous or ordinal level.
• Data is related to a single group measured at a minimum of three different occasions.
• Random sampling strategy for the group from the selected population.


Suppose, there is a treatment for pain (in some disease) that works equally on pain-related outcomes after the use of 15% solution of that treatment in all age groups. We give the new treatment to four different groups and get their responses. Further suppose that the four groups are A, B, C, and D. Group A has 6 participants having 15 years of age. Group B has 6 participants having 25 years of age. Group C has 6 participants having 35 years of age. Group D has 6 participants having 45 years of age. This difference of years is helpful in knowing the differences of the new treatment on different ages of samples. In this case, Friedman test can help as it is used to compare three or more groups, and all the groups have the same sample size. Our null hypothesis is that median rankings of the four groups are equal, while the alternative hypothesis is that median ranking of at least one of the groups is different from the median ranking of at least one of the other groups at α = 0.05. After giving the treatment to the participants (blocks of raters), suppose we get the following results:

Now, we need to convert the data to ranks within blocks.


These Rank Totals are showing that there are differences in the rankings of pain of different age groups. So, we have to test the statistical significance of these results. First, we will check the rankings in the Friedman Test, so the sum of the rank totals should be equal to

r c (c + 1) / 2

Where r = number of participants in every group = 6
c = number of groups = 4
So, we have 6 × 4 × (4 + 1) / 2 = 60.

For the Friedman test, we have the following equation,

F = [12 / (r c (c + 1))] × (R1² + R2² + ... + Rc²) − 3 r (c + 1)

where Rj is the rank total of group j.

Now, we have to check whether the calculated value is greater than, equal to, or smaller than the critical value of the chi-square distribution with c -1 = 3 degrees of freedom and α = 0.05. Critical value in the table is 7.815, and our calculated value is 13.35, which is greater than the critical value. Therefore, we can reject the null hypothesis, and there are significant differences in the median rankings of participants from different age groups. In short, in Friedman test, the test statistic is F. The null hypothesis is that the median rankings of all the groups are equal. Our calculated value of F is 13.35, which is more than the critical value of 7.815. In this case, we can reject the null hypothesis. It cannot be rejected only when calculated F value is less than the critical value. Using Minitab

In order to work on the Friedman Test, we can work on the following data:


Enter the data in three columns, namely “Groups”, “Raters”, and “Responses”. We assign the numbers to Groups. So, in the column of Groups, i.e. C1 in this case, enter 1 (for A) in first six rows; 2 (for B) in the next six rows; 3 (for C) in the next six rows, and 4 (for D) in the next six rows. In the column of Raters, i.e. C2 in this case, enter 1, 2, 3, 4, 5, 6 in first six rows, and repeat these numbers in the next six rows. Overall, repeat this four times, i.e. up to the row number 24. In the column of Responses, i.e. C3 in this case, enter the respective responses. We get the following table:

Groups    Raters    Responses
1         1         8
1         2         9
1         3         7
1         4         6
1         5         9
1         6         8
2         1         7
2         2         8
2         3         5
2         4         7
2         5         6
2         6         7
3         1         6
3         2         7
3         3         6
3         4         5
3         5         4
3         6         8
4         1         5
4         2         6
4         3         4
4         4         3
4         5         4
4         6         5


Now go to “Stat”, go to “Nonparametrics”, and click “Friedman”. Move the columns to their respective boxes on the right side. In this case, we move C3 to “Response:”, C1 to “Treatment:” and C2 to “Blocks.” Click “OK.” This gives the results in which Chi-Square (Not adjusted for ties) is 13.35, and p-value is 0.004.

Test

Null hypothesis           H₀: All treatment effects are zero
Alternative hypothesis    H₁: Not all treatment effects are zero

Method                   DF    Chi-Square    P-Value
Not adjusted for ties    3     13.35         0.004
Adjusted for ties        3     13.81         0.003

This p-value is less than 0.05, so the results are statistically significant.
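The Friedman calculation for the data in the Groups, Raters, and Responses table can be sketched in Python. This is a minimal illustration, assuming the responses are ranked within each rater (block), tied responses receive their mean rank, and no correction for ties is applied; the function name `friedman_statistic` is hypothetical.

```python
# Responses of raters 1 to 6 for each group (1 = A, 2 = B, 3 = C, 4 = D)
responses = {
    1: [8, 9, 7, 6, 9, 8],
    2: [7, 8, 5, 7, 6, 7],
    3: [6, 7, 6, 5, 4, 8],
    4: [5, 6, 4, 3, 4, 5],
}


def friedman_statistic(responses):
    """Return (rank totals per group, Friedman statistic not adjusted for ties)."""
    groups = list(responses)
    c = len(groups)                       # number of groups
    r = len(responses[groups[0]])         # number of raters (blocks)

    rank_totals = {g: 0.0 for g in groups}
    for block in range(r):
        row = [responses[g][block] for g in groups]
        for g in groups:
            value = responses[g][block]
            smaller = sum(x < value for x in row)
            equal = sum(x == value for x in row)      # includes the value itself
            rank_totals[g] += smaller + (equal + 1) / 2   # mean rank within the block

    statistic = (12 / (r * c * (c + 1))
                 * sum(total ** 2 for total in rank_totals.values())
                 - 3 * r * (c + 1))
    return rank_totals, statistic


rank_totals, statistic = friedman_statistic(responses)
print(rank_totals)
print(round(statistic, 2))
```

The rank totals add up to r c (c + 1) / 2 = 60, and the statistic should agree with the “Not adjusted for ties” Chi-Square of 13.35 in the Minitab output.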

Tests for Normally distributed data Unpaired “t” test, Paired “t” test, One Way ANOVA, and Two Way ANOVA can be used to test Normally distributed data.


Unpaired “t” test

Testing Assumptions:
• A dependent continuous variable and an independent categorical variable with two groups, or levels.
• The dependent variable follows a normal distribution or at least approximately normal distribution.
• The variances (or standard deviations) of the two groups are equal, i.e. homogeneity of variance.

N.B.: If these assumptions are not followed, you may consider using non-parametric tests.

Suppose we are working on a disease and the pain associated with the disease, and we have developed two groups of 8 participants each. One group receives the new treatment, while the other group receives placebo. After receiving the treatments, those participants are assessed for the quality-of-life and pain through the Quality-of-Life Scale and overall ranking of pain respectively. After getting the scores for quality-of-life and pain, those scores are joined and an overall score out of 10 is developed. Higher scores show improvement, while lower scores show no effect of treatment.

Overall scores out of 10
For Treatment Group    For Placebo Group
6                      4
3                      4
4                      1
5                      3
5                      5
9                      6
7                      2
8                      7

In this case, we will use student’s t-test as the sample size is less than 30, i.e. overall sample size is equal to 16. Moreover, we will use unpaired or independent t-test as the comparison has been made on the outcomes in two different groups. Now we have to perform some calculations on these results as follows:

t = (x̄1 − x̄2) / √( s² (1/n1 + 1/n2) ), with s² = ((n1 − 1) s1² + (n2 − 1) s2²) / (n1 + n2 − 2)

Where s² is the pooled sample variance, x̄1 and x̄2 are the sample means, s1² and s2² are the sample variances, and n1 and n2 are the sample sizes of the two groups. Proceeding with this data, we have


Now, we have to look at the t table with two tailed test as we are not sure about the efficacy of the new treatment on human beings. We will check the table at degrees of freedom = number of samples – 2 = 14 and at α = 0.05.

Table 3: t table (Source: Anonymous)

In the table, the critical value is 2.145, but our calculated value is 1.875 that is less than the critical value of the table. This is showing that the differences of the two groups are not statistically significant. Using Minitab

In order to work on unpaired t-test, we can use the following data as an example:

Overall scores out of 10
For Treatment Group    For Placebo Group
6                      4
3                      4
4                      1
5                      3
5                      5
9                      6
7                      2
8                      7

Put these values in the first two columns, i.e. C1 showing “For Treatment Group”, and C2 showing “For Placebo Group.”

Normality test

First, we have to work on the testing assumptions. In order to know the normality of the data, we can use Kolmogorov-Smirnov Test (KS-Test). For this test, open a new Worksheet, and combine the values of both groups in a single column, i.e. C1, as the groups (two groups) are independent variables, and the values are dependent variables. Now go to “Stat” tab above, go to “Basic Statistics”, and click “Normality Test”. Move C1 to the box with “Variable:” Select “Kolmogorov-Smirnov” under “Tests for Normality”, and click “OK.” The following graph and results are obtained:


The data points are comparatively closer to the fitted normal distribution line. The p-value (more than 0.150) is more than 0.05. So, the null hypothesis of normality could not be rejected, and the data can be treated as following a normal distribution.

Homogeneity of variance

In order to test the homogeneity of variance, put these values in the first two columns, i.e. C1 showing “For Treatment Group”, and C2 showing “For Placebo Group.” Now go to “Stat” tab above, go to “ANOVA”, and click “Test for Equal Variances”. From the drop down menu, select “Response data are in a separate column for each factor level”. In the box under “Responses:” move both the columns, i.e. C1 and C2. Click “OK.” The following results are obtained:

Test

Method                  Statistic    P-Value
Multiple comparisons    0.00         0.961
Levene                  0.05         0.833

The analysis of homogeneity of variances shows that the findings meet the assumption of homogeneity of variance as p-value is more than 0.05, and there is no statistically significant difference between the variances of the groups. So, now we can conduct “t” test after looking at other assumptions.

“t” test

Now go to “Stat”, go to “Basic Statistics”, and click “2-Sample t”. Select “Each sample is in its own column” in the appeared window (Two-Sample t for the Mean), move the first column, i.e. C1 in this case, into the box with “Sample 1:” and move the second column, i.e. C2 in this case, into the box with “Sample 2:”. Click “OK.” This gives the t-test results.

Test

Null hypothesis           H₀: μ₁ - µ₂ = 0
Alternative hypothesis    H₁: μ₁ - µ₂ ≠ 0

T-Value    DF    P-Value
1.86       13    0.086

In this case, we get T-Value as 1.86 and P-Value as 0.086, which is more than 0.05 that is also showing that the results are not statistically significant.

Paired “t” test

Testing Assumptions:

• There is a dependent continuous variable and an independent variable with two groups, or pairs.
• The observations or values are independent of each other.
• The dependent variable shows approximate normal distribution.
• There must not be any significant outlier in the dependent variable.

N.B.: If these assumptions are not fulfilled, you may consider using non-parametric tests.

Suppose we are working on the same treatment for the disease as mentioned above with 20 other participants. This time, we are checking the efficacy of the new treatment on these 20 participants (n = 20) before the start of treatment and one month after starting the treatment. Suppose we have obtained the following data, showing the combined quality-of-life and pain score out of 10. Higher scores show improvement, while lower scores show no effect of treatment.

Overall score out of 10

Participants   Before treatment   After treatment
1              3                  8
2              1                  6
3              5                  7
4              6                  9
5              1                  10
6              1                  7
7              1                  5
8              3                  6
9              0                  4
10             5                  8
11             5                  7
12             5                  9
13             3                  6
14             2                  8
15             6                  6
16             2                  7
17             2                  9
18             4                  8
19             1                  6
20             7                  5

From this data, we develop another table:

Overall score out of 10

Participants   Before treatment   After treatment   Difference
1              3                  8                 5
2              1                  6                 5
3              5                  7                 2
4              6                  9                 3
5              1                  10                9
6              1                  7                 6
7              1                  5                 4
8              3                  6                 3
9              0                  4                 4
10             5                  8                 3
11             5                  7                 2
12             5                  9                 4
13             3                  6                 3
14             2                  8                 6
15             6                  6                 0
16             2                  7                 5
17             2                  9                 7
18             4                  8                 4
19             1                  6                 5
20             7                  5                 -2

After calculating the differences, it can be seen that they follow an approximately normal distribution and there are almost no extreme outliers. Therefore, the paired t-test can be performed on the data. The mean difference calculated from this data is 3.9, and the sample standard deviation of the differences (with n – 1 in the denominator) is 2.404. Therefore, the standard error of the mean difference is 0.5375 (i.e. standard deviation / square root of the number of pairs = 2.404 / square root of 20). The t-statistic is t = mean difference / standard error of the mean difference = 3.9 / 0.5375 = 7.26. Now we look at the t table for a two-tailed test. For the table, degrees of freedom = number of pairs – 1 = 19, and α = 0.05. The critical value in the t table is 2.093 and our calculated value is 7.26, i.e. larger than the critical value. So, there is strong evidence that the scores changed after the treatment, i.e. that the new treatment works.

Using Minitab

In order to work on paired t-test, we can use the following data,

Overall score out of 10

Participants   Before treatment   After treatment
1              3                  8
2              1                  6
3              5                  7
4              6                  9
5              1                  10
6              1                  7
7              1                  5
8              3                  6
9              0                  4
10             5                  8
11             5                  7
12             5                  9
13             3                  6
14             2                  8
15             6                  6
16             2                  7
17             2                  9
18             4                  8
19             1                  6
20             7                  5

Enter the values in the columns. In this case, values of “Before treatment” are entered in the first column, i.e. C1, and the values of “After treatment” are entered in the second column, i.e. C2.

Normality test

First, we have to check the test assumptions. To check the normality of the data, we can use the Kolmogorov-Smirnov Test (KS-Test). For this test, open a new Worksheet, and combine the values of the pairs or groups in a single column, i.e. C1, as the groups (two groups) or pairs are the independent variable, and the values are the dependent variable. Now go to the “Stat” tab above, go to “Basic Statistics”, and click “Normality Test”. Move C1 to the box with “Variable:”. Select “Kolmogorov-Smirnov” under “Tests for Normality”, and click “OK.” The following graph and results are obtained:


The data points lie comparatively close to the fitted normal distribution line. The p-value (0.065) is more than 0.05. So, the null hypothesis could not be rejected, and the data are consistent with a normal distribution.

Outlier Test

Utilizing the same Worksheet, go to the “Stat” tab above, go to “Basic Statistics”, and go to “Outlier Test” in the 6th section. Move the column, i.e. C1, to the box under “Variables:” by selecting it and pressing the button “Select”. Now click “OK.” The following results and graph appear.

Grubbs' Test

Variable   N    Mean    StDev   Min     Max      G      P
C1         40   5.100   2.687   0.000   10.000   1.90   1.000


* NOTE * No outlier at the 5% level of significance

These results show that there is no significant outlier. So, the paired “t” test can be conducted after looking at the other assumptions.

“t” test

Now go to “Stat”, go to “Basic Statistics”, and click “Paired t”. Enter the first column, i.e. C1, in “Sample 2:” and C2 in “Sample 1:”, so that the difference is calculated as “After treatment” minus “Before treatment”. Click “OK.” We get the results showing a T-value (t-stat) of 7.26 and a p-value of 0.000.

Test

Null hypothesis          H₀: μ_difference = 0
Alternative hypothesis   H₁: μ_difference ≠ 0

T-Value   P-Value
7.26      0.000

This p-value is less than 0.05, so we can say that the results are statistically significant. If we look at the mean values, we can find that the mean value increases “After treatment”.

Sample             N    Mean    StDev   SE Mean
After treatment    20   7.050   1.572   0.352
Before treatment   20   3.150   2.084   0.466
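The paired t-test above can also be cross-checked outside Minitab. The following is a minimal Python sketch using only the standard library; the function name `paired_t` is an illustrative choice and not part of Minitab. It uses the sample standard deviation (n – 1 in the denominator), which is the version that agrees with the T-Value of 7.26 reported by Minitab.

```python
import math
import statistics

# Scores from the paired example (participants 1 to 20).
before = [3, 1, 5, 6, 1, 1, 1, 3, 0, 5, 5, 5, 3, 2, 6, 2, 2, 4, 1, 7]
after = [8, 6, 7, 9, 10, 7, 5, 6, 4, 8, 7, 9, 6, 8, 6, 7, 9, 8, 6, 5]


def paired_t(first, second):
    """Return (mean difference, sample SD, standard error, t) for second - first."""
    diffs = [s - f for f, s in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)   # sample SD, n - 1 in the denominator
    se = sd / math.sqrt(n)         # standard error of the mean difference
    return mean_diff, sd, se, mean_diff / se


mean_diff, sd, se, t_stat = paired_t(before, after)
print(f"mean difference = {mean_diff:.2f}")
print(f"sample SD = {sd:.3f}, SE = {se:.4f}, t = {t_stat:.2f}")

# Two-tailed critical value for df = 19 and alpha = 0.05, taken from the t table.
T_CRITICAL = 2.093
print("significant" if abs(t_stat) > T_CRITICAL else "not significant")
```

The critical value is hard-coded from the t table because the standard library has no t distribution; a statistics package would be needed to obtain the p-value itself.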

Analysis of Variance (ANOVA)

[Figure: Sum of Squares]

[Figure: Residual Sum of Squares]

One way ANOVA

Testing Assumptions:

• There is a continuous dependent variable and an independent variable with two or more categorical groups.
• The data must represent independent observations.
• There must not be any significant outliers.
• The dependent variable must be normally distributed.
• The dependent variable must have the variance equal in each population, i.e. homogeneity of variance.

N.B.: If these assumptions are not fulfilled, you may consider using non-parametric tests.

Suppose we have worked on the same disease and the treatment as mentioned above, but now we have divided the participants of the study into four different groups receiving three different concentrations of the treatment or placebo for three months. Among those four groups, one group receives 25% solution of the new treatment; the second group receives 15% solution of the new treatment; the third group receives 5% solution of the new treatment, and the fourth group receives placebo and is considered the control group. Every group has 5 samples or participants. Those groups give scores for quality-of-life and pain, and those scores are combined into an overall score out of 10.

Group A receives   Group B receives 5%     Group C receives 15%    Group D receives 25%
placebo            solution of treatment   solution of treatment   solution of treatment
3                  4                       3                       9
3                  6                       5                       2
5                  5                       4                       7
1                  3                       6                       8
4                  4                       2                       4

These numbers are following approximately normal distribution as shown by the following graph:


We will use the Analysis of Variance (ANOVA) procedure, i.e. One way ANOVA, in this case as we have more than two groups of data. Our null hypothesis is that the means of all the groups are equal, and the alternative hypothesis is that the means of the groups are not all equal, at α=0.05. The test statistic is the F statistic for ANOVA, where F is equal to Mean Squares Between Treatments divided by Mean Squares Error (or Residual). The appropriate critical value is taken from the table of probabilities for the F distribution. The first degrees of freedom (df1) is equal to the total number of groups (k) minus one, i.e. df1 = k – 1 = 4 – 1 = 3, and the second degrees of freedom (df2) is equal to the total number of samples in all groups (N) minus the number of groups (k), i.e. df2 = N – k = 20 – 4 = 16. With these degrees of freedom and at α=0.05, the critical value is 3.24. Therefore, we reject the null hypothesis if the observed F value is greater than or equal to 3.24.

Table 4: F distribution table at alpha level of 0.05 (Source: Anonymous)

Now we will calculate the F statistic. For this calculation, it is important to take the sample mean for each group and then the overall mean on the basis of the total sample.

                        Group A            Group B       Group C        Group D
                        received placebo   received 5%   received 15%   received 25%
                                           solution of   solution of    solution of
                                           treatment     treatment      treatment
Number of samples (n)   5                  5             5              5
Group mean              3.2                4.4           4              6

If we consider all N=20 observations, the overall mean is equal to 4.4, i.e. 88/20 = 4.4. Now Sums of Squares Between Treatments (SSB) is calculated by the following formula,

SSB = number of samples in Group A × (Group A mean – Overall mean)² + number of samples in Group B × (Group B mean – Overall mean)² + number of samples in Group C × (Group C mean – Overall mean)² + number of samples in Group D × (Group D mean – Overall mean)²

So,

SSB = 5 (3.2 – 4.4)² + 5 (4.4 – 4.4)² + 5 (4 – 4.4)² + 5 (6 – 4.4)²
SSB = 5 (1.44) + 5 (0) + 5 (0.16) + 5 (2.56)
SSB = 7.2 + 0 + 0.8 + 12.8
SSB = 20.8

Now, we will calculate Sums of Squares for Errors (or Residuals) [SSE]. In order to calculate SSE, squared differences are required between each observation and its group mean, i.e. SSE = Total value of (Score – 3.2)² of Group A + Total value of (Score – 4.4)² of Group B + Total value of (Score – 4.0)² of Group C + Total value of (Score – 6.0)² of Group D. So, it will be calculated in parts.

For the samples in Group A,

Group A   (Score – 3.2)   (Score – 3.2)²
3         -0.2            0.04
3         -0.2            0.04
5         1.8             3.24
1         -2.2            4.84
4         0.8             0.64
Total     0               8.8

For the samples in Group B,

Group B   (Score – 4.4)   (Score – 4.4)²
4         -0.4            0.16
6         1.6             2.56
5         0.6             0.36
3         -1.4            1.96
4         -0.4            0.16
Total     0               5.2

For the samples in Group C,

Group C   (Score – 4)   (Score – 4)²
3         -1            1
5         1             1
4         0             0
6         2             4
2         -2            4
Total     0             10

For the samples in Group D,

Group D   (Score – 6)   (Score – 6)²
9         3             9
2         -4            16
7         1             1
8         2             4
4         -2            4
Total     0             34

Now SSE = Total value of (Score – 3.2)² of Group A + Total value of (Score – 4.4)² of Group B + Total value of (Score – 4.0)² of Group C + Total value of (Score – 6.0)² of Group D

SSE = 8.8 + 5.2 + 10 + 34
SSE = 58

Now, the ANOVA Table is as follows:

Source of Variation   Sums of Squares (SS)   Degrees of Freedom (df)   Mean Squares (MS)     F
Between Treatments    20.8 = SSB             4-1=3 = df1               20.8/3=6.93 = MSB     MSB / MSE = 1.91
Error (or Residual)   58 = SSE               20-4=16 = df2             58/16=3.63 = MSE
Total                 78.8                   20-1=19

From this calculated F value, i.e. 1.91, we conclude that we cannot reject the null hypothesis, as 1.91 is smaller than 3.24. We do not have statistically significant evidence at α=0.05 to show that there is a difference in the mean score among the four groups or treatment levels. In short, at the tested concentrations (5%, 15%, and 25% solution), the new treatment was not shown to be more effective than placebo.

Using Minitab

In order to work on one way ANOVA, we can use the following data:

Group A receives   Group B receives 5%     Group C receives 15%    Group D receives 25%
placebo            solution of treatment   solution of treatment   solution of treatment
3                  4                       3                       9
3                  6                       5                       2
5                  5                       4                       7
1                  3                       6                       8
4                  4                       2                       4

Enter these values in four columns, i.e. C1 for “Group A receives placebo”, C2 for “Group B receives 5% solution of treatment”, C3 for “Group C receives 15% solution of treatment”, and C4 for “Group D receives 25% solution of treatment”.

Normality test

First, we have to check the test assumptions. To check the normality of the data, we can use the Kolmogorov-Smirnov Test (KS-Test). For this test, open a new Worksheet, and combine the values of all the groups in a single column, i.e. C1, as the groups (four groups) are the independent variable, and the values are the dependent variable. Now go to the “Stat” tab above, go to “Basic Statistics”, and click “Normality Test”. Move C1 to the box with “Variable:”. Select “Kolmogorov-Smirnov” under “Tests for Normality”, and click “OK.” The following graph and results are obtained:


The data points lie comparatively close to the fitted normal distribution line. The p-value (0.094) is more than 0.05. So, the null hypothesis could not be rejected, and the data are consistent with a normal distribution.

Outlier Test

Utilizing the same Worksheet, go to the “Stat” tab above, go to “Basic Statistics”, and go to “Outlier Test” in the 6th section. Move the column, i.e. C1, to the box under “Variables:” by selecting it and pressing the button “Select”. Now click “OK.” The following results and graph appear.

Grubbs' Test

Variable   N    Mean    StDev   Min     Max     G      P
C1         20   4.400   2.037   1.000   9.000   2.26   0.317


* NOTE * No outlier at the 5% level of significance

These results show that there is no significant outlier.

Homogeneity of variance

In order to test the homogeneity of variance, enter these values in four columns, i.e. C1 for “Group A receives placebo”, C2 for “Group B receives 5% solution of treatment”, C3 for “Group C receives 15% solution of treatment”, and C4 for “Group D receives 25% solution of treatment”.

Now go to the “Stat” tab above, go to “ANOVA”, and click “Test for Equal Variances”. From the drop down menu, select “Response data are in a separate column for each factor level”. In the box under “Responses:” move all the four columns, i.e. C1, C2, C3, and C4. Click “OK.” The following results are obtained:

Method                 Test Statistic   P-Value
Multiple comparisons   —                0.243
Levene                 1.27             0.319

The analysis of homogeneity of variances shows that the data meet the assumption of homogeneity of variance, as the p-value is more than 0.05, i.e. there is no statistically significant difference between the variances of the groups. So, now we can conduct One way ANOVA after looking at the other assumptions.

One Way ANOVA

Now go to “Stat”, go to “ANOVA”, and click “One-Way”. Select “Response data are in a separate column for each factor level” from the drop down menu; move all the columns, i.e. C1, C2, C3, and C4, to the box under “Responses:”. Click “OK.” This gives us the results in which the F-value is equal to 1.91 and the p-value is equal to 0.168.

Analysis of Variance

Source   DF   Adj SS   Adj MS   F-Value   P-Value
Factor   3    20.80    6.933    1.91      0.168
Error    16   58.00    3.625
Total    19   78.80

This p-value is more than 0.05 showing that the results are not statistically significant.
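The sums of squares and the F statistic of the one way ANOVA above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `one_way_anova` and the dictionary keys are illustrative choices, and the critical value 3.24 is the one read from the F table in the text rather than computed.

```python
from statistics import mean

# Scores of the four groups from the one way ANOVA example.
groups = {
    "A (placebo)": [3, 3, 5, 1, 4],
    "B (5%)": [4, 6, 5, 3, 4],
    "C (15%)": [3, 5, 4, 6, 2],
    "D (25%)": [9, 2, 7, 8, 4],
}


def one_way_anova(samples):
    """Return (SSB, SSE, df1, df2, F) for a list of groups of observations."""
    all_values = [x for group in samples for x in group]
    grand_mean = mean(all_values)
    # Between-treatments sum of squares: n_j * (group mean - overall mean)^2
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in samples)
    # Error sum of squares: (score - its own group mean)^2
    sse = sum((x - mean(g)) ** 2 for g in samples for x in g)
    df1 = len(samples) - 1
    df2 = len(all_values) - len(samples)
    f_value = (ssb / df1) / (sse / df2)
    return ssb, sse, df1, df2, f_value


ssb, sse, df1, df2, f_value = one_way_anova(list(groups.values()))
print(f"SSB = {ssb:.2f}, SSE = {sse:.2f}, df = ({df1}, {df2}), F = {f_value:.2f}")

F_CRITICAL = 3.24  # F table value for df = (3, 16) at alpha = 0.05
print("reject H0" if f_value >= F_CRITICAL else "cannot reject H0")
```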

Two way ANOVA

The two-factor ANOVA procedure considers two independent variables at the same time. For example, we can consider different ages and different concentrations of treatments at the same time, as opposed to only different concentrations as shown above in One way ANOVA.

Testing Assumptions:

• There is a continuous dependent variable and two independent variables with two or more categorical groups.
• Observations are independent of each other.
• Samples represent normal distribution.
• Samples show homogeneity of variance.
• There are no significant outliers in the data.

N.B.: If these assumptions are not fulfilled, you may consider using non-parametric tests.

Suppose we have participants belonging to two different age groups, i.e. Group A having 15 samples (participants) in the age range of 10-25 years, and Group B having 15 samples (participants) in the age range of 26-40 years. The 15 participants in each group were randomly assigned to three different solutions of the new treatment (5 participants each), i.e. Treatment X received 5% of the solution, Treatment Y received 15% of the solution, and Treatment Z received 25% of the solution. Suppose, we have obtained the following data:

Treatment                         Group A of participants in the   Group B of participants in the
                                  age range of 10-25 years         age range of 26-40 years
X receiving 5% of the solution    3                                4
                                  4                                5
                                  4                                5
                                  5                                6
                                  4                                7
Y receiving 15% of the solution   5                                6
                                  6                                5
                                  5                                6
                                  8                                7
                                  6                                7
Z receiving 25% of the solution   7                                7
                                  7                                9
                                  8                                9
                                  6                                7
                                  6                                8

Here, we will use two way ANOVA as we have two factors, i.e. two age groups (Group A of participants in the age range of 10-25 years and Group B of participants in the age range of 26-40 years) and three treatments (X receiving 5% of the solution, Y receiving 15% of the solution, and Z receiving 25% of the solution). In order to perform two way ANOVA, we have the following ANOVA table, listed by source of variation:

Total number of groups
  Sums of Squares (SS): SSN
  Degrees of freedom (df): df1 = Total number of groups – 1 = 6-1 = 5
  Mean Squares (MS): SSN / df1 = MSN
  F: MSN/MSE

Treatment
  Sums of Squares (SS): SST
  Degrees of freedom (df): df2 = Total number of treatment groups – 1 = 3-1 = 2
  Mean Squares (MS): SST / df2 = MST
  F: MST/MSE

Groups
  Sums of Squares (SS): SSG
  Degrees of freedom (df): df3 = Total number of groups by age range – 1 = 2-1 = 1
  Mean Squares (MS): SSG / df3 = MSG
  F: MSG/MSE

Treatment versus Group interaction
  Sums of Squares (SS): SSTG = SSN – (SST + SSG)
  Degrees of freedom (df): df4 = df2 * df3 = 2*1 = 2
  Mean Squares (MS): SSTG / df4 = MSTG
  F: MSTG/MSE

Error (or Residual)
  Sums of Squares (SS): SSE
  Degrees of freedom (df): df5 = Total number of samples – (Total number of groups by age range * Total number of treatment groups) = 30 – (2*3) = 30-6 = 24
  Mean Squares (MS): SSE / df5 = MSE

Total

After thorough calculation, we can make another table with final values.

Source of Variation                  Sums of Squares (SS)   Degrees of freedom (df)   Mean Squares (MS)   F
Total number of groups               45                     df1 = 5                   45/5 = 9            9/0.95 = 9.47
Treatment                            36.5                   df2 = 2                   36.5 / 2 = 18.3     18.3/0.95 = 19.26
Groups                               6.5                    df3 = 1                   6.5 / 1 = 6.5       6.5/0.95 = 6.84
Treatment versus Group interaction   2                      df4 = 2                   2/2 = 1             1/0.95 = 1.05
Error (or Residual)                  22.8                   df5 = 24                  22.8 / 24 = 0.95
Total

In this table, there are four statistical tests. The first test is an overall test to check whether a difference exists between the 6 group means. The F statistic is 9.47, which is greater than the critical value of 2.62 at an alpha level of 0.05, so it is statistically significant. After looking at the significance of the overall test, it is important to check which factors may be responsible for the significance, i.e. treatment, group, or their interaction. The F statistic for the treatment is 19.26, which is greater than the critical value of 3.40, so it is significant. Similarly, the F statistic for the groups is 6.84, which is greater than the critical value of 4.26, so it is significant. However, the F statistic for the treatment versus group interaction is 1.05, which is lower than the critical value of 3.40, so it is non-significant. Now, we will look at the mean values of the different groups and treatments. So, we have the following table,

Treatment   Group A   Group B
X           4         5.4
Y           6         6.2
Z           6.8       8

Treatment Z appears to be the best treatment in different treatments and groups. Among all treatments and groups, Group B shows better results.


Using Minitab

In order to work on two-way ANOVA, we can use the following data as an example:

Treatment                         Group A of participants in the   Group B of participants in the
                                  age range of 10-25 years         age range of 26-40 years
X receiving 5% of the solution    3                                4
                                  4                                5
                                  4                                5
                                  5                                6
                                  4                                7
Y receiving 15% of the solution   5                                6
                                  6                                5
                                  5                                6
                                  8                                7
                                  6                                7
Z receiving 25% of the solution   7                                7
                                  7                                9
                                  8                                9
                                  6                                7
                                  6                                8


Normality test

First, we have to check the test assumptions. To check the normality of the data, we can use the Kolmogorov-Smirnov Test (KS-Test). For this test, open a new Worksheet, and combine the values of all the groups in a single column, i.e. C1, as the groups are the independent variables, and the values are the dependent variable. Now go to the “Stat” tab above, go to “Basic Statistics”, and click “Normality Test”. Move C1 to the box with “Variable:”. Select “Kolmogorov-Smirnov” under “Tests for Normality”, and click “OK.” The following graph and results are obtained:

The data points lie comparatively close to the fitted normal distribution line. The p-value (>0.150) is more than 0.05. So, the null hypothesis could not be rejected, and the data are consistent with a normal distribution.

Outlier Test

Utilizing the same Worksheet, go to the “Stat” tab above, go to “Basic Statistics”, and go to “Outlier Test” in the 6th section. Move the column, i.e. C1, to the box under “Variables:” by selecting it and pressing the button “Select”. Now click “OK.” The following results and graph appear.

Grubbs' Test

Variable   N    Mean    StDev   Min     Max     G      P
C1         30   6.067   1.530   3.000   9.000   2.00   1.000

* NOTE * No outlier at the 5% level of significance

These results show that there is no significant outlier.

Homogeneity of variance, normality, and independence of residuals


We have to assign numbers to the groups and treatments. So, we assign number “1” to “Group A of participants in the age range of 10-25 years” and “2” to “Group B of participants in the age range of 26-40 years.” Similarly, we assign number “1” to “X receiving 5% of the solution”, “2” to “Y receiving 15% of the solution”, and “3” to “Z receiving 25% of the solution.” After assigning the numbers, we enter these numbers and the outcomes according to the assigned numbers in Minitab. For example,

Group 1, Treatment 1, Outcome 3;
Group 1, Treatment 1, Outcome 4;
...
Group 2, Treatment 1, Outcome 4;
...
Group 2, Treatment 2, Outcome 6 etc.
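The layout described above has one row per observation, with the group number, the treatment number, and the outcome. As a minimal Python sketch of that reshaping (the names `outcomes` and `to_long_format` are illustrative and not part of Minitab), the two-way table can be turned into such rows as follows:

```python
# Outcomes keyed by (group number, treatment number), using the numbering
# assigned in the text: group 1 = A, 2 = B; treatment 1 = X, 2 = Y, 3 = Z.
outcomes = {
    (1, 1): [3, 4, 4, 5, 4],
    (1, 2): [5, 6, 5, 8, 6],
    (1, 3): [7, 7, 8, 6, 6],
    (2, 1): [4, 5, 5, 6, 7],
    (2, 2): [6, 5, 6, 7, 7],
    (2, 3): [7, 9, 9, 7, 8],
}


def to_long_format(cells):
    """Return one (group, treatment, outcome) row per observation."""
    rows = []
    for (group, treatment), values in sorted(cells.items()):
        for value in values:
            rows.append((group, treatment, value))
    return rows


rows = to_long_format(outcomes)
for group, treatment, outcome in rows[:3]:
    print(f"Group {group}, Treatment {treatment}, Outcome {outcome}")
print(f"{len(rows)} rows in total")
```

The three values of each row correspond to the three worksheet columns (Groups, Treatments, Outcomes).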

Enter the values of Groups in the first column, i.e. C1; Treatments in the second column, i.e. C2, and Outcomes in the third column, i.e. C3. Now go to “Stat”, go to “ANOVA”, go to “General Linear Model”, and click “Fit General Linear Model”. Move the columns to the respective boxes. In this case, C3 goes to “Responses:”, C2 goes to “Factors:”, and C1 goes to “Covariates:”. Now click “Graphs”, select “Four in one”, and click “OK.” Click “OK” again. The following graphs are obtained:


In these graphs, the “Normal Probability Plot” shows that the residuals follow an approximately straight line, thereby showing that the residuals are approximately normally distributed. The “Versus Fits” plot shows that none of the groups have substantially different variability (almost all of them are overlapping), and there is no significant outlier. This shows constant variance. The “Histogram” shows that none of the bars is far from the other bars, which indicates that there are no outliers. Furthermore, there is no long tail in one direction, which indicates that there is no skewness. The “Versus Order” plot shows that the residuals fall randomly about the centerline, which indicates that the residuals are independent of each other. These checks show that Two Way ANOVA can be conducted on the data. In fact, it has already been conducted (above the graphs in Minitab).

Two Way ANOVA

Analysis of Variance

Source          DF   Adj SS   Adj MS    F-Value   P-Value
Groups          1    6.533    6.5333    6.83      0.015
Treatments      2    36.467   18.2333   19.06     0.000
Error           26   24.867   0.9564
  Lack-of-Fit   2    2.067    1.0333    1.09      0.353
  Pure Error    24   22.800   0.9500
Total           29   67.867

In the obtained results, the p-values of groups (C1) and treatments (C2) are less than 0.05, i.e. they are showing statistically significant results. In order to look at the mean values of different groups and treatments, go to “Stat”, go to “Basic Statistics”, and go to “Display Descriptive Statistics”. In the box under “Variables:” move the column “C3 Outcomes” and in the box under “By variables (optional):” move the columns “C1 Groups” and “C2 Treatments”. Click “Statistics” and select only “Mean.” Click “OK.” Click “OK” again. This gives the following results:

Descriptive Statistics: Outcomes

Results for Groups = 1

Statistics

Variable   Treatments   Mean
Outcomes   1            4.000
           2            6.000
           3            6.800

Results for Groups = 2

Statistics

Variable   Treatments   Mean
Outcomes   1            5.400
           2            6.200
           3            8.000

Treatment # 3 (i.e. Z) appears to be the best treatment in different treatments and groups. Among all treatments and groups, Group # 2 (i.e. B) shows better results.
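The sums of squares in the two way ANOVA above can be checked with a short Python sketch. It uses only the standard library and assumes a balanced design (5 observations in every cell), as in this example; the helper names `pool` and `ss_of_means` are illustrative. Because it works with unrounded values, its F ratios differ slightly from the hand-calculated table, where the sums of squares were rounded first.

```python
from statistics import mean

# Outcomes for each (age group, treatment) cell of the example.
cells = {
    ("A", "X"): [3, 4, 4, 5, 4],
    ("A", "Y"): [5, 6, 5, 8, 6],
    ("A", "Z"): [7, 7, 8, 6, 6],
    ("B", "X"): [4, 5, 5, 6, 7],
    ("B", "Y"): [6, 5, 6, 7, 7],
    ("B", "Z"): [7, 9, 9, 7, 8],
}


def pool(data, index):
    """Pool the observations by one factor (0 = age group, 1 = treatment)."""
    pooled = {}
    for key, observations in data.items():
        pooled.setdefault(key[index], []).extend(observations)
    return pooled


def ss_of_means(grouped, grand_mean):
    """Sum of n * (group mean - grand mean)^2 over the groups."""
    return sum(len(obs) * (mean(obs) - grand_mean) ** 2 for obs in grouped.values())


values = [v for obs in cells.values() for v in obs]
grand = mean(values)

ss_cells = ss_of_means(cells, grand)           # SSN: between the 6 cells
ss_group = ss_of_means(pool(cells, 0), grand)  # SSG: age groups
ss_treat = ss_of_means(pool(cells, 1), grand)  # SST: treatments
ss_inter = ss_cells - (ss_treat + ss_group)    # SSTG: interaction
ss_error = sum((v - mean(obs)) ** 2 for obs in cells.values() for v in obs)

df_error = len(values) - len(cells)            # 30 - 6 = 24
mse = ss_error / df_error

for name, ss, df in [
    ("Cells", ss_cells, 5),
    ("Treatment", ss_treat, 2),
    ("Groups", ss_group, 1),
    ("Interaction", ss_inter, 2),
]:
    print(f"{name:<12} SS = {ss:7.3f}  df = {df}  F = {(ss / df) / mse:6.2f}")
print(f"{'Error':<12} SS = {ss_error:7.3f}  df = {df_error}  MS = {mse:.3f}")
```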

Different types of ANOVAs


Using Minitab – General MANOVA

Suppose researchers want to check the effect of a Drug on heart rate and mobility, and perform an experiment to check the difference of the Drug from that of Standard drug and placebo. In order to perform the experiment, researchers work with 36 participants, and expose 12 participants to the Drug, 12 participants to the Standard drug, and 12 participants to the placebo. The results are obtained in the form of ratings from 1 to 10, where higher numbers show increased heart rate and mobility. In Minitab, assign C1 to the Intervention, i.e. Drug, Standard Drug, and Placebo. Enter “Drug” in the first 12 rows, “Standard drug” in the rows from 13 to 24, and “placebo” in the rows from 25 to 36. Assign C2 to the rating for Heart rate, and C3 to the rating for mobility. Write the obtained outcomes for the different interventions and effects. For example,

Now go to “Stat” tab above, go to “ANOVA”, and go to “General MANOVA ”


In the window that appears, move “Heart rate” and “Mobility” to the box with “Responses:”, and move “Intervention” to the box under “Model:”. Click “OK.” The following results are obtained:

General Linear Model: Heart rate, Mobility versus Intervention

MANOVA Tests for Intervention

                   Test                       DF
Criterion          Statistic   F       Num    Denom   P
Wilks’             0.63686     4.049   4      64      0.005
Lawley-Hotelling   0.56570     4.384   4      62      0.003
Pillai’s           0.36601     3.696   4      66      0.009
Roy’s              0.55762

s = 2   m = -0.5   n = 15.0

These results show that the one-way MANOVA is statistically significant, as the p-values are less than 0.05. Therefore, it can be said that there is a statistically significant difference between the interventions on the outcomes. In order to look at the mean values of different treatments (interventions) and outcomes, go to “Stat”, go to “Basic Statistics”, and go to “Display Descriptive Statistics”. In the box under “Variables:” move the columns “Heart rate” and “Mobility”, and in the box under “By variables (optional):” move the column “Intervention”. Click “Statistics” and select only “Mean.” Click “OK.” Click “OK” again. This gives the following results:

Descriptive Statistics: Heart rate, Mobility

Statistics

Variable     Intervention    Mean
Heart rate   Drug            2.583
             placebo         6.500
             Standard Drug   3.500
Mobility     Drug            2.917
             placebo         5.417
             Standard Drug   2.917

The “Drug” shows better results as compared to “Standard Drug” and “Placebo” in case of Heart rate. However, the “Drug” is almost similar to “Standard Drug” in case of Mobility, but they show useful results as compared to “placebo.”

Factor Analysis

Factor analysis is used as a data reduction method or a structure detection method.

Path Analysis

It is an extension of multiple regression.

Structural Equation Modeling

It is a confirmatory technique that is an advancement of path analysis.

Effect size

Effect size shows the magnitude of the difference between two different variables or groups, indicating the strength of the difference between groups on a numeric scale.

Effect size (ES) (Part 1)

• Absolute effect size: the difference between the mean or average outcomes of two groups.

• Standardized means difference: obtained by dividing the difference of means of two groups by their standard deviation.

\[ ES = \frac{\mu_1 - \mu_2}{\sigma} \]

• Odds ratio: the odds of success in one group relative to the odds of success in the other group.

\[ OR = \frac{ad}{bc} \]

• Cohen's d effect size: obtained by dividing the difference of means of two groups by the standard deviation from the data.

\[ d = \frac{\bar{x}_1 - \bar{x}_2}{s}, \qquad s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2}} \]

Effect size (ES) (Part 2)

• Pearson r correlation: the association of two variables (x and y).

• Hedges' g method of effect size: a modified method of Cohen's d.

\[ g = \frac{\bar{x}_1 - \bar{x}_2}{s^*}, \qquad s^* = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]

• Glass's Δ method of effect size: a modified method of Cohen's d that uses the standard deviation of the control group or second group.

\[ \Delta = \frac{\bar{x}_1 - \bar{x}_2}{s_2} \]

Using Minitab

The “Calculator” function could be used to work on effect size.
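The same calculations can be sketched in Python with the standard library. The function names below are illustrative, and the pooled standard deviations follow the definitions given in this section (n1 + n2 in the denominator for Cohen's d, and n1 + n2 – 2 for Hedges' g); note that many references also use n1 + n2 – 2 for Cohen's d. As an assumed illustration, the sketch compares Group D (25% solution) with Group A (placebo) from the one way ANOVA example.

```python
import math
from statistics import mean, stdev, variance


def pooled_sd(x1, x2, denominator):
    """Pooled SD of two samples with a caller-supplied denominator."""
    n1, n2 = len(x1), len(x2)
    numerator = (n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)
    return math.sqrt(numerator / denominator)


def cohen_d(x1, x2):
    """Cohen's d as defined in this section (denominator n1 + n2)."""
    return (mean(x1) - mean(x2)) / pooled_sd(x1, x2, len(x1) + len(x2))


def hedges_g(x1, x2):
    """Hedges' g (denominator n1 + n2 - 2)."""
    return (mean(x1) - mean(x2)) / pooled_sd(x1, x2, len(x1) + len(x2) - 2)


def glass_delta(x1, control):
    """Glass's delta: mean difference divided by the SD of the control group."""
    return (mean(x1) - mean(control)) / stdev(control)


treated = [9, 2, 7, 8, 4]   # Group D, 25% solution
placebo = [3, 3, 5, 1, 4]   # Group A, placebo

d = cohen_d(treated, placebo)
g = hedges_g(treated, placebo)
delta = glass_delta(treated, placebo)
print(f"Cohen's d = {d:.2f}, Hedges' g = {g:.2f}, Glass's delta = {delta:.2f}")
```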

Odds Ratio and Mantel-Haenszel Odds Ratio

Odds Ratio (OR) is used to determine the effect size of the difference between two interventions or treatments. Suppose there is a bacterial disease that could be treated by fish meat, especially rainbow trout. So, we can develop two groups of mice (animal models) to know and compare the treatments’ efficacy. One group receives the standard treatment with the commonly available fishes, and the other group receives rainbow trout. After giving the treatment with fishes, suppose we get the following results.

                 Standard treatment   Treatment with   Odds
                 with fishes          rainbow trout
Mouse Died       454                  19               454/19 = 23.89
Mouse Survived   358                  105              358/105 = 3.41
Totals           812                  124              Odds Ratio = 23.89/3.41 = 7
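The odds ratio arithmetic in the table above can be sketched in Python as follows. The function names are illustrative, and the confidence interval uses the usual log odds ratio approximation with z = 1.96, which is an assumption of this sketch rather than something stated in the text.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table: a, b = died; c, d = survived.

    Columns are (standard treatment, rainbow trout), so the result is
    (a / b) / (c / d) = (a * d) / (b * c).
    """
    return (a * d) / (b * c)


def log_or_confidence_interval(a, b, c, d, z=1.96):
    """Approximate 95% CI from the standard error of log(OR)."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Counts from the table: died (454, 19), survived (358, 105).
estimate = odds_ratio(454, 19, 358, 105)
lower, upper = log_or_confidence_interval(454, 19, 358, 105)
print(f"OR = {estimate:.4f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The unrounded estimate is slightly above 7 because the table rounds the two odds to 23.89 and 3.41 before dividing.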

This table shows that the odds of death for the animals that received the standard treatment with the fishes were about 7 times the odds of death for the animals that received the treatment with rainbow trout.

Now suppose we perform the experiment on male and female animal models and get the following results:

| | | Standard treatment with fishes | Treatment with rainbow trout | Totals | OR |
|---|---|---|---|---|---|
| Female animal models | Animals died | 53 = a | 7 = b | 60 = tfd | 2.67 |
| | Animals survived | 102 = c | 36 = d | 138 = tfs | |
| | Totals | 155 = nfs | 43 = nfr | 198 = nf | |
| Male animal models | Animals died | 111 = e | 13 = f | 124 = tmd | 3.99 |
| | Animals survived | 152 = g | 71 = h | 223 = tms | |
| | Totals | 263 = nms | 84 = nmr | 347 = nm | |

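The table shows that the impact of treatment is greater in males, as the OR in male animal models is higher than the OR in female animal models. In order to check the impact of treatment while taking the different sexes into account, the “weighted” OR, or Mantel-Haenszel OR (ORMH), is used. ORMH is as follows:

$$OR_{MH} = \frac{\dfrac{a \times d}{n_f} + \dfrac{e \times h}{n_m}}{\dfrac{b \times c}{n_f} + \dfrac{f \times g}{n_m}} = \frac{\dfrac{53 \times 36}{198} + \dfrac{111 \times 71}{347}}{\dfrac{7 \times 102}{198} + \dfrac{13 \times 152}{347}} \approx 3.48$$

Here nf = 53 + 7 + 102 + 36 = 198 and nm = 111 + 13 + 152 + 71 = 347 are the total numbers of female and male animal models.

This shows that the weighted odds of death with the standard treatment are about 3.5 times the odds of death of animal models treated with rainbow trout.

Using Minitab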

In order to work on Odds Ratio, we can use the following data as an example:

| | Standard treatment with fishes | Treatment with rainbow trout |
|---|---|---|
| Mouse Died | 454 | 19 |
| Mouse Survived | 358 | 105 |

In Minitab, we have to put this table as follows:

| C1 | C2 | C3 |
|---|---|---|
| Standard Treatment | Mouse Died | 454.00 |
| Treatment with Rainbow Trout | Mouse Died | 19.00 |
| Standard Treatment | Mouse Survived | 358.00 |
| Treatment with Rainbow Trout | Mouse Survived | 105.00 |

Now go to “Stat”, go to “Regression”, go to “Binary Logistic Regression”, and click “Fit Binary Logistic Model...” In “Response:” enter C2; in “Frequency:” enter C3, and in “Categorical predictors:” enter C1. Click “OK.” Results appear in which “Odds Ratio” is 7.0082.

Odds Ratios for Categorical Predictors

| | Level A | Level B | Odds Ratio | 95% CI |
|---|---|---|---|---|
| C1 | Treatment with Rainbow Trout | Standard Treatment | 7.0082 | (4.2173, 11.6462) |

Odds ratio for level A relative to level B

In order to work on ORMH, we can use the following data:

| | | Standard treatment with fishes | Treatment with rainbow trout | Totals | OR |
|---|---|---|---|---|---|
| Female animal models | Animals died | 53 = a | 7 = b | 60 = tfd | 2.67 |
| | Animals survived | 102 = c | 36 = d | 138 = tfs | |
| | Totals | 155 = nfs | 43 = nfr | 198 = nf | |
| Male animal models | Animals died | 111 = e | 13 = f | 124 = tmd | 3.99 |
| | Animals survived | 152 = g | 71 = h | 223 = tms | |
| | Totals | 263 = nms | 84 = nmr | 347 = nm | |

However, in Minitab we have to put the data in the following way:

| C1 | C2 | C3 | C4 |
|---|---|---|---|
| female | standard treatment | animal died | 53.00 |
| male | standard treatment | animal died | 111.00 |
| female | treatment with rai | animal died | 7.00 |
| male | treatment with rai | animal died | 13.00 |
| female | standard treatment | animal surv | 102.00 |
| male | standard treatment | animal surv | 152.00 |
| female | treatment with rai | animal surv | 36.00 |
| male | treatment with rai | animal surv | 71.00 |

Now go to “Stat”, go to “Tables”, go to “Cross Tabulation and Chi-Square...” Select “Raw data (categorical variables)” from the drop down menu.

Move C2 to “Rows:”, C3 to “Columns:”, C1 to “Layers:”, and C4 to “Frequencies:”. Click “Other Stats...” and select “Cochran-Mantel-Haenszel test for multiple tables” and click “OK.” Click “OK.” This gives us the result for Mantel-Haenszel chi-squared test.

Cochran-Mantel-Haenszel Test

| Common Odds Ratio | CMH Statistic | DF | P-Value |
|---|---|---|---|
| 3.47808 | 23.3116 | 1 | 0.0000014 |

Results for all 2x2 tables

In this result, we have “Common odds ratio” equal to 3.478.
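The odds ratio and the Mantel-Haenszel odds ratio from this section can also be checked with a short Python sketch. The function names are assumptions made for this illustration. Each 2x2 table is passed as the four cell counts (died with standard treatment, died with rainbow trout, survived with standard treatment, survived with rainbow trout), and the size of each stratum is computed from those four counts.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio (a*d)/(b*c) for one 2x2 table."""
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Weighted (Mantel-Haenszel) odds ratio over several 2x2 tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Single table: 454 and 19 mice died, 358 and 105 mice survived
overall_or = odds_ratio(454, 19, 358, 105)

# Stratified tables: female and male animal models
female = (53, 7, 102, 36)
male = (111, 13, 152, 71)
or_mh = mantel_haenszel_or([female, male])

print(round(overall_or, 4))          # odds ratio of the single table
print(round(odds_ratio(*female), 2)) # OR in female animal models
print(round(odds_ratio(*male), 2))   # OR in male animal models
print(round(or_mh, 3))               # common (Mantel-Haenszel) odds ratio
```

With these counts the sketch agrees with the Minitab results quoted in this section, i.e. an odds ratio of about 7.008 and a common odds ratio of about 3.478.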


Correlation Coefficient

Suppose research is performed on young animal models of approximately the same weight. Initially, the animals were injected with the disease-causing bacteria and kept in the lab for five days. After 5 days, their physical condition was checked and their weight was assessed. The animals were then provided with a specific amount of rainbow trout, and after 5 more days their physical condition was checked and their weight was assessed again. Suppose the assessment gives the following results:

| Animal Model # | Gram of rainbow trout per day (i.e. feed of animal models for 5 days) | Weight of animals (gms) after providing the specific amount of rainbow trout per day |
|---|---|---|
| 1 | 1 | 7 |
| 2 | 2 | 10 |
| 3 | 3 | 15 |
| 4 | 4 | 16 |
| 5 | 5 | 17 |
| 6 | 6 | 18 |

This information helps in providing the following graph:

Correlation coefficient, which is also represented by ‘r’, can help in finding whether increased grams of rainbow trout are associated with an increased level of improvement in animal models. The sample correlation coefficient varies from -1 to +1. The sign of r shows the direction of the linear association between the two variables, i.e. grams of rainbow trout given to the animal models and improvement in their condition, and the absolute value of r shows its strength: values close to -1 or +1 show a strong linear association, and a value close to zero shows that there is little or no linear relation between the two variables. However, before going further, it is important to know that in this case, grams of rainbow trout is the independent variable and presented on x-axis, and weight of animals is the dependent variable and presented on y-axis. So, we develop a scatter diagram as shown here. Each point on the diagram shows an (x,y) pair. This scatter diagram apparently shows a positive or direct relation between the two variables, i.e. an increased amount of fish taken per day is associated with a higher level of improvement. We can also develop a table showing the total and mean values of grams of rainbow trout given to the animal models and their weight.

| Animal Model # | Gram of rainbow trout per day (i.e. feed of animal models for 5 days) | Weight of animals after providing a specific amount of rainbow trout |
|---|---|---|
| 1 | 1 | 7 |
| 2 | 2 | 10 |
| 3 | 3 | 15 |
| 4 | 4 | 16 |
| 5 | 5 | 17 |
| 6 | 6 | 18 |
| Total | 21 | 83 |
| Mean value | 3.5 = X-mean | 13.8 = Y-mean |

In order to work on sample correlation coefficient, variance of the values on Y-axis, variance of the values on X-axis, and the covariance of the values on both X-axis and Y-axis [Cov(x,y)] are required. Variance of the values on X-axis is as follows:

| Animal Model # | Gram of rainbow trout per day (i.e. feed of animal models for 5 days) | Grams – X-mean | (Grams – X-mean)² |
|---|---|---|---|
| 1 | 1 | -2.5 | 6.25 |
| 2 | 2 | -1.5 | 2.25 |
| 3 | 3 | -0.5 | 0.25 |
| 4 | 4 | 0.5 | 0.25 |
| 5 | 5 | 1.5 | 2.25 |
| 6 | 6 | 2.5 | 6.25 |
| Total | 21 | 0 | 17.5 |

The variance of the values on X-axis = 17.5/6 = 2.9. Variance of the values on Y-axis is as follows:

| Animal Model # | Weight of animals after providing a specific amount of rainbow trout | Weight – Y-mean | (Weight – Y-mean)² |
|---|---|---|---|
| 1 | 7 | -6.83 | 46.69 |
| 2 | 10 | -3.83 | 14.69 |
| 3 | 15 | 1.17 | 1.36 |
| 4 | 16 | 2.17 | 4.69 |
| 5 | 17 | 3.17 | 10.03 |
| 6 | 18 | 4.17 | 17.36 |
| Total | 83 | ≈0 | 94.83 |

The variance of the values on Y-axis = 94.83/6 = 15.8. The covariance of the values of X-axis and Y-axis is presented as:

| Animal Model # | Grams – X-mean | Weight – Y-mean | (Grams – X-mean)(Weight – Y-mean) |
|---|---|---|---|
| 1 | -2.5 | -6.83 | 17.075 |
| 2 | -1.5 | -3.83 | 5.745 |
| 3 | -0.5 | 1.17 | -0.585 |
| 4 | 0.5 | 2.17 | 1.085 |
| 5 | 1.5 | 3.17 | 4.755 |
| 6 | 2.5 | 4.17 | 10.425 |
| Total | | | 38.5 |

The covariance of the values on X-axis and the values on Y-axis = Cov(x,y) = 38.5/6 = 6.4. The formula for calculation of the sample correlation coefficient is as follows:

$$r = \frac{Cov(x,y)}{\sqrt{Var(x) \times Var(y)}} = \frac{6.4}{\sqrt{2.9 \times 15.8}}$$

r = 6.4 / 6.8

r = 0.94

This value of sample correlation coefficient clearly shows a strong positive correlation.
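The hand calculation above can be reproduced with a minimal Python sketch. The variable names are assumptions made for this illustration. The variances and the covariance are divided by n = 6, as in the tables above; dividing all three by n - 1 instead would give the same r, because the divisor cancels.

```python
import math

grams = [1, 2, 3, 4, 5, 6]         # x: grams of rainbow trout per day
weight = [7, 10, 15, 16, 17, 18]   # y: weight of animals (gms)
n = len(grams)

x_mean = sum(grams) / n
y_mean = sum(weight) / n

# Sums of squared deviations and cross-products, each divided by n
var_x = sum((x - x_mean) ** 2 for x in grams) / n
var_y = sum((y - y_mean) ** 2 for y in weight) / n
cov_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(grams, weight)) / n

r = cov_xy / math.sqrt(var_x * var_y)
print(round(var_x, 1), round(var_y, 1), round(cov_xy, 1), round(r, 3))
```

Without intermediate rounding, r is about 0.945, which matches the Minitab result reported in this section.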

Using Minitab

In order to work on Correlation Coefficient, we can use the following data:

| Animal Model # | Gram of rainbow trout per day (i.e. feed of animal models for 5 days) | Weight of animals (gms) after providing the specific amount of rainbow trout per day |
|---|---|---|
| 1 | 1 | 7 |
| 2 | 2 | 10 |
| 3 | 3 | 15 |
| 4 | 4 | 16 |
| 5 | 5 | 17 |
| 6 | 6 | 18 |

Enter the values in the columns. In this case, the values of “Gram of rainbow trout per day (i.e. feed of animal models for 5 days)” are entered in the first column, i.e. C1, and the values of “Weight of animals (gms) after providing the specific amount of rainbow trout per day” are entered in the second column, i.e. C2. Now go to “Stat”, go to “Basic Statistics”, and click “Correlation...” In the box under “Variables:” move the variables, i.e. C1 and C2. Click the button “Options...”, select “Pearson correlation” from the drop down menu with “Method:”, and click “OK.” This gives the result of 0.945. This value of sample correlation coefficient clearly shows a strong positive correlation.

R-squared and Adjusted R-squared

R-squared estimates the proportion of the variation in one variable (dependent variable) that is explained by the variation in a second variable (independent variable). Adjusted R-squared adjusts this statistic on the basis of the number of independent variables in the model.

Regression Analysis

Simple linear regression on the obtained data can be performed to estimate the relationships between variables.

For simple linear regression analysis, we can consider the data as mentioned above and develop the following table:

| Animal Model # | Values on X-axis | (Values on X-axis)² | Values on Y-axis | (Values on Y-axis)² | (Values on X-axis)(Values on Y-axis) |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 7 | 49 | 7 |
| 2 | 2 | 4 | 10 | 100 | 20 |
| 3 | 3 | 9 | 15 | 225 | 45 |
| 4 | 4 | 16 | 16 | 256 | 64 |
| 5 | 5 | 25 | 17 | 289 | 85 |
| 6 | 6 | 36 | 18 | 324 | 108 |
| Mean | 3.5 | | 13.83 | | |
| Sum Total | 21 = Ʃx | 91 = Ʃx² | 83 = Ʃy | 1243 = Ʃy² | 329 = Ʃxy |
| Square of the sum total | 441 = (Ʃx)² | | 6889 = (Ʃy)² | | |

Now we need three corrected sums of squares, i.e. SST, SSX, and SSXY. So,

$$SST = \Sigma y^2 - \frac{(\Sigma y)^2}{n} = 1243 - \frac{6889}{6} = 94.83$$

Then,

$$SSX = \Sigma x^2 - \frac{(\Sigma x)^2}{n} = 91 - \frac{441}{6} = 17.5$$

Then,

$$SSXY = \Sigma xy - \frac{(\Sigma x)(\Sigma y)}{n} = 329 - \frac{21 \times 83}{6} = 38.5$$

Now the slope of the best fit line, which is represented by b, and the intercept, which is represented by a, can be calculated by the following equations:

$$b = \frac{SSXY}{SSX} = \frac{38.5}{17.5} = 2.2$$

And

a = Mean of the values on Y-axis – b × Mean of the values on X-axis

a = 13.8 – 2.2 × 3.5 = 6.1

Now, we will calculate the regression sum of squares (SSR) and the error sum of squares (SSE) as follows:

SSR = b × SSXY

SSR = 2.2 × 38.5 = 84.7

And

SSE = SST – SSR

SSE = 94.83 – 84.7 = 10.13

A completed ANOVA table with all these values can be represented as follows:

| Source | SS | df | MS | F |
|---|---|---|---|---|
| Regression | 84.7 | 1 | 84.7 | 33.48 |
| Error | 10.13 | 4 | 2.53 | |
| Total | 94.83 | 5 | | |

The calculated F value of 33.48 is much larger than the critical value of 7.71 at α=0.05 with 1 degree of freedom in the numerator and 4 degrees of freedom in the denominator. This clearly shows that we have to reject the null hypothesis that the slope of the line is zero, and we can confidently work on the increased amount of rainbow trout for increasing the level of improvement in case of the disease. Standard errors of the slope (SEb) and intercept (SEa) can be calculated as follows:

$$SE_b = \sqrt{\frac{MS_{Error}}{SSX}} = \sqrt{\frac{2.53}{17.5}} = 0.38$$

$$SE_a = \sqrt{\frac{MS_{Error} \times \Sigma x^2}{n \times SSX}} = \sqrt{\frac{2.53 \times 91}{6 \times 17.5}} = 1.48$$

In order to calculate 95% confidence intervals for the slope and intercept, the 2-tailed value of Student’s t with 4 degrees of freedom is required. It is 2.776. This value is then multiplied with SEa and SEb to get the 95% confidence interval for intercept and slope respectively. So,

a = 6.1 ± (SEa x 2.776) = 6.1 ± (1.48 x 2.776) = 6.1 ± 4.11

b = 2.2 ± (SEb x 2.776) = 2.2 ± (0.38 x 2.776) = 2.2 ± 1.05 = 1.15, 3.25

This shows that we are 95% confident that the average weight of animal models increases by between 1.15 and 3.25 grams per one gram increase of rainbow trout in the diet. Moreover, this confidence interval does not contain zero, thereby showing a significant relationship between the two variables at alpha level of 0.05.

Using Minitab

In order to work on Regression Analysis, we can use the following data:

| Animal Model # | Gram of rainbow trout per day (i.e. feed of animal models for 5 days) | Weight of animals (gms) after providing the specific amount of rainbow trout per day |
|---|---|---|
| 1 | 1 | 7 |
| 2 | 2 | 10 |
| 3 | 3 | 15 |
| 4 | 4 | 16 |
| 5 | 5 | 17 |
| 6 | 6 | 18 |

Enter the values in the columns. In this case, the values of “Gram of rainbow trout per day (i.e. feed of animal models for 5 days)” are entered in the first column, i.e. C1, and the values of “Weight of animals (gms) after providing the specific amount of rainbow trout per day” are entered in the second column, i.e. C2. Now go to “Stat”, go to “Regression”, go to “Regression”, and click “Fit Regression Model...” In the box under “Responses:” move C2, and in the box under “Continuous predictors:” move C1. Click “OK.” This gives the results.

Regression Analysis: C2 versus C1

Analysis of Variance

| Source | DF | Adj SS | Adj MS | F-Value | P-Value |
|---|---|---|---|---|---|
| Regression | 1 | 84.70 | 84.700 | 33.43 | 0.004 |
| C1 | 1 | 84.70 | 84.700 | 33.43 | 0.004 |
| Error | 4 | 10.13 | 2.533 | | |
| Total | 5 | 94.83 | | | |

Coefficients

| Term | Coef | SE Coef | T-Value | P-Value | VIF |
|---|---|---|---|---|---|
| Constant | 6.13 | 1.48 | 4.14 | 0.014 | |
| C1 | 2.200 | 0.380 | 5.78 | 0.004 | 1.00 |


In the results, F-Value is 33.43 and p-value is 0.004. Coefficient of Constant is 6.133 and Coefficient of C1, i.e. “Gram of rainbow trout per day (i.e. feed of animal models for 5 days)” is 2.2. With p-values less than 0.05, both coefficients show statistical significance.
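The regression quantities of this section can also be reproduced with a short Python sketch based on the corrected sums of squares. The variable names are assumptions made for this illustration, and the standard error formulas are the usual ones for simple linear regression.

```python
import math

x = [1, 2, 3, 4, 5, 6]          # grams of rainbow trout per day
y = [7, 10, 15, 16, 17, 18]     # weight of animals (gms)
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)

# Corrected sums of squares
sst = sum(v * v for v in y) - sum_y ** 2 / n
ssx = sum_x2 - sum_x ** 2 / n
ssxy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n

b = ssxy / ssx                      # slope
a = sum_y / n - b * sum_x / n       # intercept

ssr = b * ssxy                      # regression sum of squares
sse = sst - ssr                     # error sum of squares
mse = sse / (n - 2)                 # error mean square, 4 degrees of freedom
f_value = ssr / mse

se_b = math.sqrt(mse / ssx)                  # standard error of the slope
se_a = math.sqrt(mse * sum_x2 / (n * ssx))   # standard error of the intercept

print(round(a, 3), round(b, 3), round(f_value, 2), round(se_a, 2), round(se_b, 3))
```

Carrying full precision through the calculation gives values that agree with the Minitab output shown in this section; small differences from the hand calculation come from rounding intermediate results.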

Logistic Regression

Using Minitab

In order to work on logistic regression, we can take the same example as noted above: In the first column C1, enter 0.50 in the first row, 0.75 in the second row, 1.00 in the third row, 1.25 in the fourth row, and so on, until 5.50 in the 20th row. This column is named as “Hours.” In the second column C2, enter 0 in the first row, 0 in the second row, 0 in the third row, and so on, until 1 in the 20th row. This column is named as “Pass.”


In the third column C3, enter 1 in all the rows up to the 20th row (as there are 20 participants and every participant has been checked once). This column is named as “Number of trials.” Now go to “Stat” tab above, go to “Regression”, go to “Binary Logistic Regression”, and select “Fit Binary Logistic Model...” In the appeared window, select “Response in event/trial format” from the drop down menu. Write “Pass” in the box with “Event name:”, move “Pass” into the box with “Number of events:”, move “Number of trials” into the box with “Number of trials:”, and “Hours” into the box under “Continuous predictors:”. You may also work on “Graphs...”, “Options...” and “Results...”

Click “OK.”

Binary Logistic Regression: Pass versus Hours

Method

| Link function | Logit |
|---|---|
| Rows used | 20 |

Response Information

| Variable | Value | Count | Event Name |
|---|---|---|---|
| Pass | Event | 10 | Pass |
| | Non-event | 10 | |
| Number of trials | Total | 20 | |

Analysis of Variance (Wald Test)

| Source | DF | Chi-Square | P-Value |
|---|---|---|---|
| Regression | 1 | 5.73 | 0.017 |
| Hours | 1 | 5.73 | 0.017 |

Model Summary

| Deviance R-Sq | Deviance R-Sq(adj) | AIC |
|---|---|---|
| 42.08% | 38.47% | 20.06 |

Coefficients

| Term | Coef | SE Coef | VIF |
|---|---|---|---|
| Constant | -4.08 | 1.76 | |
| Hours | 1.505 | 0.629 | 1.00 |

Odds Ratios for Continuous Predictors

| | Odds Ratio | 95% CI |
|---|---|---|
| Hours | 4.5026 | (1.3131, 15.4392) |

Regression Equation

P(Pass) = exp(Y')/(1 + exp(Y'))

Y' = -4.08 + 1.505 Hours

Goodness-of-Fit Tests

| Test | DF | Chi-Square | P-Value |
|---|---|---|---|
| Deviance | 18 | 16.06 | 0.588 |
| Pearson | 18 | 14.60 | 0.689 |
| Hosmer-Lemeshow | 8 | 3.07 | 0.930 |
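To see how the fitted regression equation from the Minitab output is used, the following minimal Python sketch evaluates P(Pass) for a given number of hours. The function name and the chosen numbers of hours are assumptions made for this illustration; the coefficients are the rounded values from the output, so the odds ratio computed here differs slightly from the 4.5026 reported by Minitab.

```python
import math

INTERCEPT = -4.08   # "Constant" coefficient from the output
SLOPE = 1.505       # "Hours" coefficient from the output


def probability_of_pass(hours):
    """P(Pass) = exp(Y') / (1 + exp(Y')) with Y' = -4.08 + 1.505 * Hours."""
    y_prime = INTERCEPT + SLOPE * hours
    return math.exp(y_prime) / (1 + math.exp(y_prime))


# Odds ratio for a one-hour increase, from the rounded slope
odds_ratio_per_hour = math.exp(SLOPE)

for hours in (1, 2, 3, 4, 5):
    print(hours, round(probability_of_pass(hours), 3))
print(round(odds_ratio_per_hour, 2))
```

The estimated probability of passing is exactly 0.5 where Y' equals zero, i.e. at about 4.08/1.505 ≈ 2.7 hours.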

Black-Scholes model

It was developed by Fischer Black, Robert Merton and Myron Scholes in the year 1973. It is most commonly used to price European-style options in financial markets.

Using Minitab

The “Calculator” function could be used to work on the Black-Scholes model.
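As an illustration of the kind of calculation involved, the following Python sketch evaluates the standard closed-form Black-Scholes price of a European call option on a non-dividend-paying stock. The formula is not written out in this section, so it is taken here as an assumption from the standard statement of the model; the function names and all input values are hypothetical.

```python
import math


def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def black_scholes_call(spot, strike, years, rate, volatility):
    """Price of a European call option (no dividends)."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * volatility ** 2) * years) / (
        volatility * math.sqrt(years)
    )
    d2 = d1 - volatility * math.sqrt(years)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)


# Hypothetical inputs: stock at 100, strike 100, 1 year, 5% rate, 20% volatility
price = black_scholes_call(spot=100, strike=100, years=1, rate=0.05, volatility=0.2)
print(round(price, 2))
```

The same expression could be typed into the Minitab “Calculator” once the normal cumulative probabilities are available in the worksheet.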

Combination

A selection of items or a set of objects made without considering the order of selection.

Using Minitab

In order to work on Minitab, we can use the same example of combination of 2 days at a time.

Go to “Calc” tab above, and click “Calculator...”

In the window “Calculator”, in the “Store result in variable:” in the right box enter a column, such as C3. In the box below “Expression:” enter “COMBINATIONS(7,2)” (where 7 shows the “number of items” and 2 shows the “number to choose”) and click “OK.” The result will appear in the spreadsheet. In this case, the result (combination) is 21.

Permutation

It is an ordered arrangement of items or a set of objects.

Using Minitab

In order to work on Minitab, we can use the same example of permutation of 2 days at a time.

Go to “Calc” tab above, and click “Calculator...”

In the window “Calculator”, in the “Store result in variable:” in the right box enter a column, such as C3. In the box below “Expression:” enter “PERMUTATIONS(7,2)” (where 7 shows the “number of items” and 2 shows the “number to choose”) and click “OK.” The result will appear in the spreadsheet. In this case, the result (permutation) is 42.
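The two Minitab results, COMBINATIONS(7,2) and PERMUTATIONS(7,2), can be cross-checked with Python's standard library. This is a small sketch; the list of day names is only an assumption used to enumerate the selections explicitly.

```python
import math
from itertools import combinations, permutations

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Closed-form counts: n! / (k! (n - k)!) and n! / (n - k)!
n_combinations = math.comb(7, 2)
n_permutations = math.perm(7, 2)

# The same counts obtained by listing every selection of 2 days
listed_combinations = list(combinations(days, 2))   # order ignored
listed_permutations = list(permutations(days, 2))   # order matters

print(n_combinations, len(listed_combinations))
print(n_permutations, len(listed_permutations))
```

Each unordered pair of days corresponds to two ordered pairs, which is why the number of permutations is twice the number of combinations when 2 items are chosen.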

Even and Odd Permutations

An even permutation is obtained by composing zero or an even number of inversions or swaps of two elements. An odd permutation is obtained by composing an odd number of inversions or swaps of two elements.
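One way to classify a permutation as even or odd is to count its inversions, i.e. the pairs of elements that appear in the wrong order. The following Python sketch does this; the function name and the example arrangements are assumptions made for the illustration.

```python
def permutation_parity(arrangement):
    """Return 'even' or 'odd' by counting inversions in the arrangement."""
    inversions = sum(
        1
        for i in range(len(arrangement))
        for j in range(i + 1, len(arrangement))
        if arrangement[i] > arrangement[j]
    )
    return "even" if inversions % 2 == 0 else "odd"


print(permutation_parity([1, 2, 3]))  # no inversions
print(permutation_parity([2, 1, 3]))  # one swap of two elements
print(permutation_parity([2, 3, 1]))  # two inversions
```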


Circular Permutation

It refers to the total number of ways to arrange n distinct objects, such as w, x, y, and z, around a fixed circle, which is (n − 1)!.

Survival Analysis

A model for the time from the start of a study (baseline) to the happening of a certain event. For example, the study may start at birth and the event is death, when the study ends. Survival analysis gives survival data. This data consists of completely observed data and censored data, which is also known as missing data.

Survival analysis: a model for time from the start of a study (baseline, or first event) to the happening of a certain event (second event).

Baseline examples: birth, manufacture of component, first heart attack.

Event examples: death, component failure, second heart attack.

Survival analysis gives survival data, which consists of 2 types of data:

1. Completely observed data
2. Censored data (also known as right censoring): missing data that may occur during the study, when (1) the subject (also known as sample) doesn't have the event of interest (i.e. second event), and/or (2) the data is lost.

Survival data is used to show the probability of surviving (probability of survival) in the form of a graph.

[Graph: survival probability (0 to 1) on the y-axis against time in years (0 to 120) on the x-axis]

This graph shows that:
• At 30 years, the survival probability is about 0.95, i.e. 95%,
• At 60 years, the survival probability is about 0.4, i.e. 40%,
• At 100 years, the survival probability is about 0.05, i.e. 5%.

Kaplan-Meier method

One of the most commonly used nonparametric methods for survival analysis is the Kaplan-Meier method, which is also known as the Product Limit method.

Suppose we have the following data in which 12 participants (over the age of 50 years) were studied for 25 years until they died, were lost to follow-up, or the study ended (and remember this data is just for understanding and can’t take the place of original research).

| Participant number | Year of death | Year of last contact |
|---|---|---|
| 1 | | 25 |
| 2 | | 16 |
| 3 | 4 | |
| 4 | | 8 |
| 5 | | 17 |
| 6 | | 25 |
| 7 | | 10 |
| 8 | | 10 |
| 9 | 14 | |
| 10 | 2 | |
| 11 | | 14 |
| 12 | | 18 |

This table shows that 3 participants died before the completion of the study, and 2 participants completed the study. The Kaplan-Meier method utilizes the following formula to compute survival probability:

$$S_{t(i)} = S_{t(i)-1} \times \frac{N_i - D_i}{N_i}$$

In this formula, Ni shows number at risk, Di shows number of deaths, St(i) shows survival probability, and St(i)-1 shows survival probability just previous to the present one. We can use the life table approach for the Kaplan-Meier method.

| Time (years) | Number at risk (Ni) | Number of deaths (Di) | Number censored (C) | Survival probability (St(i) = St(i)-1*((Ni-Di)/Ni)) |
|---|---|---|---|---|
| 2 | 12 | 1 | | 1*((12-1)/12)=0.917 |
| 4 | 11 | 1 | | 0.917*((11-1)/11)=0.834 |
| 8 | 10 | | 1 | 0.834*((10-0)/10)=0.834 |
| 10 | 9 | | 2 | 0.834*((9-0)/9)=0.834 |
| 14 | 7 | 1 | 1 | 0.834*((7-1)/7)=0.715 |
| 16 | 5 | | 1 | 0.715*((5-0)/5)=0.715 |
| 17 | 4 | | 1 | 0.715*((4-0)/4)=0.715 |
| 18 | 3 | | 1 | 0.715*((3-0)/3)=0.715 |
| 25 | 2 | | 2 | 0.715*((2-0)/2)=0.715 |

Roughly, it can also be represented by the following graph:

[Graph: survival probability (0 to 1) on the y-axis against time in years (0 to 30) on the x-axis]

This shows that at 4 years after the start of study, the survival probability is about 0.834, i.e. 83.4%, and the survival probability at 17 years and 18 years is the same, i.e. 0.715 or 71.5%.

Using Minitab

In order to work on Minitab, we can use the same example as noted above. However, the above table is written as:

| Observation | Status |
|---|---|
| 25 | 1 |
| 16 | 0 |
| 4 | 1 |
| 8 | 0 |
| 17 | 0 |
| 25 | 1 |
| 10 | 0 |
| 10 | 0 |
| 14 | 1 |
| 2 | 1 |
| 14 | 0 |
| 18 | 0 |

In this table, “0” shows no event, and “1” shows the occurrence of event, i.e. the time related to the participants who either died or completed the study. Enter these values in the first two columns, i.e. C1 and C2. Now go to “Stat”, go to “Reliability/Survival”, go to “Distribution Analysis (Right Censoring)”, and select “Nonparametric Distribution Analysis...”. In the window “Nonparametric Distribution Analysis-Right Censoring”, move C1 to the box below “Variables:”. Press the button “Censor...”, move C2 to the box below “Use censoring columns:”, enter “0” in the box with “Censoring value:”, and click “OK”. Click “OK” again. This gives us the results and graph as follows:

Kaplan-Meier Estimates

| Time | Number at Risk | Number Failed | Survival Probability | Standard Error | 95.0% Normal CI Lower | 95.0% Normal CI Upper |
|---|---|---|---|---|---|---|
| 2 | 12 | 1 | 0.916667 | 0.079786 | 0.760290 | 1.00000 |
| 4 | 11 | 1 | 0.833333 | 0.107583 | 0.622475 | 1.00000 |
| 14 | 7 | 1 | 0.714286 | 0.143705 | 0.432629 | 0.99594 |
| 25 | 2 | 2 | 0.000000 | 0.000000 | 0.000000 | 0.00000 |


This shows that at 4 years after the start of study, the survival probability is about 0.834, i.e. 83.4%, and the survival probability at 17 years and 18 years is almost same, i.e. 0.715 or 71.5%. This graph assumes that after 25 years, there is no further survival probability.
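The life table calculation of the Kaplan-Meier method can be sketched in Python as follows. The function name and the data layout are assumptions made for this illustration. Note one difference from the Minitab worksheet above: here only the three deaths are coded as events, and the two participants who completed the study at 25 years are treated as censored, as in the hand-made life table, so the survival probability does not drop to zero at 25 years.

```python
def kaplan_meier(times, died):
    """Return (time, number at risk, deaths, survival) at each death time."""
    survival = 1.0
    rows = []
    death_times = sorted({t for t, d in zip(times, died) if d == 1})
    for t in death_times:
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, d in zip(times, died) if ti == t and d == 1)
        # Product-limit step: S(t) = S(previous) * (N - D) / N
        survival *= (at_risk - deaths) / at_risk
        rows.append((t, at_risk, deaths, survival))
    return rows


# Year of death or last contact for the 12 participants
times = [25, 16, 4, 8, 17, 25, 10, 10, 14, 2, 14, 18]
# 1 = died, 0 = censored (lost to follow-up or completed the study)
died = [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0]

km_rows = kaplan_meier(times, died)
for time, at_risk, deaths, survival in km_rows:
    print(time, at_risk, deaths, round(survival, 3))
```

Censored times change only the number at risk at later death times, which is why the survival probability stays constant between deaths.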


Bonus Topics

Most commonly used non-normal distributions in health, education, and social sciences

The most commonly used non-normal distributions in health sciences, education, and social sciences have been obtained from the paper “Non-normal Distributions Commonly Used in Health, Education, and Social Sciences: A Systematic Review” by Roser Bono, María J. Blanca, Jaume Arnau, and Juana Gómez-Benito, published in the journal Frontiers in Psychology.

Circular permutation in Nature

For example, in circularly permuted proteins, the order of amino acids in the peptide sequence is changed.

Time Series

A time series is a series of data points that are recorded at specific times or time intervals. It is represented by a timeplot in which values (variables) are shown on the y-axis and time is displayed on the x-axis. Time series help in showing variations in data with the passage of time. This is useful in finding patterns in the data and in making predictions regarding the values (variables) in relation to time. The patterns of time series are often difficult to analyze for reasons such as a large number of data points at equal intervals. In these situations, the technique of “smoothing” is used, in which a line graph is utilized without the series of dots. This technique is also considered important in the prediction of future events, for example, predicting whether the stock market would go up or down. It is also helpful in spotting outliers. One of the simple ways of smoothing timeplots is by utilizing the moving average. In time series data, periodic fluctuations often appear, and these fluctuations are referred to as “seasonality”. It is important to note that seasonality in time series may occur over any time period in the data; for example, the daily air temperature may increase in the middle of the day and decrease during the night time.
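Smoothing with a moving average can be sketched in Python as follows. The function name and the window length of 3 are assumptions made for this illustration; the eight values are the daily values used in the Minitab example of this section.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and the number of values")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


values = [3, 2, 7, 5, 10, 9, 8, 5]   # values recorded on days 1 to 8
smoothed = moving_average(values, window=3)
print([round(v, 2) for v in smoothed])
```

A longer window gives a smoother line but fewer points and a slower response to real changes in the series.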


Using Minitab

Suppose we have the following data:

| Days | Values |
|---|---|
| 1 | 3 |
| 2 | 2 |
| 3 | 7 |
| 4 | 5 |
| 5 | 10 |
| 6 | 9 |
| 7 | 8 |
| 8 | 5 |

Put these values in the columns C1 and C2. Go to “Stat” tab above, go to “Time Series”, and select “Time Series Plot...” Select “Simple” and click “OK.” Move the columns C1 and C2 to the box under “Series:” and click “Time/Scale...” Select “Calendar” under “Time Scale”, and click “OK.” Again click “OK.” The following graph is obtained:

[Graph: Time Series Plot of C2, with C2 (1 to 10) on the y-axis and Day on the x-axis]

In order to get smoothing (line), go to “Stat” tab above, go to “Time Series”, and select “Time Series Plot...” Select “Simple” and click “OK.” Move the columns C1 and C2 to the box under “Series:” and click “Time/Scale...” Select “Calendar” under “Time Scale”, and click “OK.” Now click “Data View...”, select “Smoother”, select “Lowess”, and click “OK.” Again click “OK.” The following graph is obtained:

[Graph: Time Series Plot of C2 with Lowess smoother, with C2 (1 to 10) on the y-axis and Day on the x-axis]

In order to get information about the trend, go to “Stat” tab above, go to “Time Series”, and select “Trend Analysis...” Move C2 to the box with “Variable:” Click “Time...”, select “Calendar:” (Day) under “Time Scale”, and click “OK.” Again click “OK.” The following graph is obtained:

This graph shows an upward trend.

Monte Carlo Simulation

The Monte Carlo method is a numerical method of solving mathematical problems by repeated random sampling of a large number of values. It is used for obtaining numerical solutions to problems that are too complicated to solve analytically.
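A classic small illustration of the Monte Carlo method is the estimation of π by random sampling. The following Python sketch is only an example chosen for this purpose and is not taken from the book; the function name, the number of samples, and the seed are assumptions.

```python
import random


def estimate_pi(n_samples, seed=0):
    """Estimate pi from random points in the unit square.

    The fraction of points that fall inside the quarter circle of
    radius 1 approximates pi / 4.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


print(estimate_pi(100_000))
```

The estimate becomes more precise as the number of random samples increases, which is the general behavior of Monte Carlo methods.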


Density Estimation

Density estimation is the process of estimating a continuous density field from a discretely sampled set of data-points obtained from that density field.
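One common way to do this is kernel density estimation, in which a small bell-shaped curve is placed on every data point and the curves are averaged. The following Python sketch uses a Gaussian kernel; the choice of method, the function name, the sample values, and the bandwidth are all assumptions made for this illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at any point x."""
    n = len(samples)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        ) / norm

    return density


samples = [3, 2, 7, 5, 10, 9, 8, 5]   # hypothetical observed data points
density = gaussian_kde(samples, bandwidth=1.5)

for x in (0, 5, 10, 15):
    print(x, round(density(x), 4))
```

The bandwidth controls the smoothness of the estimate: a small bandwidth follows the individual points closely, while a large bandwidth gives a flatter curve.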


Decision Tree

It is used for classification, prediction, estimation, clustering and visualization.

Decision tree for test statistics is as follows:


Minitab 18 also provides assistance in statistical procedures with the help of decision trees:

Meta-analysis

The following information has been obtained from “Basics of Meta-analysis with Basic Steps in R” written by Usman Zafar Paracha. It is available here: https://amzn.to/31ff3Kj

Meta-analysis is a quantitative and systematic method to combine the results of previous studies to get conclusions. Its steps are as follows:

1. Formulation of research question
2. Searching the required articles and information
   • Working on inclusion and exclusion criteria
   • Utilizing Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) to note the flow of information
3. Collection of abstract data, such as age of participants, sample size, outcomes, etc.
4. Statistical analysis
   • Calculation of effect size (the size of the difference of effect between two groups)
   • Heterogeneity statistics: Q (ratio of observed variation to within-study variance), Tau-squared, and I-squared (proportion of observed variation that can be caused by actual difference between studies)
   • Plots, such as Funnel plot, Baujat plot, L'Abbé plot
   • Development of Forest Plot
   • Combined results: overall effect size / pooled statistical results


Important Statistical Techniques/Procedures used in Medical Research

The following table shows some important statistical techniques or procedures commonly used in medical research. This information has been taken from the paper titled “Statistical trends in the Journal of the American Medical Association and implications for training across the continuum of medical education” by Arnold, Braganza, Salih, & Colditz, published in PLoS ONE in 2013.

Table 5: Important Statistical techniques/procedures used in Medical Research

| Statistical Techniques/Procedures | Comments (about the increased, decreased or static use in the past 3 decades) |
|---|---|
| Pearson correlation coefficient | Its use is declining at a faster rate |
| Mantel-Haenszel | Its use is declining |
| ANOVA | Its use is declining |
| Simple Linear regression | Its use is almost static |
| Fisher Exact | Its use is almost static |
| t-test | Its use is almost static |
| Logistic regression | Its use is almost static |
| Chi-square | Its use is almost static |
| Descriptive Statistics | Used most commonly but its use is declining |
| Morbidity and Mortality | Its use is almost static |
| Low-level Statistical measures | Used most commonly but its use is almost static |
| Transformation | Its use is almost static |
| Multiple comparison | Its use is increasing slowly |
| Epidemiologic statistics | Its use is increasing slowly |
| Poisson Regression | Its use is increasing |
| p-trend | Its use is increasing |
| Log-rank test | Its use is increasing at a faster rate |
| Wilcoxon Rank | Its use is increasing |
| Non-parametric test | Its use is increasing |
| Intention to treat | Its use is increasing |
| Kaplan Meier | Its use is increasing |
| Power | Its use is increasing at a faster rate |
| Cox models | Its use is increasing |
| Multi-level modeling | Its use is increasing at a faster rate |
| Survival analysis | Its use is increasing at a faster rate |
| Multiple regression | Its use is increasing |
| Sensitivity analysis | Its use is increasing |