Applied Engineering Statistics

Second Edition

R. Russell Rhinehart
Robert M. Bethea
Second edition published 2022
by CRC Press, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press, 2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

© 2022 CRC Press

First edition published by CRC Press 1991

CRC Press is an imprint of Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data

Names: Rhinehart, R. Russell, 1946- author. | Bethea, Robert M., author.
Title: Applied engineering statistics / R. Russell Rhinehart, Robert M. Bethea.
Description: Second edition. | Boca Raton : CRC Press, 2022. | Revision of: Applied engineering statistics / Robert M. Bethea, R. Russell Rhinehart. 1991. | Includes bibliographical references and index.
Identifiers: LCCN 2021023519 (print) | LCCN 2021023520 (ebook) | ISBN 9781032119489 (hardback) | ISBN 9781032119496 (paperback) | ISBN 9781003222330 (ebook)
Subjects: LCSH: Engineering--Statistical methods.
Classification: LCC TA340 .B48 2022 (print) | LCC TA340 (ebook) | DDC 519.502/462--dc23
LC record available at https://lccn.loc.gov/2021023519
LC ebook record available at https://lccn.loc.gov/2021023520

ISBN: 9781032119489 (hbk)
ISBN: 9781032119496 (pbk)
ISBN: 9781003222330 (ebk)

DOI: 10.1201/9781003222330

Typeset in Palatino by Deanta Global Publishing Services, Chennai, India
Disclaimer

The reader assumes all responsibility for use of the material in this book. Applications are context-dependent, and yours may have aspects not addressed here. You must choose appropriate tests, methods, etc. Especially if the application is critical, be sure that your choices are correct.
Contents

Disclaimer
Nomenclature
Preface to Second Edition

Section 1 Fundamentals of Probability and Statistics

1. Introduction
  1.1 Introduction
  1.2 Deterministic and Stochastic
  1.3 Treatments, Process, and Outcomes
  1.4 Uses of Statistics
  1.5 Stationarity
  1.6 There are No Absolutes
  1.7 A Caution on Statements of Confidence
  1.8 Correlation is Not Causation
  1.9 Uncertainty and Disparate Metrics
  1.10 Takeaway
  1.11 Exercises
2. Probability
  2.1 Probability
  2.2 Probability Calculations
    2.2.1 A Priori Probability Calculations
      2.2.1.1 Rule 1: Multiplication
      2.2.1.2 Rule 2: Addition
    2.2.2 Conditional Probability Calculations
    2.2.3 Bayes' Belief Calculations
  2.3 Takeaway
  2.4 Exercises
3. Distributions
  3.1 Introduction
  3.2 Definitions
  3.3 Discrete Distributions
    3.3.1 Discrete Uniform Distribution
    3.3.2 Binomial Distribution
    3.3.3 Poisson Distribution
    3.3.4 Negative Binomial Distribution
    3.3.5 Hypergeometric Distribution
    3.3.6 Geometric Distribution
  3.4 Continuous Distributions
    3.4.1 Continuous Uniform Distribution
    3.4.2 Proportion
    3.4.3 Exponential Distribution
    3.4.4 Gamma Distribution
    3.4.5 Normal Distribution
    3.4.6 "Student's" t-Distribution
    3.4.7 Chi-Squared Distribution
    3.4.8 F-Distribution
    3.4.9 Log-Normal Distribution
    3.4.10 Weibull Distribution
  3.5 Experimental Distributions for Continuum-Valued Data
  3.6 Values of Distributions and Inverses
    3.6.1 For Continuum-Valued Variables
    3.6.2 For Discrete-Valued Variables
  3.7 Distribution Properties, Identities, and Excel Cell Functions
    3.7.1 Continuum-Valued Variables
      3.7.1.1 Standard Normal Distribution
      3.7.1.2 t-Distribution
      3.7.1.3 Chi-Squared Distribution
      3.7.1.4 F-Distribution
    3.7.2 Discrete-Valued Variables
      3.7.2.1 Binomial Distribution
      3.7.2.2 Poisson Distribution
  3.8 Propagating Distributions with Variable Transformations
  3.9 Takeaway
  3.10 Exercises
4. Descriptive Statistics
  4.1 Measures of Location (Centrality)
  4.2 Measures of Variability
  4.3 Measures of Patterns in the Data
  4.4 Scaled Measures of Deviations
  4.5 Degrees of Freedom
  4.6 Expectation
  4.7 A Note about Dimensional Consistency
    4.7.1 Average and Central Limit Representations
    4.7.2 Dimensional Consistency in Other Equations (An Aside)
  4.8 Takeaway
  4.9 Exercises
5. Data and Parameter Interval Estimation
  5.1 Interval Estimation
    5.1.1 Continuous Distributions
    5.1.2 Discrete Distributions
  5.2 Distribution Parameter Estimation
    5.2.1 Continuous Distributions
    5.2.2 Discrete Distributions
  5.3 Approximation with the Normal Distribution
  5.4 Empirical Data
    5.4.1 Data Range
    5.4.2 Empirical Distribution Parameter Range
  5.5 Takeaway
  5.6 Exercises
6. Hypothesis Formulation and Testing – Parametric Tests
  6.1 Introduction
    6.1.1 Critical Value Method
    6.1.2 p-Value Assessment
    6.1.3 Probability Ratio Method
    6.1.4 What Distribution to Use?
  6.2 Types of Hypothesis Testing Errors
  6.3 Two-Sided and One-Sided Tests
  6.4 Tests about the Mean
  6.5 Tests on the Difference of Two Means
    6.5.1 Case 1 (σ₁² and σ₂² Known)
    6.5.2 Case 2 (σ₁² and σ₂² Both Unknown but Presumed Equal)
    6.5.3 Case 3 (σ₁² and σ₂² Both Unknown and Presumed Unequal)
    6.5.4 An Interpretation of the Comparison of Means – A One-Sided Test
  6.6 Paired t-Test
  6.7 Tests on a Single Variance
  6.8 Tests Concerning Two Variances
  6.9 Characterizing Experimental Distributions
  6.10 Contingency Tests
  6.11 Testing Proportions
  6.12 Testing Probable Outcomes
  6.13 Takeaway
  6.14 Exercises
7. Nonparametric Hypothesis Tests
  7.1 Introduction
  7.2 The Sign Test
  7.3 Wilcoxon Signed-Rank Test
  7.4 Modification to the Sign and Signed-Rank Tests
  7.5 Runs Test
  7.6 Chi-Squared Goodness-of-Fit Test
  7.7 Kolmogorov-Smirnov Goodness-of-Fit Test
  7.8 Takeaway
  7.9 Exercises
8. Reporting and Propagating Uncertainty in Calculations
  8.1 Introduction
    8.1.1 Applications
    8.1.2 Objectives/Rationale
    8.1.3 Propagation of Uncertainty
    8.1.4 Nomenclature
  8.2 Fundamentals
    8.2.1 What Experimental Variation is and is Not
    8.2.2 Measures of Random Variation
    8.2.3 Sources of Variation
    8.2.4 Data and Process Models
    8.2.5 Explicit and Implicit Models
    8.2.6 Significant Digits
    8.2.7 Estimating Uncertainty on Input Values
    8.2.8 Random and Systematic Error
    8.2.9 Coefficient Error Types
  8.3 Propagation of Uncertainty in Models
    8.3.1 Analytical Method for Maximum Uncertainty
    8.3.2 Analytical Method for Probable Uncertainty
    8.3.3 Numerical Method for Maximum Uncertainty
    8.3.4 Numerical Method for Probable Uncertainty
  8.4 Identifying Key Sources of Uncertainty
  8.5 Bias and Precision
  8.6 Takeaway
  8.7 Exercises
9. Stochastic Simulation
  9.1 Introduction
  9.2 Generating Data That Represents a Distribution
  9.3 Generating Data That Represents Natural Variation over Time
  9.4 Generating Stochastic Models
  9.5 Number of Realizations Needed for Various Statistics
  9.6 Correlated and Conditional Perturbations
  9.7 Takeaway
  9.8 Exercises
Section 2 Choices

10. Choices
  10.1 Introduction
  10.2 Cases
    10.2.1 The Hypothesis
    10.2.2 Truncating or Rounding
    10.2.3 The Conclusion
    10.2.4 Data Collection
    10.2.5 Data Preprocessing
    10.2.6 Data Post Processing
    10.2.7 Choice of Distribution and Test
    10.2.8 Choice of N
    10.2.9 Parametric or Nonparametric Test
    10.2.10 Level of Confidence, Level of Significance, T-I Error, Alpha
    10.2.11 One-Sided or Two-Sided Test
    10.2.12 Choosing the Quantity for Comparison
    10.2.13 Use the Mean or the Probability of an Extreme Value?
    10.2.14 Correlation vs Causation
    10.2.15 Intuition
    10.2.16 A Possible Method to Determine Values of α, β, and N
    10.2.17 The Hypothesis is not the Supposition
    10.2.18 Seek to Reject, Not to Support
  10.3 Takeaway
  10.4 Exercises
Section 3 Applications of Probability and Statistical Fundamentals

11. Risk
  11.1 Introduction
  11.2 Estimating the Financial Penalty
  11.3 Frequency or Probability?
    11.3.1 Estimating Event Probability – Independent Events
    11.3.2 Estimating Event Probability – Common Cause Events
    11.3.3 Intermittent or Continuous Use
    11.3.4 Catastrophic Events
  11.4 Estimating the Penalty from Multiple Possible Events
  11.5 Using Risk in Comparing Treatments
  11.6 Uncertainty
  11.7 Detectability
  11.8 Achieving Zero Risk
  11.9 Takeaway
  11.10 Exercises
12. Analysis of Variance
  12.1 Introduction
  12.2 One-Way ANOVA
    12.2.1 One-Way ANOVA Method
    12.2.2 Alternate Analysis Approaches
    12.2.3 Model for One-Way Analysis of Variance
    12.2.4 Subsampling in One-Way Analysis of Variance
  12.3 Two-Way Analysis of Variance
    12.3.1 Model for Two-Way Analysis of Variance
    12.3.2 Two-Way Analysis of Variance Without Replicates
    12.3.3 Interaction in Two-Way ANOVA
    12.3.4 Two-Way Analysis of Variance with Replicates
  12.4 Takeaway
  12.5 Exercises
13. Correlation
  13.1 Introduction
  13.2 Correlation Between Variables
    13.2.1 Method
    13.2.2 An Illustration
    13.2.3 Determining Confidence in a Correlation
  13.3 Autocorrelation
    13.3.1 Method
    13.3.2 An Autocorrelation Illustration
    13.3.3 Determining Confidence in Autocorrelation
  13.4 Takeaway
  13.5 Exercises
14. Steady State and Transient State Identification in Noisy Processes
  14.1 Introduction
    14.1.1 Approaches and Issues to SSID and TSID
  14.2 A Ratio of Variances Method
    14.2.1 Filter Method
    14.2.2 Choice of Filter Factor Values
    14.2.3 Critical Values
    14.2.4 Illustration
    14.2.5 Discussion of Important Attributes
      14.2.5.1 Distribution Separation
      14.2.5.2 Average Run Length
      14.2.5.3 Balancing ARL, T-I and T-II Errors
      14.2.5.4 Distribution Robustness
      14.2.5.5 Autocorrelation
      14.2.5.6 Signal Discretization
      14.2.5.7 Aberrational Autocorrelation
      14.2.5.8 Multivariable Extension
      14.2.5.9 Cross Correlation
      14.2.5.10 Selection of Variables
      14.2.5.11 Noiseless Data
    14.2.6 Alternate R-Statistic Structure-Array
  14.3 4-Point Method
  14.4 Using SSID as Regression Convergence Criterion
  14.5 Using SSID as Stochastic Optimization Convergence Criterion
  14.6 Takeaway
  14.7 Exercises
15. Linear Regression – Steady-State Models
  15.1 Introduction
  15.2 Simple Linear Regression
    15.2.1 Hypotheses in Simple Linear Regression
    15.2.2 Interval Estimation in Simple Linear Regression
    15.2.3 Inverse Prediction in Simple Linear Regression
    15.2.4 Evaluation of Outliers
    15.2.5 Testing Equality of Slopes
    15.2.6 Regression Through a Point
    15.2.7 Measures of Goodness-of-Fit
  15.3 Multiple Linear Regression
  15.4 Polynomial Regression
    15.4.1 Determining Model Complexity
    15.4.2 Culling Irrelevant Model Functionalities
    15.4.3 Extrapolation of Polynomial Models
  15.5 Functional Linearization of Models with Nonlinear Coefficients
  15.6 Takeaway
  15.7 Exercises
16. Nonlinear Regression – An Introduction
  16.1 Introduction
  16.2 Takeaway
  16.3 Exercises
17. Experimental Replicate Planning and Testing
  17.1 Introduction
  17.2 A Priori Estimation of N
    17.2.1 Classic Estimation of n
    17.2.2 Economic Estimation of n – Method 1
    17.2.3 Economic Estimation of n – Method 2
    17.2.4 Economic Estimation of n – Method 3
  17.3 A Posteriori Estimation of N
  17.4 Takeaway
  17.5 Exercises
18. Experimental Design for Linear Steady-State Models – Screening Designs
  18.1 Introduction
  18.2 Random Ordering of the Experimental Sequence
  18.3 Factorial Experiments
    18.3.1 Constraints
    18.3.2 Missing Data
    18.3.3 Confounding
    18.3.4 Alternate Screening Trial Designs
  18.4 Takeaway
  18.5 Exercises
19. Data-Based Model Validation
  19.1 Introduction
  19.2 Data-Based Evaluation Criteria and Tests
  19.3 Bootstrapping to Estimate Model Uncertainty
  19.4 Test for Variance Expectations
    19.4.1 Troubleshooting Variance Indications
  19.5 Closing Remarks
  19.6 Takeaway
  19.7 Exercises
20. Experimental Design for Data-Based Model Validation
  20.1 Introduction
  20.2 Patterns Desired and Undesired
  20.3 An Experimental Plan
  20.4 Data Sources and Other Modeling Objectives
  20.5 Takeaway
  20.6 Exercises
21. Statistical Process Control
  21.1 SPC Concepts
  21.2 Process Capability
  21.3 Mean and Range Charts
  21.4 Modifications to the X̄ and R Charts
  21.5 CUSUM and RUNSUM Charts
  21.6 Attribute Charts: Nonconforming
  21.7 Attribute Charts: Defects
  21.8 Takeaway
  21.9 Exercises
22. Reliability
  22.1 Introduction
  22.2 Probability Distributions
  22.3 Calculation of Composite Probabilities
    22.3.1 "And" Events
    22.3.2 "Or" Events
    22.3.3 Combinations of Events
    22.3.4 Conditional Events
    22.3.5 Weakest Link
  22.4 Measures of Reliability
    22.4.1 Average Life or Mean Time to Failure
    22.4.2 On-Stream Time
    22.4.3 Monte Carlo Techniques
  22.5 Reliability in Process Design Choices
    22.5.1 Sizing Equipment in Series
    22.5.2 Selecting Redundancy
    22.5.3 Selecting Reliability of Component Parts
  22.6 Takeaway
  22.7 Exercises

Section 4 Case Studies

Case Studies
  Case Study 1 – DJIA and Political Party
    Exercises
  Case Study 2 – PBT Justification for a Change
    Exercises
  Case Study 3 – Central Limit Phenomena and μ and σ
    Exercises
  Case Study 4 – A Corkboard
    Exercises

Appendix: Tables of Critical Values
  Table A.1 Critical Values of r in the Sign Test
  Table A.2 Critical Values of s in the Wilcoxon Matched-Pairs Signed-Rank Test
  Table A.3a Critical Values of u in the Runs Test for Small N = n + m
  Table A.3b Critical Values of u in the Runs Test for Large N
  Table A.4 Critical Values of d in the Kolmogorov–Smirnov Goodness-of-Fit Test

Index
Nomenclature
Latin Letters

Capital Latin letters represent populations. Lowercase Latin letters represent numerical values such as those calculated from a sample taken from the populations.
Lowercase Latin letters i, j, k, l, m, and n represent counting integers.
Lowercase Latin letters x, y, and z typically represent influences to a process or the independent variables in a model; y is also typical of the dependent variable in a model.
Lowercase Latin letters a, b, c, d, … represent model coefficients.
B represents a Belief, or a Condition.
s represents a slope or a sample standard deviation.
Greek Letters

Lowercase Greek letters are used for population parameters (coefficients).
μ represents the population mean.
σ² represents the population variance.
σ represents the population standard deviation.
ν represents the degrees of freedom in an analysis.
α represents the probability of a Type-I error, rejecting a hypothesis that is true, and is termed the level of significance.
β represents the probability of a Type-II error, accepting a hypothesis that is false; 1 − β is termed the power of a test.
f(xᵢ) = pdf(xᵢ) is the point probability distribution function of a discretized variable x. It is the expected frequency that a particular value, xᵢ, will occur. It is dimensionless. Note the subscript on xᵢ, indicating a discrete number, integer, or category.
F(xᵢ) = CDF(xᵢ) is the cumulative distribution function of a discretized variable x. It is the probability that a particular value, or lower value, of xᵢ will happen. It is dimensionless and bounded between 0 and 1. Note the subscript on xᵢ, indicating a discrete number, integer, or category.
f(x) = pdf(x) is the probability distribution function of a continuum-valued variable x. It is the rate at which the cumulative probability increases at a particular value, x. It has dimensions of the reciprocal of the variable x.
F(x) = CDF(x) is the cumulative distribution function of a continuum-valued variable x. It is the probability that a particular value, or lower value, of x will happen. It is dimensionless and bounded between 0 and 1.
P(Event) or P(E) represents the probability of a particular event. p is a particular value of P(Event).
N, n, x, or xᵢ represent the number of items, a count, in a class (group, collection, etc.).
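Two groups of entries above invite a compact restatement: the Type-I/Type-II error probabilities and the pdf/CDF pairs. The following summary is ours, using standard identities (the null-hypothesis symbol H0 anticipates the notation introduced with hypothesis testing in Chapter 6):

```latex
\begin{align*}
\alpha &= P(\text{reject } H_0 \mid H_0 \text{ is true})          && \text{Type-I error (level of significance)}\\
\beta  &= P(\text{fail to reject } H_0 \mid H_0 \text{ is false}) && \text{Type-II error; power} = 1 - \beta\\
F(x)   &= P(X \le x) = \int_{-\infty}^{x} f(u)\,du, \quad f(x) = \frac{dF(x)}{dx} && \text{continuum-valued}\\
F(x_i) &= \sum_{j \le i} f(x_j) && \text{discrete-valued}
\end{align*}
```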
Accents

x̄, a horizontal overbar, represents the average if on a letter, or "not" if over an event.
x̲, an underscore, represents a vector.
x̃, the squiggle (tilde) over accent, represents a modeled value.
x̂, the carat (hat) over accent, represents an estimated value.
Subscripts

i, the subscript of an integer, represents the ith category or value. xᵢ usually refers to the count in the ith category, but often it also refers to the data value when sorted w.r.t. some other variable or its own value.
Acronyms

w.r.t. means "with respect to".
SS = steady-state.
TS = transient state.
DoE = design of experiments.
DoF = degrees of freedom.
Preface to Second Edition

Although measured data and related calculations possess some level of uncertainty, they are the basis of engineering/business decisions. Engineers must account for that uncertainty for their designs to be both safe and effective. Applied scientists must include uncertainty in experimental design and data analysis for valid processing, and in reporting for communicating uncertainty. Decision-makers must include uncertainty in taking appropriate action. Statistics offers us the tools to do so.

The book is about the practicable methods of statistical applications for engineers, scientists, and business folks. In contrast to the mathematical and abstract orientation of many statistics texts, which express the science/math values of researchers, this book focuses on the application and the interpretation of outcomes (as described with concrete examples). However, the book also presents the fundamental mathematical concepts, and provides some supporting derivations, to show the grounding of the methods. The science underlying statistics is important, and this book seeks to reveal that. But of greater importance to applications are the choices an investigator makes in defining a hypothesis relevant to a supposition, selecting appropriate confidence levels, selecting a test procedure, and deciding action from the analysis that is appropriate to the context. These aspects, usually missing in classroom texts, are incorporated here.

The authors each came out of significant engineering practice experience, where we used statistical methods to support legitimacy in decision making. After changing to academic careers, we found ourselves co-teaching the unit operations laboratory course, which required students to use similar statistical techniques as part of their career preparation. We co-authored the first edition of Applied Engineering Statistics to develop a reference text that would be of utility for the students – in both the course and their engineering careers. We continue to sense the need for the explanation of statistical concepts and application methods for the practitioner, and we like the original organization of the book – explain basic concepts, explain commonly applied statistical methods, use examples from industrial practice, and provide case study chapters on advanced applications.

A few things have changed since the original publication:

1) The custom has shifted from critical values and the accept/reject dichotomy to the use of p-values to indicate a degree of confidence.
2) The Big Data and Machine Learning era has introduced a few new techniques for finding associations within data.
3) Computational accessibility has improved the utility of nonlinear regression and stochastic approaches for propagating uncertainty.
4) New examples and exercises take the learner out of the "You are finished after doing a simple word problem" mindset to a more complete analysis of issues and auxiliary aspects of a problem.
5) Statistical tools are widely available and convenient, and this book reveals those in Excel.
6) Rhinehart has a website, www.r3eda.com, that offers Excel/VBA programs to apply some of the procedures.

The book is written for two audiences. One is students in the upper level of an undergraduate program or in a graduate program in engineering, science, or business. The other is those in professional life.
Our intent is that the text would continue to be useful in professional life after graduation and be appropriate as a self-learning tool for those in their professional practice. The book is presented in four sections. The first provides the fundamental math/science concepts related to probability, distributions, their characteristics, and hypothesis testing.
The second section provides guidance related to choices that a person makes in statistical analysis. The third section describes many classical statistical applications (such as analysis of variance, regression, design of experiments, model validation, statistical process control, and reliability). Many worked examples, throughout, reveal how to perform the procedures and interpret the outcomes. The fourth section contains case studies which allow the reader to explore a variety of techniques on examples, contrasting the one-aspect application of examples and typical end-of-chapter exercises.

An instructor (college or continuing education) will find that this book serves as the basis for both elementary and intermediate courses.

Although statistical computations can be performed by hand, they can also be programmed on a computer. With the availability of computers and statistical packages at every level (handheld calculator, PC, mini, mainframe), there is something convenient for your specific needs. We detail the computations in most of the examples so that you can easily follow the procedure by hand, but we also demonstrate the use of Excel spreadsheet and Add-In functions. Many other statistical libraries are available. Some of the applications are supported with open-code software on Rhinehart's website, www.r3eda.com.

Many end-of-chapter exercises are open-ended, asking the reader to make appropriate decisions, not just to apply a recipe to get a numerical value.

We are grateful to Marcel Dekker, Inc. for allowing us to use one example problem and many homework exercises from those included in Statistical Methods for Engineers and Scientists, Second Edition, Revised and Expanded. We are also grateful to the American Cyanamid Company, the Institute of Mathematical Statistics, and the American Statistical Association for allowing us to reprint several of the statistical tables as acknowledged in the Appendix. We are also grateful to the American Society for Testing and Materials for the data in Tables 21.3 and 21.4. As well, we are grateful for all those who have provided essential experiences in fundamentals and practice needed for us to be able to compose this book. Most importantly, we are most appreciative of the support of our families who enabled us to get to this place.

R. Russell Rhinehart
Robert M. Bethea
Section 1
Fundamentals of Probability and Statistics
1 Introduction
1.1 Introduction

In this book, the word statistics is used in two ways. First, it refers to the techniques involved in collecting, analyzing, and drawing conclusions from data – a procedure, or a recipe. The second, more frequently inferred meaning, is that of an estimated value, a number, calculated from either the data or a proposed theory, that is used for comparative purposes in testing a hypothesis (guess, supposition, etc.) about a parameter of a population – a numerical value.

The topics presented in this book have been selected from our experience (and others') to provide you with a set of procedures which are relevant to application work (such as data analysis in engineering, science, and business). Although fundamental concepts are explained and some equations are derived, the focus of this book is on the how-to of statistical applications.

There is a tension between perfection and sufficiency. Perfection seeks the truth, which follows the mathematical science viewpoint. Although perfection provides grounding in statistical analysis methods, it is usually a mathematical analysis that is predicated on many idealizations, making it imperfect. By contrast, sufficiency seeks utility and functional adequacy, a balance of expediency which is also grounded in mathematical fundamentals. Sufficiency is not sloppiness or inaccuracy. It is appropriate liberty with the idealization, grounded in an understanding of the limitations of the idealization and uncertainty in the "givens". Both perfection and sufficiency are important, and perspectives of both are presented in this text. In this "Applied Engineering" text the balance tends toward sufficiency, rather than unrealistic perfection.
1.2 Deterministic and Stochastic

The term deterministic means that there is no uncertainty; there is perfect certainty about a value. Here are some simple examples: What is 3 times 4? Given that the side of a cube is 2.1 cm, what is the surface area? What angle (rounded to three digits) has a tangent value of 0.75? These were very simple calculations, but it is the same with something more complicated, such as: Given a particular heat exchanger and fluid flow rates and associated properties, use the equations in your heat transfer book to calculate the exit temperatures of the fluids. Regardless of the time of day, or location, or the computer type being used, every time the calculation is performed, we get exactly the same answer.
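To make the distinction concrete, here is a minimal sketch of the deterministic examples above (the snippet is ours, in Python, for illustration; the book itself demonstrates tools in Excel). Every run returns exactly the same answers:

```python
import math

# Deterministic calculations: identical inputs always give identical outputs.
print(3 * 4)                        # 12, every time

side = 2.1                          # cm, edge length of the cube
surface_area = 6 * side**2          # a cube has six square faces
print(surface_area)                 # 26.46 cm^2, every time

angle = math.degrees(math.atan(0.75))
print(round(angle, 1))              # 36.9 degrees, every time
```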
The term stochastic means that we get a different answer each time the calculation is performed, or each time the measurement is obtained. Here are some examples: What is the height of the next person you pass on the street? How many grains of sand are in a handful? If the product label indicates that the package contains 40 lbs, what might be the actual weight? If there are the same number of red and green marbles in a box and you draw three, blindfolded, how many green marbles will you have? If you want to compare fertilizer treatments, you will find that the year-to-year variation in weather and insect population, and the location-to-location variation in properties of the earth will cause significant variation in results. Despite the use of deterministic calculations in teaching concepts and in estimating values, the reality about measurements and samples and predictions is that they have variation. Statistics provides techniques for analyzing and making decisions within the uncertainty. Sources of variation include the vagaries of weather, the probability of selecting a particular sample, variation in raw material, mechanical vibration, incomplete fluid mixing, prior stress on a device, new laws and regulations, future prices, and many other aspects.
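A companion sketch of the marble example (again ours, and the box composition of ten marbles of each color is an assumption for illustration) shows the stochastic contrast: the identical procedure, repeated, gives different answers:

```python
import random

# Stochastic calculation: repeating the same procedure gives varying results.
box = ["red"] * 10 + ["green"] * 10    # assumed: equal numbers of each color

for trial in range(5):
    draw = random.sample(box, 3)       # draw three marbles without replacement
    greens = draw.count("green")
    print(f"Trial {trial + 1}: {greens} green marble(s)")
# Typical output: 2, 1, 3, 1, 2 ... and it changes on every run.
```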
1.3 Treatments, Process, and Outcomes

The term treatment refers to the influence on a process. The influence might be how safety training is delivered (video, reading materials, in-person, comically, or seriously). A treatment could be a recipe or procedure to be followed. A treatment could be the type of equipment used (batch or continuous, toaster or microwave). A treatment could be the raw material supplier or the service provider. The treatment might be the operating conditions in manufacturing (flow rates, temperatures, mixing time, etc.).

The process is whatever responds to the treatment. It may be a human response to an office lighting treatment. It may be a mechanical spring-and-weight response to treatment by the ambient temperature. It may be a biological process response to a pH (acidity) treatment.

Outcome refers to the response of the process. It may be the time to recover physical health after an infection in response to the medicine dose. It may be the economic response of the nation due to changes in the prime lending rate. It may be the variation in a quality metric due to a particular treatment. It may be the probability of automobile accidents if the speed limit is changed.

Treatments and outcomes are variously termed influences and responses, causes and effects, inputs and outputs, independent and dependent variables, etc.
1.4 Uses of Statistics

You will use statistics in five ways. One is in the design of experiments or surveys. In this instance, you need the answers to some questions about an event or a process. An effective experiment is one that has been designed so that the answers to your questions will be obtained more often than not. An efficient experiment is one that is unbiased (predicts
the correct value of the parameter) and that also has the smallest variance (scatter about the true value of the population parameter in question). Efficiency also means that the answers will have been obtained with the minimum expenditure of time (yours, an operator's, a technician's, etc.) and other resources.

The second way you will use statistical techniques is with descriptive statistics. This method involves using sample data to make an inference about the population. The population is the entire or complete set of possible values, attributes, etc. that are common to, describe, or are characteristic of a given experiment or event. A sample is a subset of that data. Descriptive statistics are used for describing and summarizing experimental, production, reliability, and other types of data. The description can take many forms. The average, median, and mode are all measures of centrality. Variance, standard deviation, and probable range are all measures of variation. The descriptor may be a probability, which refers to the chance an event might happen (such as getting three or more successes in five coin flips) or the chance that a value might exceed some threshold (the probability of seeing someone taller than 6 ft 8 in. on your next shopping trip). It is essential that your samples are random samples if you are to have any reasonable expectation of obtaining reliable answers to your questions. To obtain a random sample, you must first define, not just describe, the population under consideration. Then you can use the principles of random selection of population values or experimental conditions to obtain the random sample that is essential to statistical inference.

A third statistical use is estimating the uncertainty of a value, estimating the possible range of values it might have. The value might be an average from a sample, and the question is what range of population means could have generated that sample average. The value might be a predicted outcome from a model when all model coefficient values and influences are not known with certainty.

A fourth use of statistics is in the testing of hypotheses. A hypothesis about any event, process, or variable relationship is a statement of anticipated behavior under specified conditions. Hypotheses are tested by determining whether the hypothesized results reasonably agree with the observed data. If they do, the hypothesis is likely to be valid. Otherwise, the hypothesis is likely to be false. Hypotheses could be relatively complex, such as the model matching the data, the design being reliable, or the process being at steady-state.

The fifth use of methods in this book is to obtain quantitative relationships between variables by use of sample data. This aspect of statistics is loosely called "curve fitting" but is more properly termed regression analysis. We will use the method of least squares for regression because that technique provides a conventional way to estimate the "best fit" of the data to the hypothetical relationship.
1.5 Stationarity

In statistics, a stationary process does not change in mean (average) or variance (variability). It is steady, but any measurement is subject to random variation. The value of the data perturbation changes from sample to sample, but the distribution of the perturbations does not change. This is in contrast to classic deterministic analysis of transient and steady-state processes. A steady process flatlines in time. The measurement achieves a particular value
and remains at that value. When the process is in a transient state the average or mean changes in time. In statistics the term stationary means that the steady-state process will not deterministically flatline. Instead, the data will be continually fluctuating about a fixed value (mean) with the same variance. In statistics, a stationary process is not in a transient state.
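To make the distinction concrete, here is a minimal Python sketch (our illustration, not from the text; the mean value, drift rate, and noise level are arbitrary choices) that generates data from a stationary process and from a nonstationary one whose mean drifts in time:

    import random

    random.seed(1)

    # Stationary: data fluctuate about a fixed mean with a fixed variance.
    stationary = [10.0 + random.gauss(0.0, 0.5) for t in range(100)]

    # Nonstationary (transient): the mean drifts in time; the noise is unchanged.
    drifting = [10.0 + 0.05 * t + random.gauss(0.0, 0.5) for t in range(100)]

    # The averages of the two halves agree for the stationary series only.
    print(sum(stationary[:50]) / 50, sum(stationary[50:]) / 50)
    print(sum(drifting[:50]) / 50, sum(drifting[50:]) / 50)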
1.6 There are No Absolutes

You should have noticed by now that we have repeatedly used the words "probably" and "likely" in the first part of this chapter. Our use was intentional, as there are NO absolutes in statistics except the following: There are NO absolutes. Every conclusion you reach and everything you say as a result of statistical examination of data is subject to error. Instead of saying, "The relationship between production rate and employee training is …," you must say, "The relationship between production rate and employee training probably is …."

The reason for our statement concerning absolutes is simple. The data from which the estimators were obtained were subject to error. The exact value of any estimator is uncertain, and these uncertainties carry forward every time the estimators are used. This book is devoted to helping you learn how to make qualitative and quantitative statements in the face of the uncertainties.
1.7 A Caution on Statements of Confidence

Level of confidence is a measure of how probable your statistical conclusion is. As an example, after testing raw materials A and B for their influence on product purity, you might be 95% confident that A leads to higher purity. But you cannot extend this result to report that you are 95% sure that using raw material A is the better business decision. You have only tested product purity. You have not evaluated product variability, other product characteristics, manufacturing costs, process safety implications, etc. You can only be 95% confident in your evaluation of purity. Be careful that you do not project statistical confidence about one aspect onto your interpretation of the appropriate business action.
1.8 Correlation is Not Causation

Statistics does not prove that some event or value caused some other response. Causation refers to a cause-and-effect mechanism. Correlation means that there is a strong relationship between two variables, or observations. As an example, there is a strong correlation between people awakening and the sun rising, but one cannot claim that people awakening causes the sun to rise. The cause-and-effect mechanism for this observed correlation is more akin to the opposite. As another example,
there is a strong correlation between gray hair and wrinkles, but that does not mean that gray hair causes wrinkles. The mechanism is that another variable, age, causes both observations. So, more so than just tempering claims about confidence in taking action from testing a single aspect, be careful not to let indications of correlation dupe you into claiming causation. If you have an opinion as to the cause-and-effect mechanism, and you have correlation that supports it, before you claim it is the truth, perform experiments and seek data that could reject your hypothesized mechanism. State exactly, mechanistically how the treatment leads to the outcome expectations. State what else you expect should be observed, and what should not be observed. State when and where these should be observed. Do the experiments to see if your hypothesized theory is true.
1.9 Uncertainty and Disparate Metrics

Traditionally, statistics deals with the probable outcomes from a distribution. This book is grounded in that mathematical science, and many examples reveal how to describe the likelihood of some extreme value. But more than this, the basis (the "givens") in any particular application have uncertainty, which is unlike the basis of givens in a schoolbook example. In the real world, to make decisions based on the statistical analysis, the impact of uncertainty needs to be considered. Further, concerns over possible negative choices might not just be about monetary shortfalls. They may be related to disparate issues such as reputation. This book includes a chapter on propagation of uncertainty, another on stochastic simulation, and frequent discussions on Equal-Concern approaches for combining disparate metrics.
1.10 Takeaway

Statistical statements should be tempered with a declaration related to probability (likelihood, or confidence). Statements about something should only relate to the test data, not an interpretation or extrapolated action. Do not confuse correlation with causation.
1.11 Exercises
1. List several statements that you have recently heard or read and estimate the certainty or qualifications that should be associated with each. Here are example statements: "Men are taller than women." "I floss each day." "Net wt. 18 oz."

2. List several deterministic and several stochastic processes.
3. List several examples of processes – meaning some influence leads to an outcome. These could be computer procedures, human recipes, physical devices, or human social responses.

4. Sketch data that would come over time from a stationary stochastic process. Sketch another with a change in mean, and another with a change in variance.
2 Probability
2.1 Probability

An event is a particular outcome of a trial, test, experiment, or process. It is a particular category for the outcome. You define that category. The outcome category could be dichotomous, meaning either one thing or another. In flipping a coin, the outcome is either a Head (H) or a Tail (T). In flipping an electric light switch to the "on" position, the result is either the light lights or it does not. In passing people on a walk, they either return the smile or do not. These events are mutually exclusive, meaning if one happens the other cannot. You could define the event as a H, or as a T; as the light working, or the light not working.

Alternately, there could be any number of mutually exclusive events. If the outcome is one event, one possible outcome from all possible discrete outcomes, then it cannot be any other. The event of randomly sampling the alphabet could result in 26 possible outcomes. But if the event is defined as finding the letter "T", this success excludes finding any of the other 25 letters.

By contrast, the outcome may be a continuum-valued variable, such as temperature, and the event might be defined as sampling a temperature with a value above 85°F. A temperature of 84.9°F would not count as the event. A temperature of 85.1°F would count as the event. For continuum-valued variables, do not define an event as a particular value. If the event is defined as sampling a temperature value of exactly 85°F, then 84.9999999°F would not count as the event. Nor would 85.00000001°F count as the event. Mathematically, since a point has no width, getting an exact numerical value is impossible. So, for continuum-valued outcomes, define an event as being greater than or less than a particular value, or as being between two values.

One definition of probability is the ratio of the number of particular occurrences of an event to the number of all possible occurrences of mutually exclusive events. This classical definition of probability requires that the total number of independent trials of the experiment be infinite. This definition is often not as useful as the relative-frequency definition. That interpretation of probability requires only that the experiment be repeated a finite number of times, n. Then, if an event E occurs nE times out of n trials and if the ratio nE/n tends to stabilize at some constant as n becomes larger, the probability of E is denoted as:
P(E) = lim_{n→∞} nE/n    (2.1)
The probability is a number between 0 and 1 and inclusive of the extremes 0 and 1: 0 ≤ P(E) ≤ 1.
Ideally, Equation (2.1) specifies the infinity of data, but certainly reasonable values for an estimate P̂(E) = nE/ntotal can be determined from limited data. The complement of an event is not-the-event, indicated by Ē. Here the overbar does not mean average, it means "not".
P(Ē) = nĒ/n = (n − nE)/n = 1 − nE/n = 1 − P(E)    (2.2)
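As a minimal illustration (ours, not from the text) of Equations (2.1) and (2.2), the following Python sketch estimates P(E) for the event "a fair die shows a 4" from a finite number of trials; the trial count is an arbitrary choice:

    import random

    random.seed(0)
    n = 10_000
    n_E = sum(1 for _ in range(n) if random.randint(1, 6) == 4)  # count of event E

    p_hat = n_E / n        # finite-n estimate of P(E), Equation (2.1)
    p_not_E = 1 - p_hat    # estimate of P(not E), Equation (2.2)
    print(p_hat, p_not_E)  # near 1/6 = 0.1667 and 5/6 = 0.8333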
The value of the event probability can be either derived from fundamental principles, representing an infinite number of experiments, or calculated from a finite number of experiments. These would be called theoretical or experimental values, respectively.

The trials could be of an end-of-process, or batch, nature. For example, during the process of flipping a coin, the coin alternates heads-up and tails-up while in the arc. It may hit the ground and bounce up still spinning. Then it finally settles, and only after it stops do we count H or T. Alternately, even if the process is in a transient state, one can calculate the probability of an event after a certain point in time. For example, a company might have two versions of a new product and want to know which is more robust. So, they engage 10,000 people to be consumer testers. Half of the testers have one version, and half have the other. They are divided into groups representing equivalent demographics. After each month into a two-year trial period, testers return their product for evaluation, then get it back and continue their use of it. The company may be tracking the point in time that some feature of the product fails. Version A might have 6% failures (p = 0.06) after 12 months, and version B might have 11% (p = 0.11). In such in-process testing, the probability is not end-of-process, but at a specified time duration.

Once event probabilities are known, we often want to know the probability of composite or compound events. Two types of probabilities are of primary interest to engineers. The first is a priori probability, in which we assume that the composite/compound event will happen and that its probability can be predicted or calculated prior to the experiment. The second is a posteriori, or conditional, probability, which is calculated after some evidence is obtained.
2.2 Probability Calculations

2.2.1 A Priori Probability Calculations

Let us consider that E1 and E2 are two user-specified events (results) of outcomes of an experiment. Here are some definitions:

If E1 and E2 are the only possible outcomes of the experiment, then the collection of events E1 and E2 is said to be exhaustive. For instance, if E1 is that the product meets specifications and E2 is that the product does not meet specifications, then the collection E1 and E2 represents all possible outcomes and is exhaustive.

The events E1 and E2 are mutually exclusive if the occurrence of one event precludes the occurrence of the other event. For example, again, if E1 is that the product meets specifications and E2 is that the product does not meet specifications, then E1 precludes E2; they are mutually exclusive; if the outcome is one, then it cannot be the other.
Event E1 is independent of event E2 if the probability of occurrence of E1 is not affected by E2 and vice versa. For example, flip a coin and roll a die. The coin flip event of being a Head is independent of the number that the die roll reveals. As another example, E1 might be that the product meets specifications, and E2 might be that fewer than two employees called in sick. These are independent.

The composite event "E1 and E2" means that both events occur. For example, you flipped a H and rolled a 3. If the events are mutually exclusive, then the probability that both can occur is zero.

The composite event "E1 or E2" means that at least one of events E1 and E2 occurs. When you flipped and rolled, a H and/or a 3 were the outcomes. This situation allows both E1 and E2 to occur but does not require that result, as does the "E1 and E2" case.

There could be any number of user-specified events, E1, E2, E3, …, En. Two rules govern the calculation of a priori probabilities.

2.2.1.1 Rule 1: Multiplication

If E1, E2, E3, …, En are independent and not mutually exclusive events having probabilities P(E1), P(E2), P(E3), …, P(En), respectively, then the probability of occurrence of E1 and E2 and E3 and … En is:
P(E1 & E2 & E3 & … & En) = [P(E1)][P(E2)][P(E3)]…[P(En)] = ∏_{i=1}^{n} P(Ei)    (2.3)
The probability of occurrence of all n independent events is the product of all their individual probabilities.

2.2.1.2 Rule 2: Addition

If E1, E2, and E3 are independent and not mutually exclusive events with individual probabilities P(E1), P(E2), and P(E3), then the probability of at least one (possibly two or more) of these events is:
P(E1 or E2 or E3) = P(E1) + P(E2) + P(E3) − P(E1E2) − P(E1E3) − P(E2E3) + P(E1E2E3)    (2.4)
This can be more conveniently calculated by considering the not case:
P(E1 or E2 or E3) = 1 − P(Ē1 & Ē2 & Ē3)    (2.5)
Here, the overbar on E means "not the event". Assume that Ē1, Ē2, etc. are independent and not mutually exclusive; using the multiplication rule for the "&" conjunction:
P(Ē1 & Ē2 & Ē3) = [P(Ē1)][P(Ē2)][P(Ē3)]    (2.6)
Then, converting the not-event back to the event:
P(E1 or E2 or E3) = 1 − [1 − P(E1)][1 − P(E2)][1 − P(E3)]    (2.7)
When Equation (2.7) is algebraically expanded it is the same as Equation (2.4). In general, for n independent and not mutually exclusive events:

P(E1 or E2 or … or En) = 1 − ∏_{i=1}^{n} [1 − P(Ei)]    (2.8)
If the events are mutually exclusive, then the addition rule becomes:

P(E1 or E2 or … or En) = ∑_{i=1}^{n} P(Ei)    (2.9)
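As a minimal sketch (ours, not from the text; the helper names are our own), Equations (2.3), (2.8), and (2.9) translate directly into Python:

    def p_and(probs):
        # Equation (2.3): all of the independent events occur.
        result = 1.0
        for p in probs:
            result *= p
        return result

    def p_or_independent(probs):
        # Equation (2.8): at least one of the independent events occurs.
        result = 1.0
        for p in probs:
            result *= 1.0 - p
        return 1.0 - result

    def p_or_exclusive(probs):
        # Equation (2.9): mutually exclusive events.
        return sum(probs)

    print(p_and([0.5, 1 / 6]))             # 0.0833..., as in Example 2.1(a) below
    print(p_or_independent([0.5, 1 / 6]))  # 0.5833..., as in Example 2.1(b) below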
Also, if the events are not mutually exclusive, but the probabilities are small enough so that the product terms in Equation (2.4) are negligible, then Equation (2.9) is a reasonable approximation. However, if there are many terms in the "OR" conjunction, even if each product is small, the sum of probabilities might add to >1. Equation (2.8) is not prone to such a possibility.

Example 2.1: A person is flipping a fair coin and rolling a fair six-sided die. The desired outcome is a T for the coin (event E1) and a 4 on the die (event E2). a) What is the probability of E1 and E2? b) What is the probability of E1 or E2?

The two outcomes are independent, and not mutually exclusive. Conceptually, the probability of either a H or T is 0.5, and the probability of any particular value on the die is 1/6. For Part (a) use Equation (2.3):
P(E1 & E2) = ∏_{i=1}^{n=2} P(Ei) = [P(E1)][P(E2)] = (0.5)(1/6) = 0.08333
For Part (b) use Equation (2.8):
P(E1 or E2) = 1 − ∏_{i=1}^{n=2} [1 − P(Ei)] = 1 − (1 − 0.5)(1 − 1/6) = 0.58333
There is about an 8% chance that both trials will be successes, and about a 58% chance that at least one of the two will be a success.

Example 2.2: A person is flipping a fair coin and rolling a fair six-sided die. The desired outcome is a H for the coin (event E1) and a 4 or lower number on the die (event E2). a) What is the probability of E1 and E2? b) What is the probability of E1 or E2?

The two outcomes are independent, and not mutually exclusive. Conceptually, the probability of either a H or T is 0.5, and the probability of any particular value on the die is 1/6. The die roll event is a composite event. Either a 1, or a 2, or a 3, or a 4 will represent a success. These outcomes are mutually exclusive. If, for example, the roll shows a 2, it cannot be a 1, 3, or 4. Use Equation (2.9) to calculate the probability of Event 2.
P(E2) = P(1 or 2 or 3 or 4) = ∑_{i=1}^{n} P(Ei) = 1/6 + 1/6 + 1/6 + 1/6 = 0.6666
For Part (a) use Equation (2.3).
P(E1 & E2) = ∏_{i=1}^{n=2} P(Ei) = [P(E1)][P(E2)] = 0.5(0.6666) = 0.3333
For Part (b) use Equation (2.8).
P(E1 or E2) = 1 − ∏_{i=1}^{n=2} [1 − P(Ei)] = 1 − (1 − 0.5)(1 − 0.6666) = 0.8333
There is about a 33% chance that both trials will be successes, and about an 83% chance that at least one of the two will be a success.

Example 2.3: The results of the rigorous examination of a dozen automobile tires showed the following: One tire was perfect; three had only slight flaws in appearance; two had incompletely formed treads; one had a serious structural defect; and the rest had at least two of these defects. (a) What is the probability that the next set of four tires you buy of this particular brand will be perfect? (b) That they will have at most only undesirable appearances? (c) That they will have less than two defects?

The solution begins with the assumption that the population is adequately represented by the sample of 12 tires.
The sample results, displayed fanwise, help you visualize the probabilities of the events Ei. The probability of each event is P(Ei) = ni/n, where n = 12. The multiplication rule is appropriate for the solution to Part (a) because we want the probability that all tires in the next set will be perfect, i.e., the first one and the second one and the third one and the fourth one:
P(4 perfect tires in next set) = [P(perfect)]^4 = (1/12)^4 = 4.82253 × 10^−5 ≈ 4.82 × 10^−5

For Part (b) the probability of four tires having, at most, undesirable appearance can be obtained from any combination of perfect tires and those with only appearance flaws. Since the sum of the probabilities in any of those cases is 4/12 and the answer to this question again requires the multiplication rule:
P(4 tires in next set have, at most, undesirable appearance) = (4/12)^4 = 0.01234568 ≈ 0.0123

For Part (c), the probability of any tire having less than two defects is [1 − P(any tire having at least two defects)] = 1 − 5/12 = 7/12, so for the next set of four tires P = (7/12)^4 = 0.1158, or about 12%.
2.2.3 Bayes' Belief Calculations

…

Truth                      Test claims B > $1,030    Test claims B ≤ $1,030
B really is better         0.976                     0.024
B really is not better     0.198                     0.802

B after second sequential accept-B result = (0.887)(0.976) / [(0.887)(0.976) + (1 − 0.887)(0.198)] = 0.975
Maybe 97.5% is sure enough to switch to B. But perhaps you want to be 99% sure. So, do another trial.
B after third sequential accept-B result = (0.975)(0.992) / [(0.975)(0.992) + (1 − 0.975)(0.149)] = 0.996
If, however, the third trial indicated that the average B is worse than the threshold, then:
B if third result is B < threshold = (0.975)(0.00766) / [(0.975)(0.00766) + (1 − 0.975)(0.851)] = 0.258
which would undermine the growing belief.

Note: After each trial, you have better information about the variability of B and what its average might be. Additionally, if the average of the trial values of the benefits of B are being used, then the standard deviation of the average would be $50/√n, which would also change the probabilities in the table. So, after each trial the probabilities in the table should be updated.
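The sequential updating shown above is one line of arithmetic per trial. Here is a minimal Python sketch (ours, not from the text; the function name is our own) with the probabilities taken from the table:

    def update_belief(prior, p_result_if_true, p_result_if_false):
        # Bayes' rule: revised belief that the supposition is true,
        # given one new test result.
        numerator = prior * p_result_if_true
        return numerator / (numerator + (1.0 - prior) * p_result_if_false)

    b = 0.887                           # belief after the first accept-B result
    b = update_belief(b, 0.976, 0.198)  # a second accept-B result
    print(round(b, 3))                  # 0.975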
TABLE 2.3
Generic Probabilities of Correct Diagnosis and False Positive and Negative

Truth                          Test claims it is true       Test claims it is false
Something really is true       p(true outcome, if true)     p(false outcome, if true)
It really is false (not true)  p(true outcome, if false)    p(false outcome, if false)
In general, the table of probabilities is as shown in Table 2.3. Often the probabilities in the four categories can be determined from extensive and controlled testing on known situations. But often, they can be reasonably estimated from experience. In either case do not think that the probabilities are perfectly known. They have errors. But in our experience the uncertainty on the reasonable values does not undermine the propagation of the belief. Alternate values might lead to needing one more or one less test to provide adequate confidence to make a decision.

What is adequate confidence to take action? If B = 0.99 then you are very certain that the supposition is true. If B = 0.01 then you are equally confident that it was untrue. But to take action be sure that the consequences of a wrong decision, tempered by the probability of a wrong decision, are acceptable. Risk is the probability of an event times the penalty for undesirable consequences. Benefit is the probability of an event times the value of the desirable outcome. Set the threshold of the belief to take action by the consequences of taking a right or wrong action (accepting A when it should be B, taking B when it should be A, etc.). Reasonable threshold values are in the 0.05 (and 0.95) to 0.001 (and 0.999) range. If the impact of the consequences of a decision is much greater than the cost of the trials, then use the more extreme thresholds. If the cost of trials is relatively high compared to the impact of a decision, then use the less extreme thresholds. The 0.05 value is the commonly accepted threshold for statistical testing.

Many accept this as a very good guide to updating belief with sequential results, and then using the belief to make decisions. Alternately, although purists accept the mathematical model, many object to the method because of the uncertainty on the probability values and the initial belief.

Of course, humans might not want to follow this kind of logical rule. Often, when they know something is true, they consider themselves to be absolutely sure, and rather than admit they were wrong, they reject any data that would counter their personal belief. Or when they want something to be true, they reject any opposing data. Here are some examples: "Win or lose, our sports fans are kind and gracious, but our opponent's fans are disrespectful poor sports." "My kid is the best looking, smartest, and most athletic in the entire class." "Reel mowers are better than rotary mowers."

If your boss or significant other knows the truth, or wants a particular outcome, and the resulting action from an erroneous belief has fewer adverse consequences than the personal cost and effort of proving that person wrong, it might be best to let it go their way. Only martyrs let logic lead to a confrontation against authority. On the other hand, we hope that each of us protects ourselves, our organizations, and society from action based on erroneous beliefs.
2.3 Takeaway

Composite event probability must be within 0 and 1, inclusive. If you are creating your own application, extrapolate to very many or very few events to check that the way you are calculating composite probability does not exceed rational limits.
2.4 Exercises

1. In flipping a fair coin twice, what is the probability of a) getting two Heads, b) getting two Tails, c) getting a Head on the first flip and a Tail on the second, d) not getting any Heads?

2. At a particular summer camp, the probability of getting a case of poison ivy is 0.15 and the probability of getting sunburn is 0.45. What is the probability of a) neither, b) both, c) only sunburn, d) only poison ivy?

3. After rolling three fair six-sided dice, what is the probability of a) getting three ones showing, b) having only one four showing, c) getting a one and a two and a three?

4. If the probability of rain tomorrow is 70% and rain the next day is 50%, then 0% for the next five days, what is the probability of rain a) on both of the next two days, b) on all of the next seven days, c) at least once this week?

5. There are two safety systems on a process. If an over-pressure event happens in the process, the first safety override should quench the source, and if that is not adequate the back-up system should release excess gas to a vent system. Normal control of the process is generally adequate, only permitting an average of about ten over-pressure events per year. The quench system, we are told, has a 95% probability of working adequately when needed, and the back-up vent has a 98% probability of working as needed. What is the probability of an undesired event (the over-pressure happens, and it is not contained by either safety system) in a) the next one-year period, and b) the next ten-year period?

6. There is a belief that Treatment B is better than the current Treatment A in use. The belief is a modest 75%, B = 0.75. If B is equivalent to A, not better, then there is a 50/50 chance that the trial outcome will indicate either B is better or worse. However, if B is better, then the chance that it will appear better in the trial is 80%. What is the new belief after the trial if a) the trial indicates B is better, b) the trial indicates B is not better, and c) how many trials of sequential successes are needed to raise the belief that B is the right choice to 99%?

7. A restaurant buys thousands of jalapeno peppers per day, of which 5% are not spicy-hot. They use five peppers in each small batch of salsa. If two (or more) of the five are not hot, customers are likely to complain that the salsa is not adequate. What is the probability of making an inadequate batch of salsa? Quantify how larger batch sizes will change the probability.
3 Distributions
3.1 Introduction

Most statistical methods are based on theoretical distributions, described by parameters (such as mean and variance), which can be good approximations to the distribution of experimental data. The mathematical models of the distribution define its shape, but the parameter values define location and variation. Parameters, then, are descriptors of a theoretical distribution. Statistical procedures that state conclusions about such parameters using sample data are parametric statistics. Statistical procedures that state conclusions about populations for which the theoretical distribution has not been assumed are termed nonparametric statistics. Before you can effectively utilize such methods, you must be able to choose the distribution that best matches the data. This chapter presents key theoretical distributions and the characteristics of the populations involved. It also shows how to describe the distribution of experimental data.
3.2 Definitions

Measurement: A numerical value indicating the extent, intensity, or measure of a characteristic of an object.

Data: Either singular as a single measurement (such as a y-value) or plural as a set of measurements (such as all the y-values). Data could refer to an input-output pair (x, y) or the set of (x, y) pairs.

Observation: A recording of information on some characteristic of an object. Usually a paired set of measurements.

Sample: 1) A subset of possible results of a process that generates data. 2) A single observation.

Sample size: The number of observations, datasets, in the sample.

Population: All of the possible data from an event or process – usually n = ∞.

Random disturbance: Small influences on a process that are neither correlated to other variables nor correlated to their own prior values.
Random variable: A variable or function with values that are affected by many independent and random disturbances despite efforts to prevent such occurrences.

Discrete variable: A variable that can assume only isolated values, that is, values in a finite or countably infinite set. It may be the counting numbers, or it may be the digital display values of truncated data.

Continuum variable: A variable that can assume any value between two distinct numbers.

Frequency: The fraction of the number of observations within a specified range of numerical values relative to the total number of observations.

Cumulative frequency: The sum of the frequencies of all values less than or equal to a particular value.

Mean: A measure of location that provides information regarding the central value or point about which all members of the random variable X are distributed. The mean of any distribution is a parameter denoted by the Greek letter μ.

Variance: A parameter that measures the variability of individual population values xi about the population mean μ. The population variance is indicated by σ².

Standard deviation: σ is the positive square root of the variance.

Empirical Distributions: These are obtained from a sampling of the population data. As a result, the models or the parameter values that best fit a model to the data (such as μ and σ) may not exactly match those of the population.

Theoretical Distributions: These are obtained by derivation from concepts about the population. If the concepts are true, then the models and corresponding parameter values represent the population. But nature is not required to comply with human mental constructs.

Category (classification): The name of a grouping of like data, influences, events such as heads, defectives, zero-crossings, integers, negative numbers, green, etc.
3.3 Discrete Distributions

There are two classes of distributions: Discrete and continuous. Discrete distributions are used to describe data that can have only discrete values. Such data have a specific probability associated with each value of the random variable. There are distinct and measurable step changes associated with each value of the variable. Some examples of discrete variables are the size of the last raise you received (it was not in fractions of a cent), the score of the last sporting event you watched, the number of personal protective equipment items available to you on your job, the number of first-quality computer chips on a silicon wafer, the number of defects in a skein of yarn, the energy of electrons in a particular quantum state, the number of raindrops that fall onto a square inch of land, etc.

The variable xi represents the count of events in the ith category. The categories are mutually exclusive, such as alphabet letters, or pass/fail. The value of xi is an integer number. Looking at this paragraph, if i = 1 represents the occurrence of the letter "a" and i = 2 that of the letter "b", then the value of x1 = 18 and x2 = 4.
Probability density functions, pdf(xi) or simply f(xi), are associated with distributions of discrete variables; they represent the probability of possible values of the ith data category. For example, if you flip a coin you expect k = 2, two outcomes, Head and Tail, or 0 and 1. If the first classification is x1 = Head, then f(x1) = 0.5. All such probability functions have the following properties:

1. xi are the discrete possible values of a variable X, and xi is the ith of the k finite values of the outcome. Usually, the index i places the xi values in ascending order.
2. The probability functions are mathematical models of the population, of the infinity of possible samples, not of a finite sample of k number of values.
3. f(xi) is the frequency, the probability of occurrence that a value xi will occur. It is positive and real for each xi: f(xi) = lim_{n→∞} {ni/n}.
4. ∑_{i=1}^{k} f(xi) = 1, where k is the number of categories.
5. P(E) = ∑ f(xi), where the sum includes all xi in the event E.
These definitions illustrate the notation we use throughout this book. We use capital Latin letters for populations and lowercase Latin letters for particular numerical observation values from the populations. Lowercase Greek letters are used for population parameters. Point and cumulative distributions are identified by f and F (or alternately CDF), respectively. P stands for "probability of …".

We are using the conventional notation for discrete distributions: x in the summations of the cumulative distribution functions of discrete distributions sometimes represents the number of items in a class (group, collection, etc.) or at other times, x represents the numerical value that quantifies the class. By using this notation, the formulas in this book are consistent with those you may find in other statistics books. We state this as a warning, because in conventional notation for variables, x means the value of the variable as opposed to the number of occurrences in a category.

The cumulative distribution function (CDF) is a function F(xr) obtained from the probability function and is defined for the values xi of the random variable X by
CDF(xr) = F(xr) = P(X ≤ xr) =
    0,                     for X < x1
    ∑_{i=1}^{r} f(xi),     for x1 ≤ X < xn
    1,                     for X ≥ xn        (3.1)

where xi is an ordered set x1 < x2 < … < xn−1 < xn. The cumulative distribution function is a nondecreasing function with the following properties:

1. 0 ≤ F(x) ≤ 1.
2. lim_{x→−∞} F(x) = 0.
3. lim_{x→∞} F(x) = 1.

As a result, the probability that X will lie between two values xi and xj can be found from the difference of the cumulative probabilities at xi and xj, or
P(xi < X ≤ xj) = P(X ≤ xj) − P(X ≤ xi) = F(xj) − F(xi)    (3.2)
The mean and variance of a discrete distribution are calculated from

μ = ∑_{i=1}^{∞} xi f(xi)    (3.3)

σ² = ∑_{i=1}^{∞} (xi − μ)² f(xi)    (3.4)
where i is the index representing each of the possible xi values, and f(xi) is the weighting factor for the ith xi value, for all possible i-values.

Note: The terms CDF and F are used interchangeably. And often, so are pdf(x) and f(x).

3.3.1 Discrete Uniform Distribution

When each discrete event has the same likelihood (probability) of occurring, the probability function is given by
f(xi) = 1/n,  1 ≤ i ≤ n    (3.5)
where n is the number of discrete values for x. For the cumulative discrete distribution function,
F(xi) = P(X ≤ xi) = i/n    (3.6)
where x1 < x2 < x3 < … < xn. A classic example is that of rolling a cubical die. The n = 6 categories of possible outcomes are equally probable.

The X in Equation (3.6) may represent either a dimensionless counting number (7 bolts), a category (3 Heads), or a dimensional real number (last raise was $437.25/month); however, X must be limited to a finite number, n, of discrete values. For the raise example, the discrete values are multiples of 1¢/month. If the maximum possible raise could have been $600.00/month, then n = 600.00/0.01 + 1 = 60,001 (we cannot exclude the zero-raise event). Consequently, x10,000 represents the 10,000th value of X, which is $99.99/month.

Figure 3.1 illustrates the discrete uniform distribution for n = 5, and the corresponding cumulative discrete uniform distribution, also for n = 5. Recognizing that each xi value has the same probability, or frequency of occurring, f(xi) = f(xj), the mean and variance of the discrete uniform distribution are
μ = (1/n) ∑_{i=1}^{n} xi    (3.7)

σ² = (1/n) ∑_{i=1}^{n} (xi − μ)²    (3.8)
If the xi values are also equally incremented between x1 = a and xn = b, so that x_{j+1} − x_j = Δx = (b − a)/(n − 1) (such as with a die which has sides with values of 1, 2, 3, …, 6, where a = 1 and b = 6) then
FIGURE 3.1 Discrete uniform distribution: (a) point, (b) cumulative, both for n = 5.
μ = (a + b)/2    (3.7a)

σ² = [(b − a + 1)² − 1]/12    (3.8a)
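As a quick numerical check (our sketch, not from the text), Equations (3.7) and (3.8) and the shortcuts (3.7a) and (3.8a) agree for a fair six-sided die:

    faces = [1, 2, 3, 4, 5, 6]
    n = len(faces)

    mu = sum(faces) / n                          # Equation (3.7): 3.5
    var = sum((x - mu) ** 2 for x in faces) / n  # Equation (3.8): 2.9166...

    a, b = 1, 6
    mu_a = (a + b) / 2                           # Equation (3.7a): 3.5
    var_a = ((b - a + 1) ** 2 - 1) / 12          # Equation (3.8a): 2.9166...
    print(mu, var, mu_a, var_a)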
Note: In Equations (3.7a) and (3.8a) the μ and σ2 values are calculated from a theoretical model with a and b known, as opposed to those that would be calculated from observed sample values. The variable X may be dimensionless, or it could have any units. The parameters µ and σ have the same units as X. The probabilities P and f are dimensionless. The variable X will have k discrete values that need not be uniformly spaced. For instance, if a family has five children, whose heights are 4 ft 2 in., 4 ft 9 in., 5 ft 1 in., 5 ft 5 in., and 5 ft 7 in., we can let Xi be the height of the ith child. Then,
X1 = 4 ft 2 in.
X2 = 4 ft 9 in.
X3 = 5 ft 1 in.
X4 = 5 ft 5 in.
X5 = 5 ft 7 in.
If the children are chosen randomly, the probability of selecting a child with a particular height, say 5 ft 5 in. (the fourth child), is 1/n = 0.2. Also, for instance, the probability of a particular feasible value showing on the roll of a fair six-sided die is 1/n = 1/6 = 0.1666….

3.3.2 Binomial Distribution

A discrete distribution called the binomial occurs when any observation can be placed in only one of two mutually exclusive categories, such as greater-than or less-than-or-equal-to, safe or unsafe, hot or cold, on or off, 0 or 1, pass or fail, Heads or Tails, etc. Although these characteristics are qualitative, the distribution can be made quantitative by assigning the values 0 and 1 to the two categories. The method of assignment is immaterial so long as it is consistent. Customarily, the categories are labeled success (value = 1) and
failure (value = 0). If p = probability of success and q = 1 − p = probability of failure in one trial of the experiment (one observation), the probability of exactly x number of successes in n trials can be described by the corresponding term of the binomial expansion, or
f(x|n) = (n choose x) p^x q^(n−x) ≡ [n!/(x!(n−x)!)] p^x q^(n−x),  x = 0, 1, 2, …, n    (3.9)
where X may only have integer values.

Note: The (n choose x) symbol does not mean n divided by x; it represents n!/(x!(n−x)!), which is the number of combinations (ways) of having x occur in n trials. If n = 4 and x = 2 then (4 choose 2) = 4!/(2!(4 − 2)!) = (4 × 3 × 2 × 1)/((2 × 1)(2 × 1)) = 6. The six possible success–fail patterns could be 1100, 1010, 1001, 0110, 0101, and 0011.

Note: The variable x represents the numerical count in a particular category; it is not the value of the category.

Note: When n is large, the factorial terms become large, and direct calculation of either the numerator or denominator can result in digital overflow. Fortunately, the number of integer factors in the numerator and denominator is equal; there are n in each, and a best way to calculate the ratio is to alternate dividing and multiplying. But many software packages provide convenient functions. In Excel the function is f(x|n) = BINOM.DIST(x, n, p, 0).
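The alternate-multiply-and-divide idea in the note can be sketched in a few lines of Python (our illustration; the standard library's math.comb does the same job exactly):

    from math import comb

    def n_choose_x(n, x):
        # Build n!/(x!(n - x)!) incrementally, alternating multiplication
        # and division so the intermediate values stay small.
        result = 1.0
        for i in range(1, x + 1):
            result *= n - x + i  # one factor from the numerator
            result /= i          # one factor from the denominator
        return round(result)

    print(n_choose_x(4, 2), comb(4, 2))  # both give 6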
The binomial cumulative distribution function is

F(xi|n) = P(X ≤ xi|n) = ∑_{k=0}^{xi} (n choose k) p^k (1 − p)^(n−k),  i = 0, 1, 2, …, n    (3.10)
where X may have only integer values, for selected values of n and p. The notation (something | n) means "something given the value of n". In Excel, F(xi|n) = BINOM.DIST(x, n, p, 1). One can compute other probabilities such as P(xi ≤ X ≤ xj), indicating the probability that an observation value, X, would be between and including xi and xj.
P(xi ≤ X ≤ xj) = P(X ≤ xj) − P(X ≤ xi − 1) = F(xj) − F(xi − 1)
             = ∑_{k=0}^{xj} (n choose k) p^k (1 − p)^(n−k) − ∑_{k=0}^{xi−1} (n choose k) p^k (1 − p)^(n−k)    (3.11)
The best way to explain the use of Equation (3.11) is by use of a brief example. If you want P(10 ≤ X ≤ 20), you need to exclude all values of X which are not in the probability specification. In this case, we want to include only the values X = 10, 11, 12, …, 20. The values of X = 0, 1, 2, …, 9 must be excluded. As x1 = 10, to exclude values below 10, we must use (x1 − 1) = (10 − 1) = 9 as the index. Specific values, such that the probability will be exactly s successes in n trials, can be found from
P(X = s|n) = P(X ≤ s|n) − P(X ≤ (s − 1)|n)    (3.12)
FIGURE 3.2 Typical binomial distribution: (a) point, (b) cumulative, both for n = 6, p = 0.3.
Depending on the values of n and p, the binomial distribution may have several shapes; however, Figure 3.2, with n = 6 and p = 0.3, illustrates the characteristic shape. The mean and variance of the number count of events described by the binomial distribution are
μx = np    (3.13)
and
σx² = npq    (3.14)
where n is the total number of attempts (trials), p is the portion of resulting successes expressed as a numeric or decimal fraction of n, and q = 1 − p. Accordingly, μ is the average number of successes, and might not be an integer. If you flip eight coins you expect four to be a head. However, if you flip seven, you expect 3.5 to be a head. Of course, you cannot have a fraction of a count. What that means is if you flipped seven coins millions of times, and counted the number of heads each time, you might have 3, 2, 4, 5, 4, 3, …. Averaging this list is 3.5. Similarly, σ is the standard deviation on the value of the number of successes, not of the probability. The point probability function f and the cumulative probability function F are dimensionless.

The value of the probability of a particular trial being a success, p, is considered to be the true distribution value, not the chance value one might assign from a few trials. The ideal true value of a coin flipping a head is 0.5. However, if you did not know that value, and flipped a coin seven times to see the head-to-tail ratio, you might find three heads out of seven and calculate p = 3/7 = 0.42857… as the probability. If you do not know the true value for p, you could get a reasonably close approximation by doing many (perhaps thousands of) experiments.

Example 3.1: You have submitted four proposals for upgrading the manufacturing facilities in your process area. From past experience you feel that the chance for any one project to be approved by the Finance Committee is 0.6. Accepting that p = 0.6 and that
selection is a random event, what are the chances (a) that one project will be approved and (b) that at least one project will be approved?

For n = 4 and p = 0.6, using Equation (3.12):

(a) P(X = 1) = P(X ≤ 1) − P(X ≤ 0) = 0.1792 − 0.0256 = 0.1536 or 15%

(b) P(X ≥ 1) = 1 − P(X ≤ 0) = 1 − 0.0256 = 0.9744 or 97%
Why are these answers different? The first allows the occurrence of only a single event, approval of only one project. The second allows approval of any number of your projects except 0.
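These results are easy to reproduce; here is a minimal Python sketch (ours) of Equation (3.9) applied to Example 3.1:

    from math import comb

    def binom_pmf(x, n, p):
        # Equation (3.9): probability of exactly x successes in n trials.
        return comb(n, x) * p**x * (1 - p) ** (n - x)

    n, p = 4, 0.6
    print(round(binom_pmf(1, n, p), 4))      # (a) exactly one approval: 0.1536
    print(round(1 - binom_pmf(0, n, p), 4))  # (b) at least one approval: 0.9744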
3.3.3 Poisson Distribution

The Poisson distribution is concerned with the number of events occurring during a given time or space interval. The interval may be of any duration or in any specified region. The Poisson distribution, then, can be used to describe the number of breaks or other flaws in a particular beam of finished cloth, or the arrival rate of people in a queuing line, or the number of defectives in a paint weathering trial, or the number of defective beakers per line per shift. The Poisson distribution describes processes with the following properties:
1. The number of events, X, in any time interval or region is independent of those occurring elsewhere in time or space.
2. The probability of an event happening in a very short time interval or in a very small region does not depend on the events outside this interval or region.
3. The interval or region is so short or small that the number of events in the interval is much smaller than the total number of events, n.

The point Poisson distribution (point probability) function f(x) can be expressed as
f(x) = λ^x e^(−λ) / x!    (3.15)
where x is the number of events, f(x) is the probability of x events occurring in an interval, λ is the expected average number of events per interval, and e = 2.7182818… is the base of the natural logarithm system. The cumulative Poisson distribution function F(x) is
CDF(x) = F(x) = P(X ≤ x) = ∑_{k=0}^{x} λ^k e^(−λ) / k!    (3.16)

where e is the base of the natural logarithm system.
FIGURE 3.3 Poisson distribution: (a) point, (b) cumulative, both for λ = 3.
Using Excel functions, values of the point Poisson distribution are obtained by f(x) = POISSON.DIST(x, λ, 0), and values for the cumulative Poisson distribution are obtained by F(x) = POISSON.DIST(x, λ, 1). Figure 3.3 illustrates the shape of the Poisson distribution for λ = 3 (three events per interval of interest).

Note: The shape of the Poisson distribution is similar to that of the binomial; however, the right-hand tail of the Poisson distribution extends to infinity (although here only graphed up to n = 10), whereas there can be at most 6 out of 6 (or n out of n) events in the binomial distribution; even if the expected number of events per unit is 3, the Poisson distribution allows for the rare but possible case in which 9 events occur in the same interval.

The mean and variance of the Poisson distribution are
μ = λ = np    (3.17)
and
σ² = λ = np    (3.18)
where n is the total number of events within the interval, an integer, and p is the probability that each event will occur in an interval. λ may have a fractional value. The variable X is an integer that has the units of number of events or things per unit of time, area, volume, or other specified interval, and λ represents the expected (or average) number of events that will occur within that unit interval. The point and cumulative distributions are dimensionless. Again, this supposes that you know the true value for λ.

Example 3.2: A branch bank has found that their drive-in customer arrival rate averages one customer per minute during the morning hours. What is the probability that there will be no customer arrivals in any particular minute? What is the probability that three or more customers will arrive in any minute?

From Equation (3.15), f(x = 0) = λ^0 e^(−λ)/0! = 1^0 e^(−1)/0! = 0.3678794 or about 37%. From Equation (3.16),
P(X ≥ 3) = 1 − P(X < 3) = 1 − P(X ≤ 2) = 1 − ∑_{k=0}^{2} λ^k e^(−λ)/k! = 0.0803014 or about 8%.
The probability of no arrivals in any particular minute is about 37%. The probability of three or more customers arriving in any one minute is about 8%.
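A minimal Python sketch (ours) reproduces both results from Equations (3.15) and (3.16):

    from math import exp, factorial

    def poisson_pmf(x, lam):
        # Equation (3.15): probability of exactly x events per interval.
        return lam**x * exp(-lam) / factorial(x)

    lam = 1.0  # one arrival per minute
    print(round(poisson_pmf(0, lam), 7))                             # 0.3678794
    print(round(1 - sum(poisson_pmf(k, lam) for k in range(3)), 7))  # 0.0803014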
The Poisson distribution can be used to approximate the binomial distribution. In general, if n ≥ 20 and p ≤ 0.01, the Poisson can be used to approximate the binomial with errors on f(x) generally below 5% for any value of x.

Example 3.3: In the preparation of sample coupons for a salt-spray corrosion test, 4% of the metal coupons are found to be unusable because of improper cleaning. In a sample of 30 coupons, what is the chance of having two or fewer rejects?

We assume the binomial model as the coupons are initially either clean or not. Furthermore, the cleanliness of any coupon has no effect on whether any of the others in the sample are clean or not. For p = 0.04 and n = 30, the binomial probability is

P_binomial(X ≤ 2) = [30!/(0!30!)](0.04)^0 (0.96)^30 + [30!/(1!29!)](0.04)^1 (0.96)^29 + [30!/(2!28!)](0.04)^2 (0.96)^28 = 0.883103 or 88%

For λ = np = 0.04(30) = 1.2, the Poisson approximation of the binomial probability is
P_Poisson(X ≤ 2) = 0.8795 or 88%.
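The closeness of the approximation in Example 3.3 can be probed directly (our sketch); both cumulative probabilities land near 0.88:

    from math import comb, exp, factorial

    n, p = 30, 0.04
    lam = n * p  # 1.2

    binom = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(3))
    poisson = sum(lam**k * exp(-lam) / factorial(k) for k in range(3))
    print(round(binom, 6), round(poisson, 6))  # 0.883103 and 0.879487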
Although the Poisson distribution is not the proper descriptor of these events (a single coupon cannot be dirty twice), in the limit of n = 30 and low event probabilities the Poisson and binomial distributions are similar.

Example 3.4: If one process has an average of three shutdowns per year, how many would you expect from five similar processes in a month?

For the one process, λ = 3 [shutdowns per year per process]. Five times as many processes would total five times more on average. But, in a month, you expect 1/12 as many on average. So, λ = 3 × 5/12 = 1.25 [shutdowns per month per five processes]. The individual probabilities of 0, 1, and 2 shutdowns are
P(0) = λ^0 e^(−λ)/0! = 0.2865…

P(1) = λ^1 e^(−λ)/1! = 0.3581…
P(2) = λ^2 e^(−λ)/2! = 0.2238…
To obtain the average,

n̄ = ∑_{i=1}^{∞} i · P(i) = 1.25 [shutdowns per month per five processes]

Expectedly, n̄ = λ.
3.3.4 Negative Binomial Distribution

In cases in which the binomial distribution governs the probability of occurrence of one of two mutually exclusive events, we calculated the probability of success exactly s times out of n trials. The negative binomial distribution is used in a complementary way, that is, for calculating the probability that exactly n trials are required to produce s successes. The probabilities of success and failure remain fixed at p and q, respectively. The only way this situation can occur is for exactly (s − 1) of the first (n − 1) trials to be a success, and for the next, or last, trial also to be a success. The probability of x = n, the number of trials needed to produce s successful outcomes, then
f(x = n|s) = (n−1 choose s−1) p^s q^(n−s),  s ≤ n    (3.19)
is the negative binomial distribution. The cumulative negative binomial distribution is
F(x = n|s) = P(s ≤ x ≤ n) = ∑_{i=s}^{n} (i−1 choose s−1) p^s q^(i−s)    (3.20)
Figure 3.4 illustrates the negative binomial distribution for p = 0.5 and s = 3. The mean and variance of the negative binomial distribution are given by
μ = s/p    (3.21)
FIGURE 3.4 Negative binomial distribution: (a) point, (b) cumulative, both for p = 0.5 and s = 3.
and

σ² = sq/p²    (3.22)
The units on x, s, n, and μ are the numbers of trials. The point and cumulative distribution functions, f(xi) and F(xi), are dimensionless. The units on p and q are the probabilities of success or failure.

Example 3.5: Suppose one of your power sources for an analytical instrument in the quality control laboratory has died with a snap and a wisp of smoke. You have finally located the trouble as a faulty integrated circuit (IC). You have been able to find five replacement ICs. You have also found that for this service the chance of failure of an IC is 12%. What is the probability that you will have to use all five ICs before getting one that does not burn out?

Let us define burnout as failure, so q = 0.12 and p = 0.88. As x = 5 and s = 1, using Equation (3.19),
f(x = 5|1) = (4 choose 0)(0.88)(0.12)^4 = 1.825 × 10^−4 or 0.02%
The probability is less than 0.02% that you will have to try all five of the ICs to repair the power supply.
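A minimal sketch (ours) of Equation (3.19) confirms the value:

    from math import comb

    def neg_binom_pmf(n, s, p):
        # Equation (3.19): probability that exactly n trials are needed
        # to obtain s successes, with success probability p per trial.
        q = 1 - p
        return comb(n - 1, s - 1) * p**s * q ** (n - s)

    print(neg_binom_pmf(5, 1, 0.88))  # about 1.825e-04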
3.3.5 Hypergeometric Distribution

The hypergeometric distribution is often used to obtain probabilities when sampling is done without replacement. As a result, the probability of success changes with each trial or experiment. The point hypergeometric probability function is
P(X = s) = f(s) = (S choose s)(N−S choose n−s) / (N choose n),  s ≤ min(n, S)    (3.23)
where N is the population size, n is the sample size, S is the actual number of successes in the population, s is the number of successes in the sample, and n ≤ N and (n − s) ≤ (N − S). The cumulative hypergeometric distribution is
F(x) = P(X ≤ x) = ∑_{k=0}^{x} (S choose k)(N−S choose n−k) / (N choose n)    (3.24)
Examples of the point and cumulative hypergeometric distributions are shown in Figure 3.5 for N = 20, S = 15, and n = 5. The mean of the point hypergeometric distribution is
μ = nS/N    (3.25)
FIGURE 3.5 Hypergeometric distribution: (a) point, (b) cumulative, both for N = 20, S = 15, n = 5.
and the variance is
σ² = [(N − n)/(N − 1)] · n · (S/N) · [(N − S)/N]    (3.26)
The units of μ, s, N, n, and S are the number of items, populations, or successes. The point and cumulative probability functions f(x) and F(x) are dimensionless.

Example 3.6: In the production of avionics equipment for civilian and military use, one manufacturer randomly inspects 10% of all incoming parts for defects. If any of the parts is defective, all the rest are inspected. If 2 of the next box of 50 diodes are actually defective, what is the probability that all of the diodes will be checked before use?

This question is really whether the quality control sample of 5 will contain at least one of the defective parts. For this problem, N = 50, n = 5, and S = 2, as we choose to define success as finding a defective diode. The probability is found from
P(X ≥ 1) = 1 − F(0) = ∑_{k=1}^{2} (2 choose k)(48 choose 5−k) / (50 choose 5) = 0.1918367 or 19%
found both faulty units. However, here is where the laws of probability may lead you to accept a false conclusion: Can you ever be absolutely certain that you have found all defectives unless all incoming parts are checked? Of course not. The issue, as we will see in Chapters 5, 6, 7, 10, and 11, is how much of a chance of being wrong you are willing to take.
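The same at-least-one probability is a one-line computation; a minimal sketch, assuming Python with scipy (scipy's argument order is: successes in the sample, population size, successes in the population, sample size):

    # Example 3.6: P(at least one defective) in a sample of 5 from 50 with 2 defective
    from scipy.stats import hypergeom

    N, S, n = 50, 2, 5
    p_none = hypergeom.pmf(0, N, S, n)  # probability of zero defectives in the sample
    print(1 - p_none)                   # ~0.1918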
3.3.6 Geometric Distribution

If an event can be dichotomous (have either one of two distinct discrete outcomes), the geometric distribution describes the number of trials until (up to and including) the first success (or failure). The point geometric probability distribution is described by
f(x = k) = p q^(k−1) (3.27)
where k is the number of trials until the first success, p is the probability of an individual trial success, and q is the probability of an individual trial failure = (1 − p). The cumulative geometric distribution function is
F(x = k) = P(1 ≤ x ≤ k) = Σ_{i=1}^{k} p q^(i−1) (3.28)
Figure 3.6 shows the geometric distribution for p = 0.6. Values for the mean and variance of the geometric distribution are
μ = 1/p (3.29)
and
σ² = q/p² (3.30)
Although the outcome value may have any dimension (ft/sec², kilopascals, ft, etc.) or be a class variable (on or off, wet or dry, smoker or nonsmoker, etc.), the variables p, q, f(x), and
FIGURE 3.6 Geometric distribution: (a) point, (b) cumulative, both for p = 0.6.
F(k) are dimensionless. The value of the mean μ is the average number of trials until the first success, and the units on x and k are the number of trials.

The pattern of 5 successive fails and the 6th trial providing a success would be FFFFFS. This is one success in 6 trials. Similar to the geometric distribution, the binomial distribution gives the probability of x = 1 with n = 6. But the binomial distribution would allow for any sequence that generated that count, for instance, FFSFFF or FFFFSF. There are 6 possible combinations of one S and 5 Fs. In the geometric distribution, only one of the C(n, 1) = n combinations is the one that matters. The geometric distribution is the same as the binomial distribution when x = 1 and divided by n to represent that only one of the n combinations is the one sought.

Example 3.7: At a marina, the boat slips either have boats assigned to them or are available. If 100 slips are at the marina and 75 have been rented, what is the probability of having to select 5 slips by random choice of slip number until getting one that is available (4 fails followed by a success is 4 + 1 = 5 trials)? Since p = probability of success (available empty slip) = 0.25, then q = probability of failure (rented) = 0.75, using Equation (3.27), f(x = 5) = 0.25(0.75)^(5−1) = 0.07910156 or about 8%. In addition, using Equation (3.29),
μ = 1/0.25 = 4 trials until the first success
the probability of having to select five slip numbers at random before finding one empty is about 8%. On average, you would have to choose four slips at random in order to find one empty.
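For the geometric distribution, a minimal Python/scipy sketch (an assumed toolset, shown only as a cross-check) reproduces both numbers:

    # Example 3.7: first available slip on the 5th random pick
    from scipy.stats import geom

    p = 0.25
    print(geom.pmf(5, p))   # 0.25 * 0.75**4 ~ 0.0791
    print(geom.mean(p))     # 1/p = 4.0, Equation (3.29)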
3.4 Continuous Distributions

In the previous section, the distributions were related to the number count of events. In contrast, many measurements are continuum-valued. Continuous distributions model the probabilities associated with continuous variables, such as those that describe events such as service life, pressure drop, flow rate, temperature, percent conversion, and degradation in yield strength. That we measure continuous variables in discrete units or at fixed time intervals does not matter; the variables themselves are continuous even if the measuring devices give data that are recorded as if step changes had occurred. A familiar example is body temperature, a continuous variable measured in discrete increments. Think about it: Even if you have a fever, your temperature does not change from 98.6 to 101.2°F in one step or even in a series of connected 0.2°F intervals, just because the thermometer is calibrated that way.

We must acknowledge, however, that the world is not continuous. From an atomic and quantum mechanical view of the universe, no event has a continuum of values. However, on the macroscale of engineering, individual atoms are not distinguishable within measurement discrimination, and so the world appears continuous. For most practical engineering purposes, it is possible to approximate any distribution in which the discrete
variable has more than 100 values with a probability density function of a continuous random variable. A cumulative continuous distribution function F(x) is defined as
CDF(x) = F(x) = ∫_{−∞}^{x} pdf(X) dX (3.31)
where pdf(X) is a continuous probability density function and X is a continuous variable, which could represent time, temperature, weight, composition, etc. x is a particular value of the variable X. The units on x and X are identical and are not a count of the number of events as the x-variable in the discrete distributions. The F(x) is the area under the pdf(x) curve, is dimensionless, and as x goes from −∞ to +∞, F(x) goes from 0 to 1.
∫_{−∞}^{+∞} pdf(X) dX = 1 (3.32)
Note, again, the terms CDF and F are used interchangeably. Additionally, the terms pdf(x) and f(x) are also used interchangeably. In the discrete distributions, pdf(x) would mean point distribution function, and in continuous functions, it means probability density function. Although both the continuum pdf(x) and discrete f(xi) represent the histogram shape of data, they are different. The dimensional units of pdf(x) constitute a major difference between a continuous probability distribution function and the f(xi) of a discrete point probability distribution. The pdf(x) necessarily has dimensional units that are the reciprocal of the continuous variable. For F(x) of Equation (3.31) to be dimensionless, integrating with dx, the argument of the integral, pdf(x), must have the units of the reciprocal of dx. pdf(x) is often termed a rate, a rate of change of F(x) w.r.t. x. By contrast, in a discrete function F(xi) is the sum of f(xi), the fraction of the dataset with a value of xi, so f(xi) is dimensionless. You cannot use a discrete point distribution in Equation (3.31) or a continuous function in Equation (3.1) and expect F(x) to remain a dimensionless cumulative probability. Another difference between discrete and continuous probability density functions is that x is used only to represent values of the variable involved throughout the continuous case. For discrete distributions, x was often the number of events in a particular class (category). So, whether you are using the term pdf(x) or f(x) take care that you are properly using the dimensionless version for distributions of a discrete variable, and the rate version with reciprocal units of X for distributions of continuum variables. The mean and variance of the theoretical continuum distributions are:
μ = ∫_{−∞}^{+∞} x pdf(x) dx (3.33)

σ² = ∫_{−∞}^{+∞} (x − μ)² pdf(x) dx (3.34)
3.4.1 Continuous Uniform Distribution

If a random variable can have any numerical value within the range from a to b and no values outside that range, and if each possible value has an equal probability of occurring, then the probability density function for the uniform continuous distribution is
pdf(x) = 1/(b − a) for a ≤ x ≤ b; 0 for x < a or x > b (3.35)

F(x) = P(X ≤ x) = 0 for x < a; (x − a)/(b − a) for a ≤ x ≤ b; 1 for x > b (3.36)
An acronym for data that is uniformly and independently distributed within a range from a to b is UID(a,b). Figure 3.7 illustrates the uniform distribution for a = 2 and b = 5. The mean and variance of the continuous uniform distribution are
μ = (a + b)/2 (3.37)
and
σ² = (b − a)²/12 (3.38)
Note the parallels and differences between equations for the mean and variance of the continuous uniform and equally incremented discrete uniform distributions. The random variable X may have any dimensional units, which must match those of parameters a and b. The mean, μ, will have the same units. Whereas the cumulative distribution function F(x) is dimensionless, the probability density function, pdf(x), has units that are the reciprocal of those of the random variable X. Again, the population coefficients, a and b, represent the true values. You might not know what they are exactly, but certainly you can do enough experiments to get good estimates for them.
FIGURE 3.7 Continuous uniform distribution: (a) probability density function, (b) cumulative distribution, both for a = 5, b = 10.
Example 3.8: From observation, it appears that fluid turbulence superimposes a uniformly distributed disturbance of ±2 psi on a differential pressure cell installed across an orifice. If this is true and the average differential pressure is 20 psi, what is the probability of reading a value of more than 21 psi? From Equation (3.37), we have

a = μ − 2 psi = 18 psi

and

b = μ + 2 psi = 22 psi
From Equation (3.36), we find that
P(X ≤ 21) = (21 − 18)/(22 − 18) = 0.75
P(X > 21) = 1 − P(X ≤ 21) = 1.0 − 0.75 = 0.25
There is a 25% probability of reading a differential pressure value greater than 21 psi.
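A minimal sketch of the same calculation, assuming Python with scipy (note that scipy's uniform takes the left edge and the width, not the two endpoints):

    # Example 3.8: P(X > 21) for X uniform on [18, 22] psi
    from scipy.stats import uniform

    a, b = 18.0, 22.0
    dist = uniform(loc=a, scale=b - a)
    print(1 - dist.cdf(21.0))   # 0.25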
3.4.2 Proportion

A proportion is the probability of an event, a fraction of outcomes, and is a continuum-valued variable. In flipping a coin, the probability of a particular outcome is p = 0.5. In rolling a die, the probability of getting a 5 is p = 0.16666…. In rolling 10 dice, where winning means getting at least one five in the 10 outcomes, the probability is p = 0.83849441…. Although the events are discrete, the probability could have a continuum of values between 0 and 1: 0 ≤ p ≤ 1. If the proportion is developed theoretically, then it is known with as much certainty as the basis and idealizations allow. Then the variance on the proportion is 0.
σ_p² = 0 (3.39)
Alternately, the proportion could be determined from experimental data. For example, a trick die could be weighted to have p = 0.21 as the probability of rolling a 5. Here, proportion, p, is the ratio of number of successes, s, per total number of trials, n, p = s/n, as n → ∞. Alternately, the proportion would be estimated as the average after many trials.

μ̂_p = p̂ = s/n = Σs_i/Σn_i (3.40)
If experimentally determined, the variance on the proportion would be estimated by
σ̂_p² = p̂(1 − p̂)/n = p̂q̂/n = s(n − s)/n³ (3.41)
Note this is similar to the mean and variance of the binomial distribution, but here the statistics are on the continuum-valued proportion. In the binomial distribution, the statistics
are on the number count of a particular type of event. The variance of the count of successes of n samples from a population would be given by Equation (3.14).

Example 3.9: What are the mean and sigma when the probability of an event (outcome = 1) is an unknown p, and the probability of a not-an-event (outcome = 0) is q = (1 − p)? A sequence of n dichotomous events might be
0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, …
Whether we call the event a H or a T, a success or a fail, the {1,0} notation is equivalent. Experimentally, there are s = 21 successes out of n = 143 trials. From Equation (3.40) the estimate of p is

p̂ = s/n = 21/143 = 0.14685314…
From Equation (3.41) the standard deviation on pˆ is
σ̂_p = √(σ̂_p²) = √(s(n − s)/n³) = √(21(143 − 21)/143³) = 0.02959957…
Acknowledging the uncertainty on p̂ and σ̂_p, one might report p̂ = 0.147 and σ̂_p = 0.03.
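Equations (3.40) and (3.41) are simple enough to script directly; a sketch in plain Python (our illustration, assuming only the standard library):

    # Example 3.9: estimate of a proportion and its standard deviation
    import math

    s, n = 21, 143
    p_hat = s / n                            # Equation (3.40)
    sigma_p = math.sqrt(s * (n - s) / n**3)  # Equation (3.41)
    print(p_hat, sigma_p)                    # ~0.1469 and ~0.0296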
3.4.3 Exponential Distribution

The exponential (or negative exponential) distribution describes a mechanism whereby the probability of failures (or events) within a time or distance interval depends directly on the number of un-failed items remaining. It describes events such as radioisotope decay, light intensity attenuation through matter of uniform properties, the failure rate of light bulbs, and the residence time or age distribution of particles in a continuous-flow stirred tank. Requirements for the distribution are that, at any time, the probability of any one particular item failing is the same as that of any other item failing and is the same as it was earlier. Another restriction is that the numbers are so large that the measured values seem to be a continuum. The probability distribution functions are
pdf(x) = α e^(−αx), 0 ≤ x ≤ ∞, α > 0 (3.42)
and
F(x) = 1 − e^(−αx) (3.43)
The variable x represents the time or distance interval, not the number (or some other measure of quantity) of un-failed items. The argument of an exponential must be dimensionless,
so the units on α are the reciprocal of the units on x. This requires that the units on pdf(x) are also the reciprocal of the units on x, making pdf(x) a rate. Figure 3.8 illustrates the exponential distribution for α = 0.3.

FIGURE 3.8 Exponential distribution: (a) probability density function, (b) cumulative distribution, both for α = 0.3.

The mean and variance of the exponential distribution are
μ = 1/α (3.44)

and

σ² = 1/α² (3.45)
The continuous random variable X may have any units. The units on μ will be the same. The units on both α and f(x) are the reciprocal of those of X. F(x) is dimensionless. For a physical interpretation, α represents the fraction of events occurring per unit of space or time. The discrete geometric distribution, in its limit as the number of events is very large and the probability of success is small, approaches the continuous exponential distribution.

Example 3.10: One billion adsorption sites are available on the surface of a solid particle. Gas molecules, randomly and uniformly “looking” for a site, find one upon which to adsorb, which “hides” that site from other molecules. With an infinite gas volume, the rate at which the sites are occupied is therefore proportional to the number of unoccupied sites. If 40% of the sites are covered within the first 24 hours, how long will it take for 99% of the adsorption to be complete? What is the average lifetime of an unoccupied site? From Equation (3.43),
40% = 0.40 = F(t) = 1 − e^(−αt) = 1 − e^(−α(24))
which gives α = 0.02128440… per hour. From Equation (3.43), 99% = 0.99 = F(t) = 1 − e^(−0.02128…t), which gives t = 216.3636244… hours or about 9 days. From Equation (3.44), μ = 1/α = 46.9827645 hours or almost 2 days.
It takes about 9 days for 99% of the sites to become occupied. The average life of an unoccupied site is about 2 days. Note that although the exact number of initial active sites is immaterial, the fact that it was very large meets the second requirement for the use of the exponential distribution.
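The algebra of Example 3.10 inverts Equation (3.43) twice; a short Python sketch (an assumed environment, standard library only):

    # Example 3.10: solve for alpha from the 24-hour datum, then invert the CDF
    import math

    alpha = -math.log(1 - 0.40) / 24     # from 0.40 = 1 - exp(-alpha*24)
    t99 = -math.log(1 - 0.99) / alpha    # time for 99% coverage
    print(alpha)                         # ~0.021284 per hour
    print(t99 / 24, (1 / alpha) / 24)    # ~9.0 days and ~2.0 days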
3.4.4 Gamma Distribution

The gamma distribution can represent two mechanisms. In a general situation in which a number of partial random events must occur before a complete event is realized, the probability density function of the complete event is given by the gamma distribution. For instance, rust spots on your car (the partial event), may occur randomly at an average rate of one per month. If 16 spots occur before you decide to have your car repainted (the total event), the gamma distribution is the appropriate one to use to describe the repainting time interval. The gamma distribution is
pdf(x) = [λ/Γ(α)] (λx)^(α−1) exp(−λx), x ≥ 0 (3.46)
where α and λ > 0, and Γ(α) is the gamma function
Γ(α) = ∫₀^∞ Z^(α−1) e^(−Z) dZ (3.47)
The gamma function has several properties
Γ(α) = (α − 1) Γ(α − 1) (3.48)
and if α is an integer, then
Γ(α) = (α − 1)! (3.49)
The variable α represents the number of partial events required to constitute a complete event, and λ is the number of partial events per unit of x (which may be time, distance, space, or item). If α = 1, the gamma distribution reduces to the exponential distribution. For that reason, if an event rate is proportional to some power of x, then the gamma distribution can also be used as an adjusted exponential distribution. Let’s look at Example 3.10 again. If adsorption reduces the number of gas molecules available for subsequent adsorption, then the probability of any site being occupied decreases with time. If the frequency with which gas molecules impinge on the particle surface decreases as (λx)^(α−1), then the gamma distribution describes f(x). However, although close enough for most engineering applications, the power law decrease probably does not describe a real driving force exactly. For such a situation, use of the gamma distribution must be acknowledged as a convenient approximation. Depending on the values of α and λ, f(x) may have various shapes, some of which are illustrated in Figure 3.9. A general analytical expression for F(x) is intractable. For most α values, to obtain the cumulative distribution function, f(x) must be integrated numerically. Excel provides the function GAMMA.DIST(x, α, 1/λ, 0) to return the pdf(x) value
and GAMMA.DIST(x, α, 1/λ, 1) to return the CDF(x) value. Note that the Excel parameter beta is the reciprocal of λ here.

FIGURE 3.9 Gamma distribution: probability density function and cumulative distribution, both for various parameter values – (a) α = 5, λ = 2.5, (b) α = 1, λ = 0.8.

The mean and variance of the gamma distribution are
μ = α/λ (3.50)

σ² = α/λ² (3.51)
The units of X are usually count per some interval (time, distance, area, space, or item). Consequently, the units for λ are the fraction of total failures per unit of X. The coefficient, α, is a counting number and is dimensionless, and f(x) has units that are the reciprocal of the units of X.

Example 3.11: There are 8 lightbulbs in a particular chandelier. The lights fail randomly. When three fail, only 5 remain and there is not enough illumination for the room. The illumination is adequate if 0, 1, or 2 lights fail; but when three lights fail, the bulbs must be replaced. On average, we find that the lights need to be replaced every 2 months.
What is the average failure rate of a single light on a monthly basis? What fraction of the lights will last for 4 months without needing replacement? From Equation (3.50),
λ = α/μ = 3 (partials per total)/2 (months per total) = 1.5 (partials per month)
We expect 1.5 lights to fail each month.

P(X > 4) = 1 − P(X ≤ 4) = 1 − ∫₀⁴ f(α = 3, λ = 1.5, x) dx
Using the Excel function GAMMA.DIST(x, α, 1/λ, 1) with GAMMA.DIST(4, 3, 1/1.5, 1),
P(X > 4) = 1 − 0.938031… = 0.061968… ≅ 6.2%
About 6% of the lights will last for four months.
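The GAMMA.DIST call translates directly to scipy, which, like Excel, takes the scale 1/λ rather than λ itself. A minimal sketch (assuming Python with scipy):

    # Example 3.11: P(X > 4 months) for alpha = 3 partial events, lambda = 1.5 per month
    from scipy.stats import gamma

    alpha, lam = 3, 1.5
    print(1 - gamma.cdf(4, alpha, scale=1/lam))   # ~0.0620, about 6.2%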
3.4.5 Normal Distribution

The normal distribution, often called the Gaussian distribution or bell-shaped error curve, is the most widely used of all continuous probability density functions. The assumption behind this distribution is that any errors (sources of deviation from true) in the experimental results are due to the addition of many independent small perturbation sources. All experimental situations are subject to many random errors and usually yield data that can be adequately described by the normal distribution. Even if your data is not normally distributed, the averages of data from a nonnormal distribution tend toward being normal. An average of independent samples will have some values above the mean and some below. The average will be close to the mean, and each sample would represent a small independent deviation. In the limit of large sample size, n, the standard deviation of the average is related to that of the individual data by σ_X̄ = σ_X/√n. So, when using averages, the normal distribution usually is applicable. However, this situation is not always true. If you have any doubt that your data are distributed normally, you should use the nonparametric techniques in Chapter 7 to evaluate the distribution. Use of statistics that depend on the normal distribution for a dataset that is distinctly skewed may lead to erroneous results. An acronym for data that is normally and independently distributed with a mean of μ and standard deviation of σ is NID(μ, σ).

Regardless of the shape of the distribution of the original population, the central limit theorem allows us to use the normal distribution for descriptive purposes, subject to a single restriction. The theorem simply states that if the population has a mean μ and a finite variance σ², then the distribution of the sample mean X̄ approaches the normal distribution with mean μ and variance σ²/n as the sample size n increases. The chief problem with the theorem is how to tell when the sample size is large enough to give reasonable compliance with the theorem. The selection of sample sizes is covered in Chapters 10, 11, and 17. The probability density function f(x) for the normal distribution is
f(x) = [1/(σ√(2π))] exp[−(1/2)((x − μ)/σ)²], −∞ < x < ∞ (3.52)
Note that the argument of the exponentiation, −(1/2)((x − μ)/σ)², must be dimensionless. As expected, x, μ, and σ each have identical units. The exponentiation value is also dimensionless. Also, since f(x) is proportional to 1/σ, it has the reciprocal units of x. As seen in Equation (3.52), the normal distribution has two parameters, μ and σ, which are the mean and standard deviation, respectively. The cumulative distribution function (CDF) is described by
CDF(x) = F(x) = P(X ≤ x) = [1/(σ√(2π))] ∫_{−∞}^{x} e^(−(X − μ)²/(2σ²)) dX (3.53)
In Equation (3.53) the variable X is the generic variable, and the lower-case x represents a particular value.

The logistic model, CDF(x) = F(x) = P(X ≤ x) = 1/(1 + e^(−(x − c)/s)), is a convenient and reasonably good approximation to the normal CDF(x). Convenient: It is computationally simple, analytically invertible, and analytically differentiable. Reasonably good: Values are no more different from the normal CDF(x) than that caused by uncertainty on μ and σ. For the scale factor, use s = σ/1.7, and for the center, use c = μ (see Exercise 3.15).

Figure 3.10 shows several possible variations on the distribution. If the variance is the same but the means are different, the shapes of the curves are the same, but their locations are different, as shown in Figure 3.10 a and b. However, if the mean is the same and the variance is different, the central part of the curve will be at the same value, but the width of the pdf curve will be different, as seen in Figure 3.10 a and c.

Note: In Figure 3.10a, the pdf and CDF have values at negative x-values. The distribution asymptotically approaches zero at x = −∞, and also at x = +∞. So, even though it is not noticeable in all three figures, the extreme pdf and CDF values are not exactly zero.

Note: In all cases, the pdf distribution is symmetric on either side of the mean, and the peak is at the mean. The pdf curves, visually, are effectively zero when x is beyond 3σ of the mean. And the inflection points on the pdf curve (where the shape changes between concave and convex) are at the ±1σ deviation from the mean. When you are sketching normal pdf curves, incorporate these features.

Note: The 0.5 CDF value of the cumulative distribution corresponds to the mean, as does the inflection point on the CDF curve. The CDF curves asymptotically approach 0 and 1, but they are visually effectively at 0 and 1 when x is ±3σ from the mean. When you are sketching normal CDF curves, incorporate these features.

The probability of X falling within the range x1 < X ≤ x2 is

P(x1 < X ≤ x2) = F(x2) − F(x1) = [1/(σ√(2π))] ∫_{x1}^{x2} e^(−(x − μ)²/(2σ²)) dx (3.54)
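The quality of the logistic approximation mentioned above is easy to verify numerically. A sketch assuming Python with scipy (our illustration of the s = σ/1.7, c = μ recipe):

    # Compare the logistic approximation to the exact normal CDF
    import math
    from scipy.stats import norm

    mu, sigma = 0.0, 1.0
    s, c = sigma / 1.7, mu
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        logistic = 1.0 / (1.0 + math.exp(-(x - c) / s))
        print(x, norm.cdf(x, mu, sigma), logistic)   # differences ~0.01 or less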
Because two parameter values must be specified to define the desired normal distribution, it is convenient to consider a single member of the family of distributions, that with μ = 0 and σ = 1. Such a normal distribution is called the standard normal distribution and is characterized by a single parameter z. The standard normal deviate z is defined by
z = (x − μ)/σ (3.55)
FIGURE 3.10 Normal distribution curves for mean and variance change – (a) μ = 2, σ = 1, (b) μ = 4, σ = 1, (c) μ = 2, σ = 0.5.
The scaled variable z is dimensionless and represents the number of standard deviations that the x-value is away from the mean. This allows us to scale and translate (standardize) Equation (3.53) to
CDF(z) = F(z) = P((x − μ)/σ ≤ z) = (1/√(2π)) ∫_{−∞}^{z} e^(−Z²/2) dZ (3.56)
In this way, we have reduced all normal variables, regardless of parameter values, to a single distribution in z. In Equation (3.56) the variable Z is the generic variable, and the lower-case z represents a particular value.
In Excel, the function NORM.DIST(x, μ, σ, 0) returns pdf(x), and the function NORM.DIST(x, μ, σ, 1) returns CDF(x). If you want the standard normal distributions, then use NORM.DIST(z, 0, 1, 0) for f(z), and the function NORM.DIST(z, 0, 1, 1) for F(z). The inverse function NORM.INV(p, μ, σ) returns the x value associated with the cumulative probability, p. Although demonstrated using the normal distribution, the principles demonstrated in the next two examples are common to the use of all continuous probability distributions and are fundamental to the use of statistics.

Example 3.12: Find the probabilities associated with each of the Z ranges given below. The thumbnail sketches reveal the CDF(z) w.r.t. z-value. The dashed lines represent the numerical values. (a) P(0 < Z ≤ 2.3). The necessary probability can be found in two steps. The total area under the standard normal distribution is 1 and z ranges from −∞ to +∞. This distribution is bilaterally symmetric about Z = 0, so P(Z < 0) or P(Z > 0) = 0.5. P(0 < Z ≤ 2.3) = P(Z ≤ 2.3) − P(Z < 0), which is the area under the curve from z = 0 to z = 2.3. P(0 < Z ≤ 2.3) = 0.9893 − 0.5 = 0.4893.
(b) P(−0.62 ≤ Z ≤ 0) = P(Z < 0) − P(Z ≤ −0.62) = 0.5 − 0.2676 = 0.2324.
(c) P(Z > 0.6) = 1 − P(Z ≤ 0.6) = 1 − 0.7257 = 0.2743.
(d) P(−0.17 ≤ Z ≤ 1.6) = P(Z ≤ 1.6) − P(Z ≤ −0.17) = 0.9452 − 0.4325 = 0.5127.
(e) P(0.25 ≤ Z ≤ 1.96) = P(Z ≤ 1.96) − P(Z ≤ 0.25) = 0.9750 − 0.5987 = 0.3763.
(f) P(−∞ < Z ≤ 1.2). As the values in the NORM.DIST function are based on the integral in Equation (3.56), the solution can be read directly as the value at z = 1.2, and that is 0.8849.
Sometimes, it will be necessary for you to find the values of z1 and/or z2 for some given value of P(z1 ≤ Z ≤ z2). The examples below use the NORM.INV function to determine the limits of Z from the given probability data.

Example 3.13: Determine the z-value associated with each of the probabilities given below. (a) Find the value of z1 if P(0 ≤ Z ≤ z1) = 0.37. As P(Z < 0) is 0.5, we have to find the value of z1 for which the area under the distribution is 0.5 + 0.37 = 0.87. We find z1 = 1.1264.
(b) Find the value of z1 if P(−∞ < Z ≤ z1) = 0.69. There is only one step needed to answer this problem, as the entire area under the curve from −∞ to z1 has been specified as 69% of the total. The value of z1 is 0.4959.
(c) Find the value of z1 if P(−2.1 ≤ Z ≤ z1) = 0.12. The probability interval may be written as P(Z ≤ z1) − P(Z ≤ −2.1) = 0.12 or P(Z ≤ z1) − 0.0179 = 0.12, and the CDF of z1 must be 0.1379. The corresponding value of z is −1.09.
(d) Find the values for z that determine the upper and lower quartiles. For the lower quartile P(−∞ ≤ Z ≤ z1) = 0.25. For the upper quartile P(z2 ≤ Z ≤ +∞) = 0.25, which means that P(−∞ ≤ Z ≤ z2) = 1 − P(z2 ≤ Z ≤ +∞) = 1 − 0.25 = 0.75. The z-values are −0.6745 and +0.6745.
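The NORM.DIST and NORM.INV operations of Examples 3.12 and 3.13 have direct counterparts in scipy (cdf and ppf); a minimal sketch, assuming Python with scipy:

    # Spot-checks of Examples 3.12 and 3.13
    from scipy.stats import norm

    print(norm.cdf(2.3) - norm.cdf(0.0))     # (a) of Example 3.12, ~0.4893
    print(norm.ppf(0.87))                    # (a) of Example 3.13, ~1.1264
    print(norm.ppf(0.25), norm.ppf(0.75))    # quartiles, ~-0.6745 and ~0.6745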
3.4.6 “Student’s” t-Distribution

W. S. Gossett, publishing his work under the pseudonym “Student,” developed the t-distribution. The statistic would become the basis for the t-test so widely used for the evaluation of engineering data. The t-statistic is very similar to the standard normal z-statistic, but instead of the true population standard deviation it uses the sample standard deviation, and instead of an individual observation it uses the sample average.

T = (X̄ − μ)/(s/√n) (3.57)
Because it is based on sample data, not the entire population, the degrees of freedom ν is one less than the number of data used to calculate the sample average and s
ν = n − 1 (3.58)
Relative to the z-statistic, the t-statistic includes the uncertainty on both the sample average and sample standard deviation. Both the z- and t-statistics are dimensionless regardless of the units on the variable X. The random variable t has the probability density function below:
f(t) = [Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))] (1 + t²/ν)^(−(ν+1)/2) for −∞ < t < ∞ (3.59)

CDF(t) = F(t) = [Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))] ∫_{−∞}^{t} (1 + x²/ν)^(−(ν+1)/2) dx (3.60)
FIGURE 3.11 Characteristic shapes of the “Student’s” t-distribution: for v = 2, and v = 200 (approaching the standard normal).
Note that Γ(ν/2) is the gamma function. The gamma function is related to the factorial and is not the gamma probability density distribution. Like the z-distribution, the distribution of t is bilaterally symmetric about t = 0. The t-distribution is illustrated in Figure 3.11 for two values of ν, the degrees of freedom. The resulting bell-shaped distribution resembles that of the standard normal. However, more of the area under the t-distribution is in the “tails” of the distribution. In the limit of large n (effectively ν greater than about 150) the t- and standard normal distributions differ in the tenths of a percent. The use of the t-distribution will be described in subsequent chapters in the sections discussing confidence intervals and tests of hypotheses for the mean of experimental distributions. The cumulative t-distribution, F(t) from Equation (3.60), can be calculated by the Excel function T.DIST(t, ν, 1), where t is calculated from the sample data. Alternately, if you wanted to know the t-value that represents a probability limit, then use the Excel function T.INV(CDF, ν) to return a t-value that would represent that CDF value. Alternately, calculate α, the level of significance, the extreme right-hand area, as α = 1 − F(t) = 1 − CDF, then use the Excel function T.INV(1 − α, ν). That represented a one-sided evaluation, which considered the area under the t-distribution from −∞ up to a particular t-value. But often, we desire to know either the positive or negative extreme values for t, the “+” or “−” deviations from the central “0” value. You may want to know the range of t-values that includes the central 95% (or some confidence fraction C) of all expected values from sampling the population.
P(t_negative limit ≤ T ≤ t_positive limit) = C (3.61)
Here, the level of significance is again the extreme area. If the 95% interval is desired (C = 0.95), then α = 1 − C = 1 − 0.95 = 0.05. Splitting the two tail areas equally, to define the central limits, use α/2 to represent both the far right and far left areas in the tails. Then we seek the t-value calculated with F(t) = 1 − α/2. The Excel function T.INV(1 − α/2, ν) will return the t-value representing the positive extreme expected value, and −T.INV(1 − α/2, ν) will return the negative extreme. This is termed a two-sided (historically a two-tailed) test, because we are seeking the limits of the central area. Alternately, T.INV.2T(α, ν) returns the same value.
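The two-sided limits of Equation (3.61) translate directly; a sketch assuming Python with scipy, with ν = 10 chosen only for illustration:

    # Central 95% t-interval, the analog of T.INV(1 - alpha/2, nu)
    from scipy.stats import t

    C, nu = 0.95, 10
    alpha = 1 - C
    t_hi = t.ppf(1 - alpha / 2, nu)
    print(-t_hi, t_hi)    # ~(-2.228, +2.228) for nu = 10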
3.4.7 Chi-Squared Distribution

Let Y1, Y2, Y3, …, Yn be independent random variables, each normally distributed with mean 0 and variance 1. The random variable chi-squared:

χ² = Σ_{i=1}^{n} Yi² (3.62)
has the chi-squared probability density function with ν = n − 1 degrees of freedom
f(χ²) = [1/(2^(ν/2) Γ(ν/2))] e^(−χ²/2) (χ²)^((ν/2)−1) for 0 ≤ χ² ≤ ∞ (3.63)
and cumulative distribution
F(χ²) = [1/(2^(ν/2) Γ(ν/2))] ∫₀^(χ²) e^(−Y/2) Y^((ν/2)−1) dY (3.64)
If Y in Equation (3.62) is defined as (X − X̄)/σ, then

χ² = Σ_{i=1}^{n} Yi² = Σ_{i=1}^{n} (Xi − X̄)²/σ² = (n − 1)s²/σ² (3.65)
Figure 3.12 illustrates the probability density and cumulative chi-squared distributions, respectively. Values of the cumulative chi-squared (χ²) distribution can be obtained from the Excel function F(χ²) = CHISQ.DIST(χ², ν, 1), and the pdf by using f(χ²) = CHISQ.DIST(χ², ν, 0). The inverse of the calculation, the value of χ² given F(χ²) and ν, can be obtained by the Excel function χ² = CHISQ.INV(F, ν). Note: Some tables or procedures use χ²/ν. Since Equation (3.62) indicates that χ² increases linearly with n, and since degrees of freedom is often ν = n − 1, the scaling makes sense.
FIGURE 3.12 Chi-squared distribution for v = 8.
Mostly, this book will not scale χ² by the degrees of freedom. But be aware that the use of either χ²/ν or χ² is common. The mean and variance of the chi-squared distribution are ν and 2ν, respectively.
μ = ν (3.66)

σ² = 2ν (3.67)
So, if degrees of freedom is 10, an average-like value of the χ² statistic would be about 10. χ² = 1 would be an unexpectedly low value, and χ² = 20 would be unexpectedly high. This distribution has several applications, one of which is in calculating and evaluating probability intervals for single variances from normally distributed populations as shown in Chapters 5 and 6. The chi-squared distribution is also used as a nonparametric method of determining whether or not, based on sample data, a population has a particular distribution, as described in Chapter 7. The chi-squared distribution goes from 0 to infinity, or P(0 ≤ χ² ≤ ∞) = 1. The interval
P(χ²_{ν,α/2} ≤ χ² ≤ χ²_{ν,1−α/2}) = 1 − α (3.68)
defines the values for the χ²-distribution such that equal areas are in each tail. The χ²-distribution is not symmetric about the mean as are the Z- and t-distributions.
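The equal-tail interval of Equation (3.68) is a pair of inverse-CDF evaluations; a minimal sketch assuming Python with scipy, with ν = 8 and α = 0.05 chosen only for illustration:

    # Equal-tail chi-squared interval, the analog of CHISQ.INV
    from scipy.stats import chi2

    nu, alpha = 8, 0.05
    print(chi2.ppf(alpha / 2, nu), chi2.ppf(1 - alpha / 2, nu))   # ~2.18 and ~17.53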
3.4.8 F-Distribution

The F-distribution (named in honor of Sir Ronald Fisher, who developed it) is the distribution of the random variable F, defined as

F = (U/v1)/(V/v2) = (χ1²/v1)/(χ2²/v2) (3.69)

Using Equation (3.65), χ² = (n − 1)s²/σ², this becomes

F = (s1²/σ1²)/(s2²/σ2²) (3.70)
where U and V are independent variables distributed following the chi-squared distribution with v1 and v2 degrees of freedom, respectively. The symbol F in Equation (3.69) does not represent any cumulative distribution but is a statistic, specifically, the ratio of two χ2 statistics, each scaled by their degrees of freedom. The probability density function of F is
f(F) = [Γ((v1 + v2)/2)/(Γ(v1/2) Γ(v2/2))] (v1/v2)^(v1/2) F^((v1−2)/2)/(1 + (v1/v2)F)^((v1+v2)/2) (3.71)
and the cumulative distribution of F is
CDF(F) = ∫₀^F f(F) dF (3.72)
FIGURE 3.13 Characteristic shapes of the F-distribution.
The family of F-distributions is a two-parameter family in v1 and v2. The shape of the F-distribution is skewed (more of the area under the curve to the left side of the nominal value, a longer tail to the right), as illustrated in Figure 3.13. The range of all members is from 0 to ∞. This distribution is used to evaluate equality of variances. The F-distribution is termed “robust” by statisticians, meaning that the results of such statistical comparisons are likely to be valid even if the underlying populations are not normally distributed. The uses of the F-distribution are explained in Chapters 5, 6, and 12. Values of the pdf(F) can be returned by the Excel function pdf(F) = F.DIST(χ1²/χ2², v1, v2, 0), and of the cumulative F-distribution by CDF(F) = F.DIST(χ1²/χ2², v1, v2, 1). The inverse of the distribution returns the chi-squared ratio for a given CDF value: χ1²/χ2² = F.INV(CDF, v1, v2). If the chi-squared ratio is 3.58058 and the numerator and denominator degrees of freedom are 6 and 8, then the CDF value is 0.95. If, however, you choose to call #1 as #2, then the chi-squared ratio would be 0.279284, and the degrees of freedom would be 8 then 6. With these reversed values the CDF value is 0.05, the complement to the first.
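The numbers quoted above, and the label-reversal identity, can be reproduced with scipy's F functions; a minimal sketch (an assumed toolset):

    # F-distribution: inverse at 0.95 for (6, 8) df, and the reversed-label complement
    from scipy.stats import f

    print(f.ppf(0.95, 6, 8))          # ~3.58058, like F.INV(0.95, 6, 8)
    print(f.cdf(1 / 3.58058, 8, 6))   # ~0.05, the complement with 1 and 2 swapped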
3.4.9 Log-Normal Distribution

Many processes (especially particle-creating processes such as prilling, crystal growth, grinding, and attrition) yield a bell-shaped distribution that is skewed, with a long tail on the right. Empirically, research has shown that if f(x) is plotted versus ln X or log X instead of X, the long tail is contracted, and the graph may appear normal in shape. This transformation effect is based on observation, not any theory derived from fundamental phenomena as were the previous distributions.

Figure 3.14a illustrates both pdf and CDF of a skewed distribution from an optimization study. The question the study sought to answer is, “What is the distribution of the number of leap-overs (player moves) to convergence, when leap-over distance is randomized?” The figure presents the results from 1,000,000 simulations. Figure 3.14b displays the same data when the abscissa is log transformed. The data is not from a theoretical analysis, but from a finite number of trials (only 10⁶), and as a result there are small discontinuities in the curve. (See Ch 35 of Rhinehart, R. R., Engineering Optimization: Applications, Methods, and Analysis, 2018, John Wiley & Sons, New York, NY.)
FIGURE 3.14 Log-normal distributions: (a) original data, (b) log-transformed data.
Although the abscissa, the number of leap-overs, is a discrete count, similar shapes and log-transformed normalcy of the distribution are characteristic of many continuum variables, such as particle size (diameter) distribution. One limitation of the log transformation is that the random variable X cannot have any negative values. There is no universally accepted probability density function for the log-normal distribution. From direct substitution of ln x into the normal distribution, we obtain
f(x) = [1/(σ√(2π))] exp[−(ln x − μ)²/(2σ²)], 0 < x < ∞ (3.73)
where
μ = (1/n) Σ_{i=1}^{n} ln x_i (3.74)
and
σ² ≅ S² = [1/(n − 1)] Σ_{i=1}^{n} (ln x_i − μ)² (3.75)
Note: The mean is not the average of x but the average of the ln(x). Similarly, note that the value of sigma is not that of the x-data but of the log of the x-data.

Note: The variable X probably has dimensional units, such as diameter in mm. However, the argument of the log function must be dimensionless. One could solve this issue by considering that the x-values are first scaled by a unit dimension: x′ = x [units]/1 [units]. This does not change numerical values but does make the argument of the log dimensionless.

Note: The ln(x′) = ln(x/1) = ln(x), mean, and variance are all dimensionless, which would make f(x) in Equation (3.73) dimensionless. To make f(x) dimensionally consistent with all other continuous probability density functions, f(x) is divided by the scale factor 1 [units of x]:

f(x) = [1/(1·σ√(2π))] exp[−(ln(x/1) − μ)²/(2σ²)] = [1/(σ√(2π))] exp[−(ln x − μ)²/(2σ²)] (3.76)
Since multiplication and division by unity does not change values, the second part of Equation (3.76) is our preferred form of the log-normal probability density function. However,
f(x) = [1/(ln σ_g √(2π))] exp[−(ln D − ln D_{n,md})²/(2(ln σ_g)²)] (3.77)
where Dn,md is the number-median diameter and σg is the geometric standard deviation, has also been reported as useful. See Exercise 3.21 for derivation of a distribution of x assuming the log(x) is normally distributed. The result is
pdf(x) = [1/(x σ_ln x* √(2π))] exp[−(1/2)((ln(x) − ln(x̄))/σ_ln x*)²] (3.78)
where

x* = x/x̄ (3.79a)
and
σ_ln x* = (1/1.645) ln(x*_0.95) (3.79b)

or

σ_ln x* = (−1/1.645) ln(x*_0.05) (3.79c)
Analytic expressions for F(X) are intractable. You will have to calculate F(X) vs ln X by numerical integration. Using the log-transformed data we can estimate the mean of the log of X and the variance of the log of X and use normal (Z) statistics to test hypotheses on log-transformed populations. Since ln X is monotonic with X, if it is found that ln x1 > ln x2 then we may usually accept that x1 > x2. We must caution you, however, that the units on the log-normal probability density function may be the reciprocal of the meaningless log of the units on X, the representation is not standardized, and the conclusions concerning ln(X) comparisons may not translate to X comparisons. Graphically, the log-transformed distribution is a convenient visual aid. However, for hypothesis testing, we recommend nonparametric methods.

3.4.10 Weibull Distribution

The Weibull distribution is one of many functions that are heuristically created, as opposed to those derived from a particular probability model. Due to the choice of parameter values, the Weibull distribution is very flexible and is commonly used to describe life distributions in reliability work.
f(t) = (β/η)(t/η)^(β−1) e^(−(t/η)^β), t ≥ 0 (3.80)

CDF(t) = 1 − e^(−(t/η)^β), t ≥ 0 (3.81)
where t is the product time to failure, η > 0 is the characteristic life, and β is the shape factor. If β = 1, the Weibull distribution reduces to the exponential distribution with η = 1/λ. If β < 1, the probability of an event w.r.t. time drops faster than an exponential. If β > 1, the probability of any single event w.r.t. time shifts to a like-normal distribution but slightly skewed. Figure 3.15 illustrates some of the Weibull probability density functions for values of β, each with η = 1.
FIGURE 3.15 Weibull distributions; (a) α = 0.95, β = 1 (like the exponential), (b) α = 6, β = 2 (like the normal).
To determine values of η and β that make the Weibull distribution describe your experimental data, simply adjust η and β until the Weibull CDF matches the shape of your experimental data. As a caution, software packages that provide Weibull distribution values might not use the η and β notation of Equation (3.80). The Excel pdf function is WEIBULL.DIST(x, alpha, beta, 0), in which the Excel alpha is the shape factor, β, in Equation (3.80), and the Excel beta is the characteristic life, η, in Equation (3.80). So use WEIBULL.DIST(x, β, η, 0). The Excel function WEIBULL.DIST(x, β, η, 1) will return the CDF value.
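In scipy the same caution about notation applies: the Weibull shape is the first argument and the characteristic life enters as the scale. A minimal sketch (assuming Python with scipy):

    # Weibull pdf and CDF of Equations (3.80) and (3.81)
    from scipy.stats import weibull_min

    beta, eta = 2.0, 1.0
    x = 0.5
    print(weibull_min.pdf(x, beta, scale=eta))   # like WEIBULL.DIST(x, beta, eta, 0)
    print(weibull_min.cdf(x, beta, scale=eta))   # like WEIBULL.DIST(x, beta, eta, 1)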
3.5 Experimental Distributions for Continuum-Valued Data

All the distributions discussed so far have been the theoretical or semi-theoretical distributions that are commonly used to aid in the interpretation of survey or experimental data. Before we can consider experimental distributions, we need to define a few terms that are commonly used.

Classes: Groups into which data are distributed. These could be linguistic categories (such as “upper-right quadrant”, “pass”, “dog”), or numerical values. If numerical, these could be discrete values (such as “1, 2, 3, 4, …” or “1 5/8, 1 6/8, 1 7/8, …”) or an interval of continuum-valued data. If the linguistic categories or discrete values are known, the histogram is simply the count in each category. What follows is for continuum-valued data.

Number of Classes: As a first choice, the number of classes should be about the square root of the number of data.

Class boundary: The numerical value dividing two successive classes. Make these convenient values for a reader to interpret.

Class length: The numerical difference between the boundaries of a class. This should be substantially greater than measurement uncertainty.

Class mark: The mid-value of a class.

Class frequency: The count (or frequency) with which values of observations occur in a class.

Relative frequency: The frequency expressed as a decimal fraction of the total number of observations, the portion in a class.

The use of histograms (vertical bar charts) is a convenient way to display data. However, choosing too many classes may cause the resulting histogram to appear noisy or even to approach a point distribution; so much fine detail is present that the overall picture is lost. In these situations, it is usually impossible to determine the type of distribution followed by the data. If too few (1–4) classes are used, the data are so clumped that nothing has been gained; often, the classes are about the same size. In our experience, for a histogram to be a meaningful representation of the population pdf, you will need at least 50 (and likely more than 100) data divided into 7–12 classes of equal length to give a good representation of the population distribution.
The frequency histogram plots the class boundaries and class marks on the abscissa and the frequency (number in that bin) or relative frequency on the ordinate. Each class is represented by a rectangle centered on the class mark with a height proportional to the corresponding frequency. Empty classes may exist simply because no data fell in those intervals. Having too many open intervals probably indicates that the class length is too small (that too many classes have been used). Frequency polygons are another convenient way to summarize the behavior of a large collection of experimental data. The upper class boundaries (abscissa) and cumulative frequencies (ordinate) of the corresponding classes locate points that, when connected by a straight line, form the cumulative frequency polygon. This is akin to a CDF but discretized by bin intervals. As a result, the percentage of values of the experimental variable expected to be less than or equal to any specified value can be estimated directly from the polygon. As with the Z distribution, you can also find values expected to be greater than some predetermined value of the experimental variable or between two values of the variable. A better way to obtain such probabilities is to fit the appropriate theoretical distribution to the relative frequency histogram and to use the corresponding probability function for predictive purposes.

Example 3.14: Air pollution samples are analyzed for both naturally occurring and industrial contaminants. The results of analyses for sulfate (SO₄²⁻) content in air over a 72-day period are given below, expressed as parts per million in air. Prepare a table of grouped frequencies for the data employing suitable class lengths and boundaries. Plot the resulting frequency histogram and the cumulative relative frequency polygon. What distribution seems to best fit the data? What are the upper and lower quartiles and 90% limits? What are the average and standard deviation of the data?

3.14 0.08 2.00 2.11 3.55 3.51 3.69 0.98 2.01
1.87 1.95 5.57 6.99 7.80 10.83 7.17 1.93 2.58
1.00 6.38 2.96 9.02 8.90 14.71 1.95 3.30 2.01
1.12 2.61 2.68 3.24 1.95 1.24 0.63 0.00 1.24
1.03 0.10 2.44 1.25 1.85 1.23 1.24 1.66 2.38
5.88 1.96 2.62 0.00 2.07 4.11 1.84 3.60 0.00
0.91 0.00 7.61 8.06 0.00 1.75 1.78 0.00 1.83
1.86 0.00 0.80 0.00 6.85 0.00 0.00 0.00 1.14
Method: Place the data in a single column in Excel, then use the Histogram chart to create a histogram. Right-click on the chart to change the bin intervals. Here are three examples: As illustrated in Figures EX 3.14.1 a, b, and c, the presentation and interpretation of the histogram of only 72 elements is strongly dependent on the user's choice of the bin interval. In Figure EX 3.14.1a, with only three bins, one cannot see the detail, and the data appear to be ideally from an exponential distribution. In Figure EX 3.14.1b, there are too many bins to see the distribution trend. Figure EX 3.14.1c uses the recommended number of bins, about equal to the square root of the number of data (√72 = 8.485…), with the nominal bin width ((14.71 − 0)/8.485… = 1.733…) adjusted to a convenient value of 2, resulting in 8 bins.
FIGURE EX 3.14.1 a) data histogram with bin interval of 5; b) with a bin interval of 0.2; c) with a bin interval of 2.
FIGURE EX 3.14.2 Empirical CDF (relative frequency polygon) for sulfate concentrations.
Perhaps the data has an exponential distribution, but there is an odd high count of 5 in the bin with a class mark of 7. If n = 5 counts is the expected number, the Poisson distribution indicates that the 90% limits are between n = 1 and n = 9. There is substantial uncertainty in the bins with very low counts. So, maybe the distribution is exponential. To generate an empirical CDF, the relative frequency polygon, first sort the data from low to high, then assign a CDF value to each data in increments of 1/n. Then plot the CDF value i/n w.r.t. the data value. This is illustrated in Figure EX 3.14.2. The markers represent the 72 data values, and they are connected by a dashed line to preserve visual order, but not to imply that there is a consistent mechanism or any expectation of between-point values. The quartiles represent the data with CDF values of 0.25 and 0.75, which are 1 ppm and 3.3 ppm. Fortunately, some points fall exactly on the quartile values. The upper and lower limits that contain 90% of the data mean that 10% of the data are in the upper and lower extremes. Customarily, this means that 5% of the data are above the upper limit, and 5% are below the lower limit. These would represent data at the 0.05 and 0.95 CDF values. However, there are no data points exactly at those CDF values; so, linearly interpolating, the lower 5% limit is 0 ppm, and the upper 95% limit is 8.396. This of course should be rounded to match the implied precision of all the data, reporting an upper limit value of 8.40 ppm.
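The sort-and-increment construction of the empirical CDF is a few lines in any language; a sketch assuming Python with numpy, using only the first row of the table for brevity:

    # Empirical CDF (relative frequency polygon) as in Figure EX 3.14.2
    import numpy as np

    data = np.array([3.14, 0.08, 2.00, 2.11, 3.55, 3.51, 3.69, 0.98, 2.01])
    x = np.sort(data)
    cdf = np.arange(1, x.size + 1) / x.size   # CDF value i/n for each sorted datum
    for xi, ci in zip(x, cdf):
        print(xi, ci)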
3.6 Values of Distributions and Inverses

3.6.1 For Continuum-Valued Variables

For continuum-valued variables, x, the cumulative distribution function is the probability of getting a particular value or a lower value of a variable. It is the left-sided area on the probability density curve, often expressed as alpha. It is variously represented as CDF(x) = F(x) = α = p. Here we’ll use the CDF(x) notation. For continuous-valued variables, x, the probability distribution function, pdf(x), represents the rate of increase of probability of occurrence of the value x. An alternate notation is pdf(x) = f(x).
FIGURE 3.16 CDF and pdf illustrations for continuous-valued distributions.
The relation between CDF(x) and pdf(x) is
CDF(x) = ∫_{x_minimum}^{x} pdf(x) dx (3.82)
where x_minimum represents the lowest possible value for x. In a normal distribution, x_minimum = −∞. For a chi-squared distribution, x_minimum = 0. The left-hand sketch in Figure 3.16 illustrates the CDF and the right-hand sketch the pdf of z for a standard normal distribution (the mean is zero and the standard deviation is unity). At a value of z = −1, the CDF is about 0.158, and the rate of increase of the CDF, the pdf, is about 0.242. The notations are 0.158 = CDF(−1) and 0.242 = pdf(−1). In both you enter the graph on the horizontal axis, the z-value, and read the value on the vertical axis. For continuous-valued variables, the inverse of the CDF is the value of x for which the probability of getting the value of x or a lower value is equal to the CDF(x). The inverse would enter on the vertical axis to read the value on the horizontal axis. If the inverse question is, “What z-value marks the point for which equal or lower z-values have a probability of 0.158 of occurring?” then we represent this inverse question as z = CDF⁻¹(α). In this illustration, −1 = CDF⁻¹(0.158). The inverse of the right-hand pdf graph is not unique. If the question is to determine the z-value for which the pdf = 0.242, there are two values, z = −1 and z = +1.

3.6.2 For Discrete-Valued Variables

For discrete-valued variables, x, likely a count of the number of events, the cumulative distribution function is the probability of getting a particular value or a lower value of a variable. It is the left-sided area on the probability density curve, often expressed as alpha. It is variously represented as CDF(x) = F(x) = α = p. Again, we will use the CDF(x) notation. For discrete-valued variables, x, the point distribution function, pdf(x), represents the probability of an occurrence of the value x. An alternate notation is pdf(x) = f(x). Here, pdf(x) is a probability of a particular value of x, not the rate that the CDF is increasing. Unfortunately, the same symbol is used in continuum-valued distributions. The relation between CDF(x) and pdf(x) is
CDF(x) = Σ_{x_minimum}^{x} pdf(x) (3.83)
FIGURE 3.17 CDF and pdf illustrations for discrete distributions.
where x_minimum represents the lowest possible value for x. Normally x_minimum = 0, the least number of events that could occur. The left-hand sketch of Figure 3.17 illustrates the CDF and the right-hand sketch the pdf of s, the count of the number of successes, for a binomial distribution (the number of trials is 40, and the probability of success on any particular trial is 0.3). Note that the markers on the graphs represent feasible values. The light line connecting the dots is a visual convenience. It is not possible to have 10.3 successes. At a value of s = 10, the CDF is about 0.309, meaning that there is about a 31% chance of getting 10 or fewer successes. The pdf is about 0.113, meaning that the probability of getting exactly 10 successes is about 11%. The notations are 0.309 = CDF(10) and 0.113 = pdf(10). In both you enter the graph on the horizontal axis, the s-value, and read the value on the vertical axis. For discrete-valued variables, the inverse of the CDF is the value of s for which the probability of getting the value of s or a lower value is equal to the CDF(s). The inverse would enter on the vertical axis to read the value on the horizontal axis. If the inverse question is, “What s-value marks the point for which equal or lower counts have a probability of 0.309 of occurring?” then we represent this inverse question as s = CDF⁻¹(α). In this illustration, 10 = CDF⁻¹(0.309). The inverse of the right-hand pdf graph appears to be not unique. However, it might be. If the question is to determine the s-value for which the pdf = 0.113, there is only one value, s = 10. It appears that an s-value of about 13.5 could have such a pdf value, but the count must be an integer. The pdf of S = 13 is 0.126, and the pdf of S = 14 is 0.104. Although one could ask, “What count value, or lower, has a 30% chance of occurring?” it is impossible to match the 30% CDF = 0.3000 value exactly. S ≤ 9 has a CDF of about 0.196, which does not include the target 0.3000. S ≤ 10 has a CDF of about 0.309, which does. S = 10 is the lowest value that includes the target CDF. One convention is to report the minimum count that includes the target CDF value.
3.7 Distribution Properties, Identities, and Excel Cell Functions

3.7.1 Continuum-Valued Variables

3.7.1.1 Standard Normal Distribution

The statistic z is defined as
Z = (x − μ)/σ (3.84)
where μ is the mean of a variable x, and σ is the standard deviation of the variable. The z-value is dimensionless. As a capital Z it refers to a data value; as a lowercase z it refers to the variable. The mean of the z-variable is μ = 0, and the standard deviation of the z-variable is σ = 1. The distribution CDF is

CDF(Z) = (1/√(2π)) ∫_{−∞}^{Z} e^(−z²/2) dz (3.85)
The Excel cell functions are
Z = NORM.INV(α, μ, σ)

CDF(Z) = NORM.DIST(Z, μ, σ, 1)

pdf(Z) = NORM.DIST(Z, μ, σ, 0)
where α is the CDF value, which is labeled as the probability. The 4th variable in the NORM.DIST function is a trigger to return either the cumulative or the probability distribution value. For the standard normal z-statistic, use μ = 0 and σ = 1. The distribution is symmetric. Accordingly,
NORM.INV(α, 0, 1) = −NORM.INV((1 − α), 0, 1)

NORM.DIST(Z, 0, 1, 1) = 1 − NORM.DIST(−Z, 0, 1, 1)
3.7.1.2 t-Distribution

The t-statistic is defined as

T = (X̄ − μ)/(s/√n) (3.86)
in which s is the sample standard deviation, not the true population sigma, and X̄ is the sample average, not the true population mean. The degrees of freedom is ν, which is often n − 1. The Excel cell functions are
T = T.INV(α, ν)

CDF(T) = T.DIST(T, ν, 1)

pdf(T) = T.DIST(T, ν, 0)
where α is the CDF value, which is labeled as the probability. The distribution is symmetric. Accordingly,
T.INV(α, ν) = −T.INV((1 − α), ν)

T.DIST(T, ν, 1) = 1 − T.DIST(−T, ν, 1)
3.7.1.3 Chi-Squared Distribution

The Excel cell functions are
χ² = CHISQ.INV(α, ν)

CDF(χ²) = CHISQ.DIST(χ², ν, 1)

pdf(χ²) = CHISQ.DIST(χ², ν, 0)
where α is the CDF value, which is labeled as the probability, and ν is the degrees of freedom. The χ² distribution is not symmetric. The minimum value is zero. Since χ² nearly increases linearly with degrees of freedom, it is often reported as χ²/ν.
χ²/ν = CHISQ.INV(α, ν)/ν

CDF(χ²/ν) = CHISQ.DIST(ν(χ²/ν), ν, 1)
3.7.1.4 F-Distribution

F is a ratio of sample variances scaled by the population variance.

F = (s1²/σ1²)/(s2²/σ2²) (3.87)
The Excel cell functions are
F = F.INV(α, ν_numerator, ν_denominator)

CDF(F) = F.DIST(F, ν_numerator, ν_denominator, 1)

pdf(F) = F.DIST(F, ν_numerator, ν_denominator, 0)
where α is the CDF value, which is labeled as the probability, and υ is the degrees of freedom for the numerator and denominator values of the sample standard deviations. The F-distribution is not symmetric, the minimum value is zero. One can choose which sample is labeled 1 (and placed in the numerator) and the other labeled 2 (placed in the denominator). If one ratio is unusually small, then the other will be unusually large, and the extreme changes from the left tail to the right tail. Accordingly,
F.INV (a , u1 , u2 ) = 1/F.INV ( ( 1 - a ) , u2 , u1 )
F.DIST ( F , u1 , u2 , 1) = 1 - F.DIST ( 1/F , u2 , u1 , 1)
68
Applied Engineering Statistics
3.7.2 Discrete-Valued Variables 3.7.2.1 Binomial Distribution The cumulative binomial distribution is
CDF ( S ) =
S
æ nö
å çè k ÷ø p (1 - p ) k
n-k
(3.88)
k =0
The Excel cell functions are
s = BINOM.INV ( n, p, a )
CDF ( s ) = BINOM.DIST ( s, n, p,1)
pdf ( s ) = BINOM.DIST ( s, n, p, 0 )
Here, s is the number of successes in n number of trials where the probability of a success in any one trial is p. Alpha, α, is the CDF value associated with s or fewer successes. All variables are dimensionless. Further s and n must be integers and 0 £ n, and 0 £ s £ n. If you input a non-integer value in the BINOM.DIST function the Excel function truncates either s or n to the integer value. For example, s = 12.001 and s = 12.999 are both truncated to 12. If you specify an in-between value for α, the CDF value, in the BINOM.INV function Excel will return the next larger s. 3.7.2.2 Poisson Distribution The cumulative Poisson distribution function is
CDF ( x ) =
x
å k =0
l k e -l (3.89) k!
where x is the number count of events within a time interval (or a distance, area, or space interval, or a per item basis), and λ is the average number for the population. Here λ = μ. The Excel cell functions are:
CDF ( x ) = POISSON.DIST ( x , m , 1)
pdf ( x ) = POISSON.DIST ( x , m , 0 )
Unfortunately, there does not seem to be an inverse function, but it is fairly easy to use a trialand-error search (such as interval halving) to determine the count that matches a CDF value.
3.8 Propagating Distributions with Variable Transformations Often, we know the distribution on x-values and have a model that transforms x to y. For instance, y = Ln(x). The question is, “What is the distribution of y?”
69
Distributions
Figure 3.18 reveals the case of y = a + bx 3 when the distribution on x (on the abscissa) is normal. Note: For the range of x-values shown, the function is strictly monotonic, positive definite. As x increases, y increases for all values of x. There are no places in the x-range where either 1) the derivative is negative or 2) zero (there are no flat spots in the function). The inset sketches indicate the pdf (dashed line) and CDF of x and y, about a nominal value of x0 = 2.5 and the corresponding y0 = a + bx0 3 . Note that the pdf of x is symmetric, and that of y is skewed. The CDF of x indicates the probability that x could have a lower value. For any x there is a corresponding y, and since the function is strictly monotonic, the probability of a lower y-value is the same as the probability of a lower y-value. Then
CDF ( y = f ( x ) ) = CDF ( x ) (3.90)
Between any two corresponding points x1 and x2 separated by Dx = x2 - x1 , there are the two dy corresponding points y1 = f ( x1 ) and y 2 = f ( x2 ) separated by Dy = f ( x2 ) - f ( x1 ) @ Dx dx dy for small Δx values (meaning that is relatively unchanged over the Δx interval). Since dx CDF ( y 2 ) = CDF ( x2 ) and CDF ( y1 ) = CDF ( x1 ), the difference is also equal, and by definition:
y2
x2
y1
x1
òpdf ( y ) dy = òpdf ( x ) dx (3.91)
For small Δx intervals, the integral can be approximated by the trapezoid rule of integration, and in the limit of very small Δx,
pdf ( y ) = pdf ( x ) /
FIGURE 3.18 Illustration of a nonlinear distribution transformation.
dy (3.92) dx
70
Applied Engineering Statistics
To obtain the CDF(y) numerically integrate the pdf(y). Using the trapezoid rule of integration, with y sorted in ascending order. CDF ( yi +1 ) = CDF ( yi ) +
1 é pdf ( yi +1 ) + pdf ( yi ) ùû ( yi +1 - yi ) (3.93) 2ë
Initialize CDF ( y very low ) = 0.
dy This is only true if y = f(x) is a strictly positive definite function (if > 0 for all values dx in the range being considered). Example 3.15: Derive the pdf(y) when y = Ln(x), and x0 = 1, and x is a continuum variable that is uniformly distributed with a range of 0.6. If the average, or nominal, x-value is 1 and the range is 0.6 then it varies between 0.7 and 1.3. In the uniform distribution a = 0.7 and b = 1.3, and pdf ( x ) = 1.6666¼ for 0.7 £ x £ 1.3 , otherwise pdf(x) = 0. dy 1 = , then Since dx x
pdf ( y ) = 1.66x , 0.7 £ x £ 1.3
pdf ( y ) = 0, x < 0.7 or x > 1.3
(
)
Example 3.16: If y = a + b 1 - e - x/s and x is normally distributed, NID ( m x , s x ) , what is the pdf(y)? dy b - x/s b = e . As long as > 0 the function is strictly monoThe derivative of y w.r.t. x is s dx s tonic positive. Then
pdf ( y ) =
1 æ x - mx ö ÷ sx ø
- ç 1 e 2è 2p s x
2
s x /s e b
With a = 1, b = 7, s = 3, μx = 5, and σx = 1.5 the two graphs illustrate the functions. The first figure plots y(x) and pdf(x) and CDF(x) w.r.t. x. The second figure plots pdf(y) and CDF(y) w.r.t. y. Notice that the pdf(x) is symmetric, and the 50%ile value of CDF ( x ) = m x . The pdf(y) is not symmetric, and the 50th percentile value of
(
)
CDF ( y ) = a + b 1 - e - mx /s .
71
Distributions
b , x > 0, or y = a + bx 2 , x < 0 , or x y = a - bx , b > 0 , the analysis is similar but since larger x-values lead to smaller y-values, the subscripts on y need to be reversed. y 2 = f ( x1 ) and y1 = f ( x2 ) . And For a strictly negative monotonic function, such as y = a +
CDF ( y = f ( x ) ) = 1 - CDF ( x ) (3.94) y2
x2
y1
x1
òpdf ( y ) dy = - òpdf ( x ) dx (3.95)
For small Δx intervals, the integral can be approximated by the trapezoid rule of integration, and in the limit of very small Δx,
pdf ( y ) = - pdf ( x ) /
dy (3.96) dx
To obtain the CDF(y) numerically integrate the pdf(y). Using the trapezoid rule of integration, with y sorted in ascending order,
CDF ( yi +1 ) = CDF ( yi ) +
1 é pdf ( yi +1 ) + pdf ( yi ) ùû ( yi +1 - yi ) (3.97) 2ë
initialize CDF ( y very low ) = 0. Note: if the data are sorted in ascending order of x, then the y-values are in descending order. Then, Initialize CDF ( y very high ) = 1. dy This is only true if y = f(x) is a strictly negative definite function (if < 0 for all values dx in the range being considered).
3.9 Takeaway Whether the data distribution is normal (Gaussian) or not, it has a mean and a variance. Just because you can get an average and standard deviation from your data does not mean that the data is normally distributed.
72
Applied Engineering Statistics
The normal distribution is the one that will most frequently fit your data, but the others are important in specific cases. Be sure to choose the theoretical distribution that was derived from principles (data attributes) that best match your application. For continuum-valued variables, and using a practical viewpoint, it does not matter whether you use the ≤ or the < symbol (or similarly the ≥ or the > symbol). If < or >, you can get as close to the equality value as you wish, and effectively be at the ≤ or ≥ location. However, with discrete variables ≤ and < (or similarly ≥ and >) are different. Take care with discretized variables whether they represent the count within a category or the category.
3.10 Exercises 1. Derive Equations (3.7a) and (3.7b). 2. Derive the mean and variance relations for any one of the distributions of a discrete-valued variable. The uniform, and Poisson distributions are not too difficult. 3. Derive the mean and variance relations for any one of the distributions of a continuum-valued variable. The uniform, and proportion distributions are not too difficult. The exponential might be fun. 4. Over a 20-year driving history, a particular person “earned” two tickets. This represents l = 0.1 éë tickets per year per individual ùû . How many tickets per year would be expected in a city of 50,000 individuals with a similar driving style? 5. Derive that the average of z in Equation (3.55) is zero and the standard deviation of z is unity. 6. Show that the sum of all f(xi) values for a discrete distribution is 1 or show that the integral of a pdf(x) over all x-values is 1. 7. Match your choice of a continuum distribution to the data of Example 3.14 and determine the distribution coefficients that best fit your choice to the data. 8. Compare a normal distribution to a uniform continuum distribution, using the same mean and variance for both. 9. Use the Gaussian pdf model to show that the inflection points on the pdf curve (where the shape changes between concave to convex) are at the ±1σ deviation from the mean. 10. Repeat Example 3.15 using y = a + bx 2 . b 11. Repeat Example 3.15 using y = a + . x 12. If one process has an average of one event per 10 years, and a second process has an average of 0.05 events per year, would you expect the second process, over a two-year period, to have the same number of events as the first? 13. Graph pdf(x) and pdf(y) when y = ln(x). Use the exponential pdf(x).
(
)
14. The logistic model is CDF ( x ) = 1 / 1 + e ( ) , where s is a scale factor and c is the center. Show that if c = μ and s = σ/1.7, for your choice of parameter values, that the logistic CDF is very similar to the normal CDF. -s x -c
73
Distributions
15. Show that the analytical derivation of the inverse of the logistic CDF is simply done. 16. Show that the analytical derivation of the pdf of the logistic CDF is simply done. 17. If one delivery truck in a service has an average of 2 flat tires per year (l = 2 éëflats per year per truck ùû ), the λ value per year if there are three trucks in the fleet all with the same service is l = 6 éëflats per year per fleet ùû . Show that the probability of 0, 1, 2, 3, and 4 flats in a year as calculated for the fleet is the same as that calculated for the three individual trucks. For the fleet P ( x ) = 6 x e -6 / x !, and for an individual truck P ( x ) = 2 x e -2 / x !. Hint: How can one get 3 flats in the individual trucks? The answer is there are 10 ways: (3,0,0), (0,3,0), (0,0,3), (2,1,0), (2,0,1), (1,2,0), (1,0,2), (0,1,2), (0,2,1), and (1,1,1). The probability of the event (2,0,1) is P ( 2 for Truck A AND 0 for Truck B AND 1 for Truck C ) = P ( 2 ) * P ( 0 ) * P ( 1). 18. Use Figure 3.6 to determine the probability that x will be between 2 and 3 inclusive P ( 2 £ x £ 3 ), and also not including the value of 3, P(2 £ x < 3) . 19. Use the inverse of Equation (3.92) to derive a model of the distribution of x if the log transformation of x is nearly normal. Here y = ln(x). Scale x by its average, x * = x / x , now x * is centered on 1, so that y = ln x * = ln ( 1) = 0. This means that μy = 0, and if the distribution of y is nearly normal then 5% of the area is at CDF ( y = -1.645 s y ) = 0.05 . Apply Equation (3.92) twice. First to transform pdf(y) to pdf(x*), then pdf(x*) to pdf(x). 20. If you drop a marble and lose sight of where it went to hide, then randomly search for it, how many places must you look to be 99% confident of finding it? You decide the appropriate value of the probability of finding it on any particular search. First consider that there are an infinite number of places it could hide. Then consider that there are only a finite number of places.
( )
4 Descriptive Statistics Descriptive statistics are values of attributes which are calculated from sample data and are used either to describe sample characteristics or to estimate population parameters. The descriptive statistics most often used in engineering are the mean and the standard deviation. But there are many others.
4.1 Measures of Location (Centrality) Often termed the arithmetic average, the arithmetic mean is the primary statistic that locates the sample. Usually, it is simply termed the average or mean. If the histogram of sample values is created, the arithmetic mean or arithmetic average of all sample values is the centroid of the distribution. The sample mean is calculated by
X=
1 n
n
åX (4.1) i
i =1
where X is the average of the n sample members and Xi are the individual sample values. It can be shown that the population mean is the expected value of the sample mean. (If interested in the calculus and proof see the sections on expectations or expected values in any statistical theory or sampling theory text, or Section 4.3 here). But what is important is knowing that the sample mean, X , is the best estimator of the population mean μ with respect to being consistent, efficient, and unbiased. (An estimator is consistent if the values it predicts become closer and closer to the true value of the parameter as the size of the sample increases. An estimator is unbiased if its expected value is the value of the parameter itself. An efficient estimator is one that not only is unbiased but also has the smallest possible variance. Efficient estimators are often called “best” estimators because of those characteristics.) If the Xi values can only have discrete values, such as quality points for grades (an A = 4, a B = 3, etc.), then some values of the data are repeated, and you can use a weighted average of a group of numbers, the weighted mean can be found from
å f X (4.2) X= å f k
i =1 k
i
i =1
i
i
where the frequencies, fi, are the weighting factors associated with their respective Xi categories. Here f i = ni / N , where N is the total number of all data and ni is the count of items in the kth classification. In this case the denominator term DOI: 10.1201/9781003222330-4
å
k i =1
f i = 1. 75
76
Applied Engineering Statistics
Weighted means are useful for far more than just calculating grade point averages. If some of the Xi values are more important than others, you might need a weighted average of a group of numbers. As an example, you might be interested in a characteristic particle size (diameter) that represents the surface area of particles, have screened the particles into size classifications, and have the weight of particles in each screening category. There are more particles in the smaller screening size than in the same weight of a larger size. Here the fi values would not simply be the weight in each screen category. The mean might not be a feasible value. Roll a die many times. The average point value per roll will be around a value of 3.5 but that is not a feasible value on any particular roll. Contrasting the conventional arithmetic mean is the geometric mean, a representation of a characteristic value where the product of attributes is important.
Xgeo = N
Õ
N
Xi (4.3)
i =1
An alternate yet is the harmonic mean, where reciprocals are used to characterize the feature, as in heat transfer where individual coefficients represent conductance not resistance, but an average value of resistance is desired
X harmonic = éê ë
å
-1
Xi -1 ùú (4.4) i =1 û N
The median is value of the middle Xi value if N is odd, or the average of the two middle values if N is even. Sort the data in either ascending or descending order then the middle value is such that half the sample values are larger than the median and the other half, smaller. If N is odd, the median would be a feasible value, because it was one of the sample values; but if N is even, the average of the two middle values might not be feasible. The mode is simply the most frequently occurring sample value, which could be relevant in describing disparate categories, choices, preferences, etc. As an example of disparate categories, shoppers may buy 4 cans of soup and 2 heads of lettuce. However, the X-values in another example might have consistent units. The mode would be a feasible value. If the histogram of data is symmetric, then the median and mode are both reasonable representations of the arithmetic average. Proportion (also termed portion, fraction, probability) is a ratio of the number of events in a particular category to the total number of events, or trials.
p = n / N (4.5)
Odds is a statistical/probability term for the ratio of the probability of an event, p, to the probability of “not-an-event” q = ( 1 - p ) .
Odds =
p p (4.6) = q 1- p
Almost all of your work will involve the arithmetic mean and proportion, as estimates of centrality or location, and most statistical tests are developed for those measures of centrality, however, some nonparametric procedures are evaluations of the median, not the mean.
77
Descriptive Statistics
4.2 Measures of Variability You must consider how the data are distributed around that statistic of centrality. The most popular method of reporting variability is an estimate of the population variance, defined in three equivalent forms as follows:
SX2 =
å
n i =1
(X - X ) i
n -1
2
=
å
n
2 i
X - nX
i =1
n -1
2
=
å
å
2
n Xi2 - æç Xi ö÷ / n i =1 i =1 è ø (4.7) n -1 n
The estimated variance is the sum of the squares of the deviations of the individual data points from the arithmetic mean value of the sample divided by (n − 1). You may wonder why the sample mean uses division by n but the sample variance uses division by (n − 1). The answer is simple, although not obvious: Each statistic is divided by the number of independent data points (or degrees of freedom) used for its calculation. Because the value of X is used in calculating sample variance, that X value is presumed to be the truth. Accordingly, you are free to choose any values for (n − 1) of the Xi sample values were used to calculate X but the last Xi value is constrained to create the same X value. There only (n − 1) independent choices or degrees of freedom involved in the calculation of the sample variance. Note: The summation is over all sample values. Here, the Xi values represent the n observations from a large population (possibly of infinite number of possible observations). However, if the sample is of all possible data, then divide by n not (n − 1). If n is very large, then it is inconsequential whether n or (n − 1) is used, because the error will be insignificant relative to the inherent uncertainty in SX2 . The dimensional units on the number of data, n, is considered to be dimensionless, so the units on the sample variance, SX2 , are the square of the units on the data. The standard deviation of a sample is the positive square root of the variance, or
SX = SX2 (4.8)
Note: The standard deviation has the same dimensional units as the average and as the data. Another measure of variability often used is the coefficient of variation or CV, defined as
CV =
SX (4.9) X
The coefficient of variation, CV, expresses the standard deviation as a proportion of the mean. CV is dimensionless. Yet another widely used and misused measure of variability is the standard error. This statistic is obtained by first putting the sample variance on a “per observation” basis and then taking the positive square root as done for the standard deviation. The standard error of the mean is thus defined by
SX =
SX2 S = X (4.10) n n
As you can see, the standard error is always smaller than the standard deviation of the individual data, indicating that the sample mean is less variable than the original data
78
Applied Engineering Statistics
from which it was calculated. Consider looking at the values within each sample. One may be large, one small, and another in between. Now consider an average of 5 samples. The only way the average could be large is for each of the samples to be equivalently large, but this is improbable. In 5 samples some will be small and some in the middle. Accordingly, the variability of an average will be less than that of the individual samples. This relation is termed the central limit phenomena and will be derived in Chapter 8. Again, the dimensional units on sample size, the number of data, n, is considered to be dimensionless; so, the variance on the sample average has the same units as the variance on the individual data. If this is suspicious, consider that the real equation is SX n = SX 1, and since 1 = 1, it can be removed for convenience. Do not confuse these two estimates of variability; the sample standard deviation is used to test hypotheses about individual population values, but the sample standard error is used to test hypotheses about the mean of the population from which the sample was drawn. You should distinguish SX and SX for another reason. If the sample data are badly scattered, some people will report the standard error as if it were the standard deviation, thus trying to conceal what they perceive as poor data. When you hear the word “standard” used in describing the variability of data, always ask whether standard deviation or standard error is meant. If in doubt, ask to see the calculations. If the population is normally distributed (and most of the measuring situations you’ll encounter will be), X and SX2 are the best values for μ and σ2, the parameters of the normal distribution. However, if the population is not normally distributed, you can estimate the population parameters from the sample characteristics X and SX2 as described in Chapter 3. Data Range, R, is also a valid measure of variability, it is the highest value less the lowest value
R = MAX ( X ) - MIN ( X ) (4.11)
where X represents the vector (or listing) of all data, and MAX ( X ) indicates the value of the maximum of all values. This two-point calculation of the measure of variation is simpler than the sample standard deviation, but it only uses two data values. The sample standard deviation uses all n values and gives a more representative value. The X-values in Equation (4.11) could also be paired deviations between data, or residuals between data and a model. Percentiles are frequently used to represent data variation. These could be values that represent the lowest 25% or upper 75% of values, or lowest 10% or upper 90%, etc. Conceptually, these could be found by sorting the data then finding the particular value. For instance, if there are n = 1,000 data, sorted, then the 250th value would represent the 25th percentile value. Likely, however the value would be interpolated between two neighboring values. If there were 11 data, the 25th percentile value could be interpolated between the second and third in the list. Alternately, if the distribution is presumed to be normal, the percentiles could be estimated from the mean and variance (see Chapter 5).
4.3 Measures of Patterns in the Data A run is the sequence of ordered data with a like property. The statistic Runs is the number of runs in a dataset of dichotomous, exclusive outcomes. For example, when data are compared to a model, the residual is either + or −. In this sequence of signs + + − + − − − − + there are 5 runs (underscored). As another example, here is a coin flip of Heads and Tails
79
Descriptive Statistics
sequence, also with 5 runs, H H T H T T T T T H H. Alternately, the categories could be off/on, or 1/0. The number of runs is often used to determine whether the deviations between model and data are randomly occurring. If there are too few runs, it indicates that there is one or more long sections where the data are on one side of a model, indicating that the model does not match the process that generated the data. The data could be ordered chronologically or with respect to each input or response variable. In looking at residuals, values will be either + or −, above or below the model. There is a chance, however, that a residual will have a value of zero. In this case it is not a zero-crossing; include the data with the previous run. For example, this set of residuals has 4 runs: −3, −2, 0, −1, +4, +1, 0, +2, +1, −2, −3, +1, +3, 0 Example 4.1: What is the expected probability of a run of l, if the probability of either outcome is 0.5? The Geometric distribution generates the probability of a sequence of data with like property which ends with a data of the other property, when the expectation is that the probability of a data having a positive or a negative residual is the same p = ( 1 - p ) = q = 0.5 . f ( x = k ) = p k -1q = p k
Here k is the number of trials prior to getting one outcome of the other kind. So, using l as the length of a run the distribution of l with p = 0.5 is f ( l ) = p l = 0.5l
For example: l
p(l)
1 2 3 4 5
0.5 0.25 0.125 0.0625 0.03125
Note: A run cannot have an l = 0 value or a value greater than the number of data, N. If there is only one data value, it is a run of 1. If the data are alternating in their property, the number or runs is N. So, there are limits on l. 1£ l £ N
Example 4.2: What is the average run length (ARL), if the dichotomous outcomes are equally probable?
ARL =
l = N =¥
l = N =¥
l =1
l =1
å
lf ( l ) =
å lp = 2 éëaverage number of data per run ùû l
¥
It might be fun to confirm that
ålp = 2 through simulation. l
l =1
80
Applied Engineering Statistics
Example 4.3: Given N data, what is the expected number of runs in that set?
nexpected # of runs =
N [data ] ARL éëdata per run ùû
So, if there are N = 10 data to compare to a model, and the model was correctly matched the phenomena that generated the data, and the experimental errors on the data were independent, then the expected number of runs in the residuals would be
nexpected # of runs =
10 [data ] = 5 [runs ]. 2 éëdata per run ùû
Although one expects the number of runs to be N/ARL, with a finite sample size, there is a range of possible outcomes that could be obtained. Similarly, if you have a fair coin and flip it 10 times you expect to get 5 Heads (H) and 5 Tails (T). However, of course, because of the event vagaries in a small sample, you don’t expect to see exactly the 50/50 outcome expected in the population. Seven H and 3 T would be a reasonable outcome. Appendix Table A.3 (a and b) provides critical values of the distribution of runs, for finite N values.
Signs is a statistic that is simply the count of the number of “+” or “−” values, without regard to order. If the model goes through the data, then the number of “+” and “−” residuals should be nearly 50/50. Appendix Table A.1 gives critical values of the statistic signs. Correlation between two variables X and Y means that when X is high (or low) Y tends to be high (or low). Or it could be the opposite, when one is high the other tends to be low. Correlation indicates that there a relation between the two variables X and Y. Be aware that it does not mean that there is a cause-and-effect relation. The values of X and Y could both be consequences of a third variable. Correlation also does not mean that the relation is linear. It could be nonlinear. There is a diverse number of ways to create a correlation statistic. One is
r=
å
n i =1
( xi - x ) ( y i - y )
n
( xi - x ) i =1
( n - 1) å i = 1 1
=
å
n
2
å
n i =1
( yi - y )
2
(4.12)
( xi - x ) ( y i - y )
sx s y
The second form of the relation is achieved by dividing the numerator and denominator by (n − 1). The numerator term is called a covariance, a measure of how x and y co-vary. If xi and yi are jointly above or below their averages, then the elements in the numerator will all be positive, and the sum will become large. In the opposite case, if yi is below its average when xi is above, and vice versa, then the elements in the numerator will all be negative, and the sum will become negatively large. Alternately, if there is no relation between xi and yi then half of the elements in the numerator will be positive and half negative, and the
81
Descriptive Statistics
sum will tend to be around zero. So, if r is large positive there is a correlation, if r is large negative there is a negative correlation, and if r is small (around zero) there is not a detectable correlation. The quantification of large and small depends on the variation in the variables. So, the numerator is scaled by similar measures of the xi and yi variation. If there is perfect, linear positive correlation, yi = a + bxi , then y = a + bx , and as a result, r = 1. If there is perfect linear negative correlation, r = −1. And if there is zero correlation, ideally, r = 0. See Chapter 13 for alternate measures. Autocorrelation means self-correlation. If one deviation is high, the influence that caused that to happen persists, and the next deviation will likely be high also. As an example, on a partly cloudy day the cloud shadows pass by, but they shade one spot on the ground for a minute or so. The shade persists for more than a microsecond. In the shade, the temperature drops. If one temperature measurement is low, then the next (a second later) will probably be low also, until the cloud passes. Alternately, the data could be oscillating high to low, if for instance a controlling mechanism was over-correcting. In those examples the data represents a single variable. Also, it is in chronological order, but if could be ordered by another variable. Autocorrelation would indicate that the data are not independent, but that some influence is persisting. There are many structures for an autocorrelation statistic, and one can look at the adjacent values, or every second value, or third, etc. One autocorrelation statistic is very similar to the correlation statistic above. In this structure it is an autocorrelation of adjacent variables of “lag-1” (adjacent values in an ordered sequence).
å å n
r1 =
ri ri -1
i=2 n
ri 2
=
1 ( n - 1)
å
n i=2
( xi - x ) ( xi -1 - x ) sx 2
(4.13)
i =1
The variable ri in the sums is termed a residual, typically it is the difference between model and data, but here it represents the difference between data and average. Note that there are n − 1 terms in the numerator sum. Similar to the r in Equation (4.12), -1 < r1 < +1, and if r1 @ 0 there is no evidence of autocorrelation. The data might show the same number of values above a model as below, which would be expected if the model was representative of the process that generated the data. But, it could be that many of the + residuals are much greater than the characteristic residual. So, even if the count of signs is nearly 50/50, there is still a skew or bias in the data. A Sign-Rank Sum of Deviations is a common metric to indicate this. Sort the deviations by absolute value, then rank largest to lowest, then sum the ranks with a “+” (or with a “−”) deviation. If the sum is too high or too low, it indicates a skew in the residuals (see Chapter 7). Appendix Table A.2 reports critical values of the Wilcoxon Signed-Rank statistic. Often, we are classifying data (good products vs faulty products, the letter A or B), and have a count of the number of events in all the categories. We might be comparing treatments for a disease, raw material in a process, preferences in age groups, techniques for training a dog, or the success of an Artificial Intelligence algorithm. The treatments may lead to one of two (or more) outcomes. Place the data in a contingency table, Table 4.1. Here, the entry nA,1 represents the number of times Treatment A led to Outcome 1. If the two treatments are identical, if they have the same impact on the outcome, then you expect nA ,1 = nB ,1 , and nA , 2 = nB , 2 , but either vagaries in the experiments or unequal numbers of tests will not make them equal. So, calculate an expected value for each of the classifications. For instance, the expected value for nA,1 could be based on the total number
82
Applied Engineering Statistics
TABLE 4.1 An Example of a Contingency Table Treatment A
Treatment B
nA,1 nA,2
nB,1 nB,2
Outcome 1 Outcome 2
of experiments of Treatment A and the ratio of Outcome 1 to the total number of A and B experiments
EA ,1 = (nA ,1 + nA , 2 )
(nA ,1 + nB ,1 ) . (4.14) (nA ,1 + nA , 2 + nB ,1 + nB , 2 )
Alternately, the expected values could come from historical data or other experience. The chi-squared (χ2) statistic is the sum over all categories of the squared deviation of observed from expected, scaled by expected, for each category. k
c = 2
å i =1
(Oi - Ei ) Ei
2
(4.15)
If the value for χ2 is large, then the two treatments are probably not equal. Skewness is a measure of nonsymmetry of the histogram or pdf, and kurtosis is a measure of flatness, alternately, of the largeness or the tail area, compared to a normal pdf. These characterizations of distributions are just mentioned here, but of little practical consequence in the authors’ experiences.
4.4 Scaled Measures of Deviations There are several statistics that are used to quantify the magnitude of a deviation, relative to some base situation. The t-statistic is a normalized deviation between the average and true or expected mean (or other averages). It is scaled by the standard error of the average, so it is dimensionless.
t=
X-m (4.16) s/ N
The t-statistic indicates the number of standard errors that the average is from the mean. You might recognize that this is similar to the CV, except that standard error of the average, not standard deviation of the data is used as the measure of variation. There are several variations on what to use in the denominator estimate of numerator variation which depend on assumptions about the value of the standard deviation. See Chapter 6 for details on those separate cases. 2 2 Chi-squared is a ratio of two variances, with one presumed to be known. c = s s 2 , usually, and in this book, it is defined as the ratio times the degree of freedom.
83
Descriptive Statistics
c2 =
( N - 1) s 2
s2
(4.17) 2
2 Take care as to how it is defined. Some sources report c / ( N - 1) = s s 2 . 2 In a contingency table use, with enough data, the calculated value c =
k
å i =1
(Oi - Ei ) Ei
2
is
approximately distributed as the true ratio of variances with one presumed to be known. The F-statistic is a ratio of two variances with neither presumed to be known F = s1
2
s2 2
(4.18)
There is yet another r-statistic, r2. When one set of data, y, (expected to be a response or outcome) is plotted with respect to (w.r.t.) another set of data, x, (expected to be a cause or influence), an r2ratio is a conventional measure of how well a linear model, y = a + bx , fits the data. If a perfect fit, then yi = y i = a + bxi , and each residual, di = ( yi - y i ) = ( yi - a - bxi ), the difference between model and data will be zero. Then the sum of the squared deviaN
tions (SSD) will be zero, SSD =
åd
i
2
= 0. Alternately, if there is no trend, if b = 0, then
i =1
a = y , and the sum of the squared deviations will be the (n − 1) times the y-data variance, N
SSD1 =
N
åd = å ( y - y ) i
i =1
2
i
2
= ( n - 1) SX 2 . Here SSD1 means the sum of squared residuals N
i =1
from a model with one coefficient. SSD2 =
å ( y - a - bx ) i
2
represents the residual SSD
i
i =1
from a 2-coefficeint model. The ratio of the reduction in SSD of the 2-coefficeint model to the 1-coefficient model is termed r2
å i =1 ( y i - y ) N
SSD1 - SSD2 r2 = = SSD1
2
-
å
N i =1
( yi - a - bxi )
å i =1 ( y i - y ) N
2
2
(4.19)
The r2 value ranges between 0 (if there is no relation between y and x) to 1 (if the linear model perfectly relates y to x).
4.5 Degrees of Freedom This is not so much a characterization of the data, as it is an indication of residual flexibility to the data after some attributes have been fixed. DoF = u = N - k , where N is the number of data and k is the number of model coefficients or characterizations used in the comparison. If an average, X , is calculated from the data, and if we accept the average is the truth, then all of the data but one could have any value, but the last one must have a value that makes X true. Then N − 1 of the data values are free to change, and the DoF = u = N - 1. If a model is regressed to the data, and the model has four coefficients, then DoF = u = N - 4.
84
Applied Engineering Statistics
Example 4.4: Heichelheim obtained values for the compressibility factors for carbon dioxide at 100°C, over the pressure range from 1.3176 to 66.437 atm. (Heichelheim, H. R., The Compressibility of Gaseous 2,2-Dimethyl Propane by the Burnet Method, Ph.D. Dissertation, Library, University of Texas, Austin (1962), with permission.) Calculate the mean, variance, standard deviation, standard error, and the coefficient of variability for the portion of his data listed below. Compressibility factors 0.9966 0.9956 0.9936 0.9913 0.9873 0.9821 0.9747
0.9969 0.9957 0.9938 0.9912 0.9874 0.9823 0.9750
0.9971 0.9960 0.9940 0.9915 0.9980 0.9829 0.9758
You should get n
X=
å n = 0.98946667 xi
i =1
n
SX2 =
å
(x - X)
i =1
i
n-1
2
= 0.00005976
SX = SX2 = 0.00773022
SX =
SX = 0.00168687 n
CV =
SX = 0.00781251 X
Example 4.5: The following data represent a random subset from a study by one author (Rhinehart) of the academic trend in chemical engineering (ChE) students at Oklahoma State University. The first column represents the science, technology, engineering, math (STEM) grade point average (GPA) of students in their freshman and sophomore years. The second column represents the GPA of the same student in their upper level major ChE courses. The results have been sorted by STEM GPA. The study was intended to be useful in advising students. Fr. & So. STEM GPA 1.706 1.914 2.143
Jr. & Sr. ChE GPA 3.020 2.588 3.392 (Continued)
85
Descriptive Statistics
Fr. & So. STEM GPA
Jr. & Sr. ChE GPA
2.171 2.235 2.382 2.500 2.588 2.600 2.600 2.647 2.676 2.676 2.912 3.086 3.143 3.171 3.257 3.371 3.400 3.412 3.429 3.429 3.471 3.486 3.486 3.500 3.706 3.829 4.000 4.000
2.891 3.154 2.431 2.500 2.553 3.512 2.569 3.231 3.314 2.471 3.667 3.281 3.261 3.686 3.294 3.922 3.561 3.627 3.314 3.559 3.872 3.391 3.609 3.809 3.882 3.000 3.804 3.769
What are the arithmetic average and standard deviation of the two columns? What is the correlation r-statistic? Segregate the data into four categories: The students below average and above average in STEM GPA, and for each those below and above average in Major GPA. What is the count of number of students in each category? If there was no relation between columns, then there would be (ideally) an equal number of counts in each of the four quadrants. What is the chi-squared statistic value for the observed counts in the four quadrants? You should find
XSTEM = 2.9976¼
X Major = 3.2881¼
sSTEM = 0.6167 ¼
sMajor = 0.4658¼
r = 0.6627 ¼
86
Applied Engineering Statistics
Category
Observed count
High STEM, high GPA High STEM, low GPA Low STEM, high GPA Low STEM, low GPA
Eeach =
14 3 4 10
14 + 3 + 4 + 10 = 7.75 4
c 2 = 9.11987 ¼
4.6 Expectation The statistics above have been calculated using sample data. A sampling of the data might have a large number of observations (measurement values), but it is not the entire population; so, values such as sample average and sample variance are not the exact values for the entire population. “Expectation” is the term that means using the entire population to get the descriptive statistics. Of course, one never has the infinite number of measurements that represent the entirety of realizations (possible values) that could be obtained by sampling from the population. But, if one believes that a particular mathematical form of the probability distribution is valid, then one can use it to define population statistics. The expectation for the average is the f- or pdf-weighted sum. Starting with Equation (4.2) and using the area under the pdf curve to represent the number of Xi values, and recognizing that in the limit of very small dx the sum becomes the integral, and the integral of the pdf is 1
å f X = å (pdf dx)X = å f å (pdf dx) k
E(X ) = X
N =¥
k
i =1 k
i
i =1
¥
E(X ) =
ò ò
x pdf ( x ) dx
-¥ ¥
-¥
pdf ( x ) dx
i
i =1 k
i
=
i =1
ò
¥
i
i
i
x pdf ( x ) dx = m (4.20)
-¥
If you are inclined to enjoy placing the formula for a particular pdf from Chapter 3 into Equation (4.20) and integrating, you’ll find that E ( X ) = m . In a similar manner, with even more mathematical joy,
(
E (X - m )
2
)=ò
¥ -¥
(x - m )
2
pdf ( x ) dx = s 2 (4.21)
87
Descriptive Statistics
4.7 A Note about Dimensional Consistency 4.7.1 Average and Central Limit Representations The arithmetic average is represented as X=
æ Xi = ç ç i =1 è N
å
1 N
ö
N
åX ÷÷ø / N (4.22) i
i =1
Here X is the arithmetic average of N samples and Xi represents the individual samples in the average. Although the second version is numerically identical to the first, the second is dimensionally inconsistent. N is not dimensionless, a value might be N = 6 samples, not just a number, 6. Including units in brackets considering that X represents the weight in lbs in a sack, the equation is
X [ lbs ] =
1 éësampleùû N éësampleùû
æ Xi [ lbs ] ¹ ç ç i =1 è N
å
ö
N
åX [lbs] ÷÷ø / N éësampleùû (4.23) i
i =1
The second version now is obviously dimensionally inconsistent. The reduction in variance due to the central limit theorem is usually written as
s X = s Xi / N (4.24)
Here s X is the ideal standard deviation of the average of N samples and s Xi represents the standard deviation of individual samples in the average. This assumes that the variance on each sample is identical to all others. This is a very powerful and often-used concept. However, it is also dimensionally inconsistent. To maintain dimensionally consistency, some would prefer to see it written as
s X = s Xi
1 (4.25) N
which now shows the argument of the square root function is dimensionless. However, computationally with or without the 1, the numerical outcome is identical. 4.7.2 Dimensional Consistency in Other Equations (An Aside) Engineering and science often remove unity values from equations, because when doing the calculation, they have no impact on the result. Other common examples are Newton’s 1 First Law F = ma, and pressure drop in a fluid system DP = CD r v 2. In both, the dimen2 é kg - m ù sional unifier gc is missing. In SI units the value of gc = 1 ê , and with a unity value 2 ú ëN -s û it does not matter in a numerical calculation. But it does matter in other systems of dimensional units.
88
Applied Engineering Statistics
DPv . Here Q represents voluIn sizing a flow control valve, the formula is Q = Cv f ( x ) G metric flow rate and Cv has the same units, representing the flow rate through the fully open valve. Here, G represents specific gravity and f(x) is the fraction of maximum flow rate because of the valve position. Both are dimensionless. However, DPv is the pressure drop across the valve, which might have the units of psig or kPa. The equation is dimensionally incorrect but can be fixed by not removing the unity pressure scale factor, ξ = 1, DPv Q = Cv f ( x ) . Gx The argument of the logarithm must be dimensionless, but it is often simply represented as y = ln(x). With x and y having dimensional units, the equation is incorrect. Since ln ( x = 1) = 0, and subtracting zero to both sides of the equation y = ln ( x ) - ln ( x = 1) and combining the two log terms, y = ln x éë xunits ]/1[ xunits ùû the right-hand side (RHS) is dimensionally consistent, but since the log of a value is dimensionless, the equation is still incomplete. However, if the RHS is multiplied by 1 éë yunits ùû , then y = 1 éë yunits ùû ln x éë xunits ]/1[ xunits ùû . Now the numerical value is identical, and the units are dimensionally consistent.
(
)
(
)
4.8 Takeaway The most common measures are arithmetic average (for data centrality) and sample standard deviation (for data variability). These are the best sample statistics to estimate the value of the mean and sigma of the normal (Gaussian) distribution. Most data will be nearly normally distributed, making the average and standard deviation very meaningful. However, a good portion of data that you may be analyzing may not be normally distributed; and just because you can calculate an average and standard deviation for the data does not mean the use of Gaussian model-based analysis is justified. Be aware of the diverse measures of centrality, variability, patterns, and scaled metrics. Check equations for dimensional consistency.
4.9 Exercises
1. Derive Equation (4.2) from Equation (4.1). 2. If the weighting factor in Equation (4.2) is f i = ni / N , show that the denominator term
å
k i =1
f i = 1.
3. Some liquids have entrained particles, which would plug small orifices in the process line. This is undesirable. Plugging value, X, is a measure of the undesirability. To get the X-value for a batch of liquid, pump a sample through a filter at a uniform flow rate. As particles build up on the filter, the pressure drop across the filter increases. Plugging value is the volume of liquid that causes specified
Descriptive Statistics
pressure drop. If volume V of liquid with a plugging value of X1 is mixed with the same volume of liquid with a plugging value of X2, show that the mixture plugging value should be calculated as a harmonic mean. 4. Reconsider Exercise 4.3, and define a weighted harmonic mean if the two blended liquid volumes were to be V1 and V2. 5. Derive Equations (4.7b) and (4.7c) from (4.7a). 6. Derive the second expression in Equation (4.12) from the first. 7. Derive the second expression in Equation (4.13) from the first. 8. Use the data from Example 4.2 and plot the upper-level major GPA w.r.t. the lowerlevel STEM GPA. Ask your plotting routine to best fit a linear trend to the data and to display the r-squared correlation value. Compare this value to the square of the correlation r-value in Example 4.5.
9. Use Equations (4.20) and (4.21) to determine the μ and σ values for the contin1 uum uniform distribution where pdf ( x ) = for a £ x £ b , or else outside of that b-a range, pdf(x) = 0. That is not too difficult an analytical integration exercise. 10. You might want to practice calculating the various descriptive statistics in this chapter on your own data.
89
5 Data and Parameter Interval Estimation
5.1 Interval Estimation Given knowledge of the distribution and the parameter values, what value might one sample from the population yield? What is the expected range for possible sample values? This chapter shows how to answer that question. 5.1.1 Continuous Distributions An example may clarify the question. Example 5.1: If you knew that the experimental data is normally distributed with μ = 8 and σ = 1.5, what could be the value of one measurement, x? The sample value could be x = 500 or it could be x = −35, because the distribution permits values between ±∞, but such extreme values are improbable. A more useful question is, “What is the expected range for possible sample values?” The thumbnail sketch shows the upper and lower 10% limits and the corresponding standard normal z range. The upper 10% region will contain 10% of the data with z values greater than about 1.28, similarly, 10% of the data will be in the lower 10% region with z values less than about −1.28. This means that 80% (= 100% − 10% − 10%) of the samples will have z values between about −1.28 and +1.28. In Excel the function NORM.INV ( CDF, m , s ) returns the x-values, or NORM.INV ( CDF, 0, 1) returns the z-values. Here the 1.2815515… value has been truncated to 1.28 to be convenient but not undermine the implied precision of the μ = 8 and σ = 1.5 values.
Using the inverse of the definition of z, x = m + zs , and given the distribution parameter values, the value of x at the upper z-limit is x = 8 + 1.28 (1.5 ) = 9.92 . And the value of x at the lower z-limit is x = 8 - 1.28 (1.5 ) = 6.08 . As a result, one would expect that 80% of the sample x-values would be between 6.08 and 9.92. DOI: 10.1201/9781003222330-5
91
92
Applied Engineering Statistics
Note: This example did not assign units to the x, μ, or σ values. The variable might be the time that the mail is delivered, or the high outdoor temperature of your location on 27 April, or the impurity composition of cement. Whatever it is, x, μ, and σ will have the corresponding dimensional units. Note: Whether you use the inverse function, or a graph, or a table, the procedure is to define the CDF limits, then determine the value of the variable at those limits. This example estimated the x-range that would include 80% of the sample values. But, you might want to have greater surety about what might happen with a sample. So, you might want to know what the 95% limits are. Or, if the outcome is less critical, the 50% limits may be what you are seeking. Where safety and life are involved the 99.99% confidence may be desired. See Chapter 10 for a discussion on choosing appropriate confidence limits. Example 5.2: Use the normal distribution and μ = 8 and σ = 1.5 values of Example 5.1 to determine the x-values representing 95% of the possible observations. These are about at the z = ±1.96 values, giving an x range of 5.06 £ x £ 10.94 , and the 50% limits are about at the z = ±0.67 values, giving an x range of 6.995 £ x £ 9.005 . Accordingly, the range of possible values that you might report are strongly dependent on the choice of the CDF interval, alternately understood as the confidence interval, probability interval, or percentage chance of happening. So, you cannot just report an interval, you need to include the probability. The probability that the sample will have a value between 5.06 and 10.94 is 0.95. This could be stated as:
P ( 5.06 £ X £ 10.94 ) = 0.95 There is a 95% chance that the sample value will be included between 5.06 and 10.94.
Further, this example indicated the same level of concern for obtaining extreme high as extreme low values. The confidence interval was the central set of values, with half of the possibility of extreme values in the extreme high region and half in the extreme low region. But you might only have a concern about one side. For instance, if you are risking an investment in a business, any upper extreme return will be acceptable, and you might only be interested in the lower 25% of values. By contrast, if you want to pick up the mail on the way home after work, early mail deliveries are not an issue, but if a very late delivery, it will not be available when you get home. So, you might only be interested in the schedule that results in only a 1% chance of not getting mail. Example 5.3: Consider a one-sided concern in Example 5.2 above. If one was only interested in the upper 95% of values, then the z-values for CDF = 0.05 and CDF = 1 are −1.645 and +∞, translating to x-values of 5.5325 and +∞ which would be reported as the probability that the sample will have a value greater than 5.5325 is 0.95.
P ( 5.5325 £ X £ +¥ ) = 0.95
Both Examples 5.2 and 5.3 are true. But they present different results. So, more than just specifying the confidence interval, one must also specify how it is apportioned. If not explicitly stated, the custom, is to equally apportion the extreme probabilities, so C that a C% level of confidence (c as a fraction) means that 1 = 1 - c is the probability of 100
93
Data and Parameter Interval Estimation
being in either extreme, so that the CDF values that mark the extremes are ( 1 - c ) / 2 for the lower CDF value and 1 - ( 1 - c ) / 2 for the upper CDF value. The Level of Significance, α, is defined as the total extreme area under a pdf curve.
a = 1 - c (5.1)
If the probability of extreme values is apportioned equally CDFlower = a / 2 (5.2) CDFupper = 1 - a / 2
this means that the probability of getting a value in the lower extreme is CDFlower, and the probability of getting a value in the upper extreme is 1 – CDFupper. Certainly, the upper and lower probabilities do not have to be apportioned equally, one can choose any upper or lower probabilities that sum to the desired value of α. If not equally apportioned, as was the case with Example 5.3, to properly convey the meaning, report each side individually. Example 5.4: Repeat the Example 5.3, but use the 90% interval, with 1% allocated to the lower and 9% to the upper, then
CDFlower = 0.01 CDFupper = 0.91
The z-values are
zlower = -2.326 zupper = 1.341
and the x-values could be presented as
P ( X £ 4.51) = 0.01 P ( X ³ 10.01) = 0.09
Regardless of the continuous distribution model, the procedure to estimate the range that a sample value might provide is: 1. Define the distribution, and parameter values. 2. Determine the confidence interval desired, and the probability allocation to the lower and upper limits. Be sure that the values are appropriate to the context. Use this to assign the CDFlower and CDFupper values. 3. Determine the lower and upper statistic values from the inverse of the distribution. 4. If the statistic is a scaled value, un-scale it to determine the lower and upper X-values. 5. Report the lower and upper X-values along with the qualifying givens from Steps 1 and 2. Example 5.5: The standard deviation of a population is σ = 1.234 μg/L. What might be the 95% limits on the standard deviation of a sample with n = 10 data values?
94
Applied Engineering Statistics
Assume that the variance will be chi-squared (χ2) distributed with u = n - 1 = 9 degrees of freedom. The 95% limits means that c = 0.95. Using Equation (5.3) a a = 1 - 0.95 = 0.05 choosing to split the extreme areas equally, CDFlower = = 0.025 and 2 a CDFupper = 1 - = 0.975 . In Excel, values for the inverse of the χ2 distribution are found 2 using CHISQ.INV ( CDF,u ). The values are 2.70038… and 19.0227…. Using the definition 2 of c 2 = u s 2 from Equation (4.16), and solving for s = s c 2 / u the 95% limits on s are s 0.67593… μg/L to 1.7940… μg/L. Assuming that the variance is χ2 distributed, and rounding to values that are equivalent to the given σ = 1.234, P ( s £ 0.676 m g/L ) = 0.025 P ( s ³ 1.794 m g/L ) = 0.025
or with implied equal extreme areas as
P ( 0.676 m g/L £ s £ 1.794 m g/L ) = 0.95
there is a 95% chance that the standard deviation values could range between 0.676 μg/L and 1.794 μg/L. Example 5.6: If n = 10 values are sampled from a normal distribution with μ = 8 and σ = 1.5, what could be the 80% range of values of the average of the n measurements? When data are normal, the standard deviation of an average is calculated as
(
)
s X = s X / n , the z statistic for the average is then zX = X - m /(s X / n) . From z one can calculate the limits on the average X = m ± zXs X / n . The 80% range will encompass 80% of the events, leaving 20% (α = 0.2) in the extremes. Centering the extremes are the CDF values of 0.10 and 0.90. The corresponding z-values are ±1.2815515¼ which translate to about Xlower = 7.39 and Xupper = 8.61.
(
)
P 7.39 £ X £ 8.61 = 0.80
There is an 80% chance that the average of 10 randomly sampled values could be included between 7.39 and 8.61.
The procedure is similar for other continuous distributions (uniform, exponential, lognormal, etc.). 5.1.2 Discrete Distributions The procedure for interval estimation in discrete distributions is the same as that for continuum-valued variables, except that since CDF and X-values can only have particular values, the CDF and/or x-limits need to be truncated (up or down) to the best nearest feasible value. Example 5.7: Accepting that data are binomially distributed with an individual probability of success as p = 0.7, in n = 15 trials, what is the 70% range of the number of successes?
Data and Parameter Interval Estimation
95
The thumbnail sketch of the solution is included. Note that abscissa values range from 0 (the minimum possible number of successes) to 15 (the maximum number). The markers, circles, indicate point values, and the dotted line connecting the dots is there as a visual aid to sequence. It does not suggest that in between values are possible. One cannot have 3.456 number of successes. The number of successes must be an integer. The dashed lines represent the CDF range and corresponding number of successes.
70 = 0.3 . With conventional splitting of the two extreme 100 a a regions equally, the CDF values are CDFlower = = 0.15 and CDFupper = 1 - = 0.85 . As 2 2 it turns out there are no data points at those CDF values. The closest CDF values are 0.13114… at a count of 8, and 0.87317… at a count of 12. So the answer is that the range is 8 £ X £ 12 , but the probability of that interval is not the specified 70% of the problem statement, it is the CDF difference of 0.87317… − 0.13114… = 0.74203…. As a , the answer is The 70% range requires a = 1 -
P ( 8 £ X £ 12 ) = 0.74 .
There is a 74% chance that the number of successes will be between 8 and 12, inclusive.
Regardless of the discrete distribution model, the procedure to estimate the range that a sample value might provide is: 1. Define the distribution, and parameter values. 2. Determine the confidence interval desired, and the probability allocation to the lower and upper limits. Be sure that the values are appropriate to the context. Use this to assign the CDFlower and CDFupper values. 3. Adjust the CDF values to best match feasible values. 4. Determine the lower and upper statistic values from the inverse of the distribution. 5. If the statistic is a scaled value, un-scale it to determine the lower and upper X-values. 6. Report the lower and upper X-values along with the qualifying givens from Steps 1 and 2. Step 3 is fairly important. In Excel, the function BINOM.DIST ( x , n, p,1) returns the CDF value. But values for x and n must be integers. Excel truncates a noninteger x and n values. It does not round the value to the next nearest integer. Similarly, the inverse function BINOM.INV ( n, p, CDF ) returns the x value of the next higher feasible CDF value. For example, the CDF of x = 9, n = 15, p = 0.7 is 0.2783…, and that for x = 10, n = 15, p = 0.7 is
0.4845…. If you use the inverse to determine an x-value for a CDF of 0.2783 (a bit less than 0.2783…) it indicates 9. At 0.2784 (a bit more than 0.2783…, but still far from 0.4845…) it indicates 10, the next higher value, which should not appear until a CDF value of 0.484509…. In Step 3, look at the feasible count and CDF values just above and below those indicated from Step 2, and choose the ones closest to the CDF or count values that best match the application context and intent. If you are using another software environment, you need to understand whether it truncates, rounds, or rounds up discrete values. The procedure is similar for other point distributions.

Example 5.8: Accepting that data are Poisson-distributed with an average number of successes λ = 7.03, what is the 50% range of the number of successes?

The thumbnail sketch of the solution is included. Note that abscissa values range from 0 (the minimum possible number of successes) to 15 (but the upper limit is unbounded). The markers, circles, indicate point values, and the dotted line connecting the dots is there as a visual aid to sequence. The number of successes must be an integer. The dashed lines represent the CDF range and corresponding number of successes.
The 50% range requires α = 1 − 50/100 = 0.50. With conventional splitting of the two extreme regions equally, the CDF values are CDF_lower = α/2 = 0.25 and CDF_upper = 1 − α/2 = 0.75. These are the quartiles. As it turns out, there are no data points at those CDF values. The closest CDF values are 0.296983… at a count of 5, and 0.725172… at a count of 8. So the answer is that the range is 5 ≤ X ≤ 8, but the probability of that interval is not the specified 50% of the problem statement; it is the CDF difference of 0.42827…. As a result, the answer is

P(5 ≤ X ≤ 8) = 0.43.

There is a 43% chance that the number of successes will be between 5 and 8, inclusive.
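The same check for the Poisson case, again sketched in Python with scipy.stats (an assumption; any environment with a Poisson CDF will do):

    from scipy import stats

    lam = 7.03
    print(stats.poisson.cdf(5, lam))   # 0.2969..., closest feasible lower CDF, at a count of 5
    print(stats.poisson.cdf(8, lam))   # 0.7252..., closest feasible upper CDF, at a count of 8
    # the CDF difference quoted in the text for the 5-to-8 range:
    print(stats.poisson.cdf(8, lam) - stats.poisson.cdf(5, lam))   # ~0.428, reported as 0.43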
5.2 Distribution Parameter Estimation

Given knowledge of the distribution and a data value, what range of distribution parameter values might reasonably have generated that data value? Contrasting the question in Section 5.1, here the data have been acquired and the data value is known, but the distribution parameter values are unknown.
FIGURE 5.1 Illustration of an 80% probable range on the mean.
Figure 5.1 represents data that have occurred with an experimental value of the statistic at 5, on the horizontal axis. The statistic might be a sample average as an estimate of the population mean, a sample standard deviation as an estimate of the population sigma, or a sample proportion as an estimate of the population probability. The dashed vertical line represents the data location. If that experimental value were the true value of the distribution parameter, then you would know the true distribution and would be able to estimate the expected range of a data value that might be generated by the population. The question here is what value of a population parameter could have generated the data. The CDF curve to the right has a parameter value (an average value, a 50th percentile value) of about 6.5. It has a high probability of generating data in the 5.5 to 7.5 range and could generate a data value as extreme as about 3 or 9. The lower of the two horizontal dashed lines indicates that the probability it could have generated a data value as extreme as 5 (or lower) is about 10%. The CDF curve to the left has a parameter value of about 3.5, and the probability that it could have generated a data value as extreme as 5 (or greater) is also about 10%. The objective here is to determine the values of the right and left distributions that could have generated the experimental data value with a desired confidence. If the center of the right-hand curve were shifted toward the left, there would be a higher probability that the population could have generated the data. With population values between 3.5 and 6.5, it is not unexpected that the data value of 5 could have been generated. Here, between the population values of about 3.5 and 6.5, there is an 80% chance that the population could have generated the data.

In Figure 5.2, the population parameter values are shifted to 3 and 7. There is still a chance that either population could have generated the extreme data value of 5 (or more extreme), but here only a 2.5% chance for either. The combined extreme probability is 0.05 = (2.5% + 2.5%)/100%; so, with a population parameter range between 3 and 7, there is only a 5% chance that a data value of 5 would be considered extreme. The range of 3 to 7 includes the population parameter values for which a data value of 5 would not be considered extreme at the 5% level.

In this section, given the distribution model and a desired confidence, we'll calculate the population parameter values that could have generated that data. The procedure is:

1. Choose the population model that best fits the attributes of the data.
2. Choose a confidence interval that best matches the application context, and how to allocate the two extreme probabilities. Here, again, c represents the confidence that either might have generated the data, and α = 1 − c is the combined extreme,
FIGURE 5.2 Illustration of a 95% probable range on the mean.
improbable area. Nominally, α/2 would be the area in each tail, defining the CDF of the right-most curve at the data value x to be CDF(x) = α/2, and the CDF of the left-most curve at the data value x to be CDF(x) = 1 − α/2.

3. Determine the value of the population parameter, p, which makes CDF(x, p_lower) = 1 − α/2, and the one that makes CDF(x, p_upper) = α/2. Or use your alternate choices for the split of the extreme areas. This may need to be a root-finding exercise unless an inverse built-in function is available to do this.
4. If the statistic is a scaled value, un-scale it to determine the lower and upper distribution parameter values.
5. Report the p_lower and p_upper values along with the assumptions and choices in Steps 1 and 2.

Note: If the distribution is symmetric (such as the normal and t-distributions), then assuming that the data value is the true mean and asking what extreme values it might have generated gives the same answers as the method of this section.

Note: Some nonsymmetric distributions, such as the exponential, CDF = 1 − e^(−x/μ), are explicitly invertible. Given the value of the sample, x, and a desired CDF value, the population parameter can be calculated as μ = −x/ln(1 − CDF). However, if the distribution is not symmetric or not analytically convenient (most are not), then use the method of Section 5.2.1.

5.2.1 Continuous Distributions

We'll explain the procedure with an example.

Example 5.9: The Gulp-a-Cup Coffee Company utilizes spray-drying in their coffee production process. Nozzles employing internal mixing were recently installed for trial runs to determine the entrance pressure. What values of the new gas pressure range correspond to the 99% confidence limit for the population mean? Sample data, entrance gas pressure (psig), are: 52.00, 51.00, 51.80, 51.75, 51.30, 50.85, 50.25, 49.00, 48.65, 48.00.
The average entrance pressure is 50.46 psig. The sample standard deviation is 1.4331007… psig. The standard error of the mean, S_X̄, is calculated as S_X̄ = √(S_X²/n) = 0.45318625…
psig. Although there is suspicion that the data are not normally distributed (note the ending digits are all 0 or 5, and there is an unexpected number of integer values masquerading as decimal-valued numbers), the average of 10 should be about normally distributed. If we knew the population variance, we could use the z-statistic to estimate the range; but since the standard deviation is based on the sample, we'll use the t-statistic. As the sample has 10 observations, ν = n − 1 = 9. Desiring the 99% limits, the extreme area is 1%, α = 0.01, and splitting the extreme areas equally, the two CDF values are CDF_upper = α/2 = 0.005 and CDF_lower = 1 − α/2 = 0.995. The objective is to find the values of the t-statistics for the upper and lower curves. In Excel we desire T.DIST(t_lower, 9, 1) = 0.995 and T.DIST(t_upper, 9, 1) = 0.005. One could use any root-finding procedure to determine that the values are ±3.2498355…. The Excel Solver Add-In is convenient for root-finding, but it may need a reasonable initial guess of the values. However, in this case, one could also use the Excel inverse functions T.INV(0.005, 9) and T.INV(0.995, 9) to return the values of ±3.2498355…. Inverting the t-formula to generate the mean, μ = X̄ − t S_X̄, then approximately

P(48.99 psig ≤ μ ≤ 51.93 psig) = 0.99
We can reasonably expect that the average operating pressure will be between these limits if we use any other group (sample) of the same type of nozzles.
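The same 99% limits can be reproduced outside Excel; here is a minimal sketch in Python with numpy and scipy.stats (an assumed toolchain, not the one used in the text):

    import numpy as np
    from scipy import stats

    x = np.array([52.00, 51.00, 51.80, 51.75, 51.30,
                  50.85, 50.25, 49.00, 48.65, 48.00])
    xbar = x.mean()                          # 50.46 psig
    se = x.std(ddof=1) / np.sqrt(len(x))     # 0.45318..., the standard error of the mean
    t = stats.t.ppf(0.995, df=len(x) - 1)    # 3.2498..., same value as T.INV(0.995, 9)
    print(xbar - t * se, xbar + t * se)      # about 48.99 and 51.93 psig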
Note: Since the t-distribution is symmetric, the t-values are symmetric about zero (±3.2498355…). Consequently, the limits on the mean are symmetric about the average. In this special case one can assume that the data value is the true mean and use the Section 5.1 approach to determine the range of values that it might generate. The range of values will be the same, but the concept of Section 5.2 is different, and equal values for the two procedures only happen if the distribution is symmetric. If the distribution is not symmetric, not analytically invertible, or an inverse function is not available, then one is required to perform root-finding on the distribution.

Example 5.10: The standard deviation from a sample of n = 5 data is 3.05 mm. What might be the 95% limits on the variance of the population that generated the data?

Variance is the square of the standard deviation. From the information presented, the variance of the sample is 9.3025 mm², and variance is typically χ²-distributed. More precisely, V = ν s²/σ² is χ²-distributed. We'll divide the extreme areas equally, so that we are seeking CDF values of 0.025 and 0.975. The question is what σ²_lower and σ²_upper values make the sample s² represent the 0.025 and 0.975 limits. The degrees of freedom value is ν = n − 1 = 4.
The operation is to determine 0.025 = CHISQ.DIST(χ²_upper, ν, 1) and 0.975 = CHISQ.DIST(χ²_lower, ν, 1). Using root-finding, the values are χ²_lower = 11.14328… and χ²_upper = 0.484418…. Conveniently, Excel also provides the inverse function: CHISQ.INV(0.975, 4) and CHISQ.INV(0.025, 4) also return the χ²_lower = 11.14328… and χ²_upper = 0.484418… values. Translating χ² to population variance, σ² = ν s²/χ²:

σ²_upper = 4 (9.3025/0.484418…) = 76.8137… mm², and σ²_lower = 4 (9.3025/11.14328…) = 3.33921… mm².
P(3.34 mm² ≤ σ² ≤ 76.81 mm²) = 0.95
Translating variance to population standard deviation: P(1.83 mm ≤ σ ≤ 8.76 mm) = 0.95
Note: The low and high extreme values are not equidistant from the sample value. The lowest possible value of a variance is zero and the highest is unbounded. The χ2 distribution is not symmetric.
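As a cross-check on the χ² arithmetic, a minimal Python sketch with scipy.stats (an assumption; it mirrors CHISQ.INV):

    from scipy import stats

    s2, nu = 9.3025, 4
    chi2_hi = stats.chi2.ppf(0.975, nu)   # 11.1433..., sets the lower variance limit
    chi2_lo = stats.chi2.ppf(0.025, nu)   # 0.48442..., sets the upper variance limit
    print(nu * s2 / chi2_hi, nu * s2 / chi2_lo)   # about 3.34 and 76.81 mm^2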
5.2.2 Discrete Distributions

The procedure is identical to that above when the distribution parameter values are continuum-valued (such as probabilities, or an average).

Example 5.11: One mixing experiment over a one-week period indicates that 3 particles bounced out of the tank. The population of rogue particles is hypothesized to be Poisson-distributed. What are the 95% limits on the true mean, λ, of the number of particles bouncing out per week?

The 95% limits will be the center, with the extreme area split on either side: α = 1 − 95%/100% = 0.05. So, the two CDF values are α/2 = 0.025 and the complement 1 − α/2 = 0.975. The high possible value for the true mean will use the sample data value as its 0.025 CDF value, and the low possible value for the true mean will use the sample data value as its 0.975 CDF value. The objective is to find values for λ such that POISSON.DIST(3, λ_upper, 1) = 0.025 and POISSON.DIST(3, λ_lower, 1) = 0.975. Using a root-finding algorithm, λ_lower = 1.0898… and λ_upper = 8.7672…
P(1.09 particles per week ≤ λ ≤ 8.77 particles per week) = 0.95
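The root-finding can be automated in any environment; here is a hedged sketch using scipy's brentq root finder (an assumed tool, standing in for the Excel Solver approach in the text):

    from scipy import stats
    from scipy.optimize import brentq

    x = 3   # particles observed in one week
    # the Poisson CDF at x falls monotonically as lambda grows, so each root is unique
    lam_hi = brentq(lambda lam: stats.poisson.cdf(x, lam) - 0.025, 0.01, 50.0)
    lam_lo = brentq(lambda lam: stats.poisson.cdf(x, lam) - 0.975, 0.01, 50.0)
    print(lam_lo, lam_hi)   # about 1.09 and 8.77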
Note: The answer is not symmetric about the data. The high possible average of 8.77 is 5.77 particles/week more than the sample finding of 3 particles/week. If symmetric, the low side of 3 particles/week would be 3 − 5.77 = −2.77 particles/week. But the value of the average cannot be less than zero. The low extreme value of 1.09 particles/week is about 1.91 away from the sample finding.

Example 5.12: When my rain gauge indicates that we had 1 inch of rain, what might the true average have been?

The nominal rain drop has a diameter of about 0.08 inches. Its volume then is 0.00026808… in³. My rain gauge has a 1″ × 1″ opening, so when it reads 1″ of water, it contains 1 in³, which means it has accumulated about 3,730 drops of rain. Modeling the rain drops in a space and time interval as Poisson-distributed, and the experimental sample value as 3,730 drops per square inch of area per one rainfall duration, the question is what is the true population mean (Poisson formula lambda) that could have generated that sample value? Using the 95% interval, the question is, “What low value for λ could have generated the 0.975 upper limit sample value of 3,730, and what high value for λ could have 3,730 as its 0.025 lower limit?” Using the Excel function POISSON.DIST(3730, λ, 1) and root-finding, the values of lambda are
λ_lower = 3,612 rain drops and λ_upper = 3,851 rain drops.
Note: The deviations from average are 118 and 121, which is not symmetric because the Poisson distribution is not symmetric. Returning the count of drops to inches of height in the rain gauge,
P(0.97 in ≤ rain height ≤ 1.03 in) = 0.95.
Satisfyingly, the 95% interval for the true rainfall amount, which could be about ±3% from my 1″ reading, is only about 1/32 of an inch, not visibly detectable.

Note: Often when the number count is high, the normal distribution is a reasonable approximation for the Poisson or binomial. If the sample value of 3,730 is the true population mean, then the standard deviation of the Poisson distribution is its square root, 61.0737…. Since the standard normal distribution is symmetric, asking “What population mean could have generated the sample average?” is the same as asking, “If the population mean is 3,730, what might the sample value be?” At the 95% confidence, the boundary CDF values are 0.025 and 0.975, for which the z-values are ±1.95996…. Then the limits on the mean are X̄ ± zσ = 3730 ± 1.96 × 61.0737, for which
λ_lower = 3,610 and λ_upper = 3,849. These are very close to the values generated by the Poisson distribution.
Example 5.13: They told me it was a fair coin, with a 50/50 chance of winning. I flipped it 20 times and only won 3 times. That is a possibly realizable run of bad luck, I imagine; but I also want to know, “Is 3 out of 20 within the 99% range on the probability of winning an individual flip?”

Here, we'll split the extreme area in half. Then α = 0.01 and the CDF extremes are 0.005 and 0.995. The binomial distribution determines x, the number of wins in n trials where the probability of a particular win is p. So, we are solving for the value of p that makes CDF_lower = 0.995 = BINOM.DIST(x, n, p_lower, 1), and CDF_upper = 0.005 = BINOM.DIST(x, n, p_upper, 1). The values are p_lower = 0.03575… and p_upper = 0.44946…
P(0.035 ≤ p ≤ 0.449) = 0.99.
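Because the binomial CDF is not analytically invertible in p, a root finder is needed; a sketch with scipy's brentq (an assumed tool; Excel Solver plays the same role in the text):

    from scipy import stats
    from scipy.optimize import brentq

    x, n = 3, 20
    # the binomial CDF at x falls monotonically as p rises, so each root is unique
    p_lo = brentq(lambda p: stats.binom.cdf(x, n, p) - 0.995, 1e-9, 1 - 1e-9)
    p_hi = brentq(lambda p: stats.binom.cdf(x, n, p) - 0.005, 1e-9, 1 - 1e-9)
    print(p_lo, p_hi)   # about 0.0357 and 0.4495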
It appears that the 99% limits on the possible value of the probability of a win do not include 0.5. I'd be inclined to call it a foul coin.

In Excel, the BINOM.INV function returns the count of the outcome associated with the CDF: x = BINOM.INV(n, p, CDF). If the coin is fair, p = 0.5, then BINOM.INV(20, 0.5, 0.005) = 4 and BINOM.INV(20, 0.5, 0.995) = 16. If the coin is fair, one expects the 99% range of outcomes to be within 4 and 16 wins. The 3-win count is outside this range, corroborating the analysis above, but this inverse function does not provide the population parameter value.

Example 5.14: The developers of a new, less expensive method of manufacturing compression rings for 1/4 inch copper refrigeration tubing claim that their procedure has a
failure rate of only 0.1%, that is, 1 failure per 1,000 rings. You didn't believe their claim, so you had your purchasing agent order a sample of 1,000. Of that sample, three rings were defective. Construct the 99% confidence limit for the expected proportion of failures.

We'll choose to split the extreme area equally, α = 1 − 0.99 = 0.01, so that we are solving for the value of p to make CDF_lower = 0.995 = BINOM.DIST(x, n, p_lower, 1), and CDF_upper = 0.005 = BINOM.DIST(x, n, p_upper, 1). The p-values are 0.0006729… and 0.01093377…

P(0.00067 ≤ p ≤ 0.0109) = 0.99.
As the interval for p contains 0.001, corresponding to the fraction of defective rings claimed in the new manufacturing method, you’ll have to conclude that the developers’ claim may be legitimate. At this point, you have no way to tell whether the new method is better than the standard method of manufacturing the compression rings at a 99% confidence level.
5.3 Approximation with the Normal Distribution

We can often take advantage of the central limit theorem, which states that as the number of samples increases, the resulting distribution of their means, X̄, approaches the normal distribution with mean μ and variance σ²/n. Thus, the random variable for proportion, x/n, the number of successes per number of trials, has an approximate normal distribution with mean p and variance pq/n. For large n, and not near extreme values of either p or q, we can use the Z-statistic to construct an approximate confidence interval for p if Z is expressed as

Z = (P − p)/σ_p (5.3)

Substituting, we have

Z = (P − p)/S_p = (P − p)/√(pq/n) (5.4)
Proceeding in the usual manner of equal allocation of the extreme probabilities, we define the confidence interval for the proportion p as
P(z_α/2 < Z < z_1−α/2) = 1 − α (5.5)
Proceeding, to transform Equation (5.5), we have the desired approximate confidence interval on the proportion P:
P( P − z_1−α/2 √(P(1 − P)/n) < p < P + z_1−α/2 √(P(1 − P)/n) ) = 1 − α (5.6)
where P = x/n, and neither P nor Q may be 0 or 1.
Example 5.15: Repeat Example 5.14 but use the normal approximation for the distribution.

The sample values are Q = 0.003 and P = 0.997. The sample size of n = 1000 is large, and neither P nor Q is near 0 or 1. Using the Excel function NORM.INV(CDF, μ, σ) with NORM.INV(0.995, 0, 1) to return z_0.995, we find z_0.995 = 2.575829… and z_0.005 = −2.575829…. Now we can construct the interval for the proportion of failures:
P( 0.003 − 2.575 √(0.003(0.997)/1000) < q < 0.003 + 2.575 √(0.003(0.997)/1000) ) = 0.99
P(−0.0014533 < q < 0.0074533) = 0.99
Again, as the interval for q contains 0.001, corresponding to the fraction of defective rings in the new manufacturing method, you’ll have to conclude that the developers’ claim may be legitimate.
Note: The normal distribution is unconstrained. Values can range from −∞ to +∞, but the proportion is constrained. The p-value and the q-value can only be between 0 and 1 (inclusive). In using the normal distribution as an approximation to the binomial, near the p or q limit the normal distribution will permit out-of-range values, such as q = −0.0014533. Although the normal approximation to a particular distribution can often be justified, our preference is to use the distribution that best matches that expected for the data population.
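A minimal sketch of the Example 5.15 normal-approximation interval in Python (scipy.stats assumed), which also reproduces the infeasible negative lower bound noted above:

    import math
    from scipy import stats

    x, n = 3, 1000
    q = x / n                              # sample proportion of failures, 0.003
    z = stats.norm.ppf(0.995)              # 2.5758...
    half = z * math.sqrt(q * (1 - q) / n)
    print(q - half, q + half)   # about -0.0015 and 0.0075; the negative bound is infeasible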
5.4 Empirical Data

It could very well happen that you have a bunch of data (perhaps more than 1,000) representing a process, do not have a model for the distribution, and are asked to determine the upper and lower limits for either the data or the distribution parameter values.

5.4.1 Data Range

To estimate the confidence limits for the data, first sort the data from low to high value. There will be a high and a low value, representing the particular sample, not the population. You could report the data range, R = x_high − x_low, but this depends on the vagaries in the particular sample and is not the 100% possible range. Further, it uses only two of all the values you collected and ignores the information in the other values. A better measure is to use the empirical CDF of the data to estimate the 90% range or quartiles or suchlike. In the sorted data, there will be n data values. Assign to them index numbers 1 ≤ i ≤ n. Assign a CDF value to each data value using CDF = i/n. As usual, choose the desired confidence interval and its allocation to the two extreme limits, then search the sorted data for the i/n values representing the CDF_lower and CDF_upper values. Probably the i/n values of your data will not exactly match the desired lower or upper CDF values, so interpolate.

Both of those approaches are simple to implement. But still, those methods use just a few data values in the sample and ignore all of the information in the rest of the data. So, a preferred technique for more comprehensive analysis is to fit a best matching distribution
to the data, then use the techniques of Section 5.1 to determine the limits on the presumed distribution model. If you can expect the population to be normally distributed, then calculate the average and standard deviation of the sample, presume these to be the population mean and sigma, then use the techniques of Section 5.1 to determine the limits on the presumed distribution model.

Example 5.16: Use the data from Example 3.14, presume that the population is normally distributed, and estimate the 95% range on the data that might come from the population.

The average and standard deviation of the sample are approximately 2.785417 ppm and 2.910418 ppm, respectively. The 95% interval means that the extreme area is α = 1 − 95/100 = 0.05, and assigning equal probabilities to the extreme low and extreme high values means that we are seeking the population values representing the CDF_lower = 0.025 and CDF_upper = 0.975 values. Since the sample parameters have been estimated from the data, we'll use the t-distribution, with ν = n − 1 = 72 − 1 = 71 degrees of freedom. Using the Excel function T.INV(CDF, ν), the t-values are about ±1.9939434, and using the definition of the t-statistic to calculate the x-values, x = μ + ts, the 95% limits are estimated to be
P(−3.02 ppm ≤ X ≤ 8.59 ppm) = 0.95.
Note: The limits are rounded to match the implied precision of the data.

Note: The procedure is correct, but the negative concentration value is not feasible. A result that is not reasonable should be an indication that the assumptions need to be revisited. In Example 3.14, the experimental distribution of the data suggested that the data might be exponentially distributed, not normal.
Note: Just because you can calculate an average and standard deviation of a sample does not mean that the population is normal. Use a distribution that seems to best match the attributes of the process that generated the data, or that best matches the shape of the empirical distribution.

Example 5.17: Repeat Example 5.16, but presume that the population distribution is exponential, as is suggested by the empirical CDF of Example 3.14.

In an exponential distribution, the population mean and standard deviation are both the reciprocal of the exponential factor, α. This alpha is not the level of significance for the confidence interval, but the multiplier for the x-value in the exponential distribution CDF(x) = 1 − e^(−αx). The reciprocals of the sample average and sample standard deviation are about 0.3590313/ppm and 0.343593/ppm. The closeness of these two values to the expected single α value is a reassuring check on the presumption that the data come from an exponential population. We'll use an average of the two as a best estimate of the population parameter value: α = 0.351303/ppm. With the same CDF_lower = 0.025 and CDF_upper = 0.975 values, the inverse of the distribution, x = −ln(1 − CDF)/α, can be used to solve for the extreme x-values.
P(0.07 ppm ≤ X ≤ 10.50 ppm) = 0.95
Note: Using the exponential distribution, which appears to better represent the empirical CDF, returns more reasonable values for the 5% extreme boundaries of the data that could be generated by the process that generated the experimental data.
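Since the exponential CDF inverts explicitly, the limits take one line each; a plain Python sketch, with α as fitted above:

    import math

    alpha = 0.351303                          # per ppm, the fitted exponential parameter
    for cdf in (0.025, 0.975):
        print(-math.log(1.0 - cdf) / alpha)   # about 0.07 and 10.50 ppm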
5.4.2 Empirical Distribution Parameter Range

The question in this section is not what might be the data range that the population could generate, but what range of population parameter values might have generated the dataset. One has the dataset. A preferred method is called bootstrapping.
1. Consider that the sample represents all aspects of the population. The sample size, n, is large enough to reveal representative extreme values and all the vagaries that might be expressed by the data-generating experiment. We will call the sample of n data the surrogate population.
2. Assume a population distribution.
3. Specify the confidence limits desired for the population parameter values.
4. Randomly sample n data from the surrogate population. Sample with replacement, meaning that if a particular data value is taken from the surrogate population, it remains in the surrogate population and might be sampled again. For example, if the four data values in the surrogate population are 1, 2, 3, and 4, then a random sampling with replacement might generate the sampling 3, 2, 3, and 1. Note that the data value 4 is missing from the sample, that the data value 3 is repeated, and that the data of the sample are not in the same order as the surrogate population. This sample is termed a realization, and it represents a dataset of n values that could have been generated by the surrogate population.
5. Use the data in the realization set of Step 4 to generate the population parameter values (such as average, standard deviation, proportion, etc.) of the assumed population from Step 2. These might be directly calculated from the data, or by best fitting the presumed distribution to the data. This is one realization of the population parameter values.
6. Record the parameter values for that realization.
7. Repeat Steps 4–6 many times. Probably this means 100, or more, realizations. Now you have 100 or so realization values for the population parameters.
8. Create an empirical CDF of the parameter values and characterize the distribution of those values. Most likely they will appear normally distributed, but they may be exponential, log-normal, or other. The distribution of parameter values will not necessarily have the same distribution as that of the surrogate data in Step 2.
9. Use the technique of Example 5.15 or 5.16 to generate the confidence limits on the distribution parameter values from Step 7.
How many realizations are needed? A nominal number is 100, as stated in Step 7. Fewer may be fully functional, or more may be needed. It all depends on the vagaries in the surrogate population data, the number of data, and the precision needed on the parameter values. Once the bootstrapping procedure is automated, it is a trivial exercise to run it again. If a new randomized bootstrapping procedure gives the same results (within desired precision) as the prior one, then 100 realizations was enough. If the results are not equivalent, then increase the number of realizations, perhaps by a factor of 5, and repeat.
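A minimal bootstrap sketch in Python with numpy (an assumption; the statistic here is the average, per Step 5, but any fitted parameter could be recorded instead):

    import numpy as np

    rng = np.random.default_rng()

    def bootstrap_interval(data, n_realizations=100, conf=0.95):
        data = np.asarray(data)
        # Steps 4-6: resample with replacement and record the statistic of each realization
        realizations = [rng.choice(data, size=data.size, replace=True).mean()
                        for _ in range(n_realizations)]
        a = 1.0 - conf
        # Steps 8-9: read the limits from the empirical CDF of the realization values
        return np.quantile(realizations, [a / 2, 1 - a / 2])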
5.5 Takeaway

Interval estimation has two aspects, and for each there are two situations. When you are interested in a single data value: In one aspect you are given the distribution and associated parameter values and are asked to determine the range of values that a sampling might generate. In the second aspect, you are given the distribution, but not the parameter values, and you are given a data value, and you wish to determine the range of parameter values that could have generated the data. Don't get them mixed up.

When you have a set of data, again you could be either seeking the possible limits that the population might generate or the range on population parameter values that might have generated the data. Don't get them mixed up.

In either case, you need to specify a level of confidence on the interval, as well as the allocation of the upper and lower extremes. This is specific to a particular context. Traditionally, the default is the 95% confidence interval with equal allocation of the extremes. But that does not make it right for your particular application. Chapter 10 discusses choices for the confidence interval.

If the number of samples is high and the values are far from the constraints, then the normal distribution is a reasonable approximation to your data. If you do not know what the underlying distribution is, using the normal may be the best you can do. But seek to use the distribution that best matches that revealed by the empirical data. If your analysis leads to infeasible values, you probably need to change the distribution model.
5.6 Exercises

1. Data taken from the filter located in one section of the pilot plant are used to determine the specific cake resistance of a slurry. Several values of the variable, expressed in ft/lbm, have been calculated from data taken during the past month. Based on these values, what is the 95% confidence interval for the variability of the specific cake resistance?

Specific cake resistance (ft/lbm): 2.49 × 10^11, 2.40 × 10^11, 2.43 × 10^11, 2.30 × 10^11, 2.53 × 10^11, 2.67 × 10^11, 2.60 × 10^11, 2.50 × 10^11, 2.54 × 10^11, 2.55 × 10^11
2. Show the details of the calculations leading to the results in Examples 5.3 and 5.4.
3. Repeat Example 5.12 but consider that the rain gauge has a 1/3 inch diameter opening. What is the uncertainty on the rainfall if the amount in the gauge reads 1 inch?
4. Use the bootstrapping technique of Section 5.4.2 on the data of Example 3.14 to calculate the 95% range on the average. Compare your results to Example 5.17.
6 Hypothesis Formulation and Testing – Parametric Tests
6.1 Introduction

A hypothesis is based on a supposition or claim about something. The supposition might be, “This is a fair coin.” Or, “Treatment A gives a yield that is 10% better than Treatment B.” Or, “Including ratio in the controller strategy reduces process variability.” Or, “The moon is the same size as the sun.” Just because something is supposed does not mean it is true. But if the supposition is true, then it might be manifest in some attribute of the data. The hypothesis is a statement about what you expect to see in the data.

The hypothesis is about the attributes of the population, not about the attributes of the sample. For example, Sample 1 data are 3, 4, and 5; and Sample 2 data are 4, 5, and 6. The averages are X̄1 = 4 and X̄2 = 5. There is no question that the average of Sample 2 is greater than the average of Sample 1. There is no question that the averages are not equal. But these are samples from a population, not the truth about the population. The question is not about the sample. The question would be about the population. The related question may be, “Are the population means equal?” The two sample sets could come from the same population. Just because the sample averages are different does not mean that the population means are different.

The hypothesis must be tested. You consider what attributes the supposition might express, then do experiments and collect data that could be used to test the hypothesized aspect of the data. For example, if it is a fair coin, then after a bunch of flips, you expect the number of Heads (H) to equal the number of Tails (T). But the experimental data is a sample of the population, and it does not represent the definitive truth about the population. There will be a range of experimental results that will support the hypothesis, and a contrasting range that will indicate that the hypothesis is not likely true.

Consider the fair coin supposition as an archetypical illustration to explain the method of this chapter: How could one test the fair coin supposition? An answer is to flip a coin many times and record the number of H and T; if the results are 50/50, the coin is likely fair. The hypothesis is that the number of H is equal to the number of T. In an experiment of 20 flips, we might get 8 H and 12 T. It is not a 50/50 distribution, but this does not mean the coin is not fair. This is a likely outcome of a fair coin with only 20 trials.

Figure 6.1 illustrates the number of successes from the binomial distribution with n = 20 trials and p = 0.5 (a fair coin). The horizontal axis is the possible count of the number of successes (either H or T, depending on your choice of what constitutes a success), and the vertical axis is the binomial CDF. The dashed lines indicate that experiencing fewer than
FIGURE 6.1 CDF of the number of successes from a binomial distribution with n = 20 and p(H) = 0.5, and roughly 90% limits.
6 successes only has about a 5% chance of happening, and 14 or more successes about an equal chance. In between, the values of 6 through 13 successes are within the roughly 90% possibility of occurrence with a fair coin. Only 8 wins in 20 flips is not an improbable event. It does not provide strong evidence that the coin is foul.

Because of random events and the vagaries associated with experimental testing, the sample data will not reveal the true mean or true variance of the population. So, we need to see if the experimental statistic is so extreme that it probably would not be from the hypothesized situation. This will use techniques in Chapter 5, which means that a probability distribution that is representative of the population of outcomes needs to be chosen. Also, so does a level of confidence.

In the coin flipping illustration, the distribution of successes should be the binomial. If one chooses the 50% level of confidence, the level of significance, the extreme area, is α = (1 − 0.5) = 0.5. Equally splitting the extremes, the range of extreme outcomes represents the quartiles (CDF = 0.25 and CDF = 0.75). The hypothesis is P(H) = 0.5, and using n = 20 the binomial distribution model reveals that the probability of the number of heads being between 9 and 11 is 50%, P(9 ≤ x ≤ 11) = 0.5; and since the experimental outcome of 8 H is outside of that range one might claim “Foul!”. But should we reject the hypothesis, accuse the flipper of misconduct, and presume to extract justice for being defrauded, when there is only 50% confidence? Some people might claim “Foul!” if they lost on a single flip, but that is not adequate evidence to be confident in the accusation.

What level of confidence is appropriate? In the US legal system, for criminal cases, it is the intuitive “beyond a reasonable doubt”. In conventional economic business decisions, it is 95%. At 95% on the coin flipping with n = 20, a fair coin could provide between 5 and 14 successes, P(5 ≤ x ≤ 14) ≈ 0.95. And since the experimental outcome of 8 H is not outside of that range, one cannot claim “Foul!” with a 95% confidence. One cannot reject the hypothesis at the 95% confidence level.

Note: Not rejecting the hypothesis does not mean the hypothesis is true. It may be a trick coin that has a 0.45 probability of flipping a T. It may mean that the number of samples is not enough to see the level of detail needed to confidently reject the hypothesis. Also, it may mean the statistic and test chosen are not able to provide a legitimate test. We cannot use experimental evidence to claim that a hypothesis is true. Here are some examples from human history: They once hypothesized that the aether conveyed light and other forms of electromagnetic radiation through space, and based on that magical substance, the Maxwell Equations modeled electromagnetic transmission. The equations worked, and remain useful, but that evidence does not prove the presence of the aether.
They once hypothesized that heat was a fluid-like substance called caloric, and the differential equations describing how this mystical fluid flowed within material remain useful today. Indirect supporting evidence cannot prove that the aether or caloric exist.

If, for instance, the coin was not fair, with a probability of flipping an H of p(H) = 0.45, then getting between 6 and 13 successes is not unlikely (it is within the 95% of possible outcomes). Figure 6.2 compares the distribution with p = 0.50 to that of p = 0.45. The p = 0.45 value shifts the distribution slightly to the left. One could get a 50/50 split (10 each) with 20 flips of a trick coin with p = 0.45. Getting the expected outcome does not prove that the fair-coin hypothesis is true. Getting an 8/12 split does not prove that the fair-coin hypothesis is false. There are no absolutes.

Accordingly, the hypothesis testing will either “reject the hypothesis at the specified confidence” or “not reject the hypothesis at the specified confidence”. The “not reject” claim does not mean we have proved that the hypothesis is true. It just means either of two things: 1) the hypothesis might be true; or 2) the hypothesis might not be true, and we have insufficient evidence to confidently claim it is not true.

The parallel in the US legal system is to begin with the hypothesis that the accused is innocent. The jury sees the evidence, then either declares “guilty” (meaning there was enough evidence to confidently claim that the accused did commit the crime) or “not guilty” (meaning that there was not enough confidence to claim guilty). Note that “not guilty” does not mean “innocent” (although the accused might want to claim that a “not guilty” verdict proves innocence).

Unfortunately, the tradition in statistical hypothesis testing uses “accept” instead of “not reject”. If the hypothesis cannot be confidently rejected, if the statistical decision is “not reject”, the statistical term is to “accept” the hypothesis. However, accepting the hypothesis does not prove it is true.

The hypothesis being tested is usually the null hypothesis, meaning that there is no difference between a population and its expected value, or between the population parameters of two or more treatments. The symbol is H0. For every hypothesis, H, there is an alternate hypothesis, HA, which is the complement, or opposite, of the hypothesis. Rejection of the hypothesis automatically requires acceptance of the alternate hypothesis (but again, statistical “accept” does not mean proof that HA is true). No matter how careful you are, the probability of an erroneous claim exists. Uncertainty in hypothesis acceptance/rejection will be reflected in either a confidence interval about a parameter estimate, or a confidence level in the acceptance/rejection of the hypothesis.
FIGURE 6.2 Comparing the binomial CDF with n = 20, and p(H) = 0.5 (right-most CDF) and p(H) = 0.45 (left-most CDF). The dashed lines at 5 and 13 successes indicate the 95% confidence interval for the trick coin.
6.1.1 Critical Value Method

Here is an outline of the procedure for statistical hypothesis testing based on critical values:
1. Based on your supposition, hypothesize some expected feature of the data, and the alternate to that. The hypothesis will be about the population, not a sample.
2. Anticipate what an experiment might reveal if the hypothesis is true, and define an experimental procedure with a measurable outcome that could either reveal or counter the expected outcome.
3. Define the statistic for the test data. It might be any one of the several descriptive statistics described in Chapter 4 – a count, or a ratio of deviation from expected scaled by the standard deviation.
4. Choose the probability distribution that matches the characteristics (properties) of the test data associated with the statistic. You could use theoretical analysis, intuition, previous experience, or a match to the empirical data histogram or CDF as your basis for selecting the distribution.
5. Choose a level of confidence that is appropriate for the context. It might be 50% if the consequence of rejecting a hypothesis that is true is not critical, or 99.9% if you want to be very sure of only rejecting a hypothesis that is false. It might equally split the extreme regions, or only consider one side as extreme. In the absence of special context, the 95% limit is the traditional level of confidence for economic/business decisions, and the extreme areas are split equally.
6. Do the experiment and collect data.
7. Use the data to formulate the statistic (average, count, z, t, χ², F, etc.).
8. Use the chosen distribution and confidence to determine the expected range on the sample statistic corresponding to the confidence interval if the hypothesis were true.
9. If the sample statistic value is beyond the probable range of the distribution statistic, reject the hypothesis. If it is within the probable range, accept the hypothesis.
10. Report the decision in a manner that clearly reveals the choices you made in Steps 1, 2, 3, 4, and 5, and in a manner that does not cause the audience to misinterpret the accept/reject conclusion.
Note: As a caution, the conclusion will be dependent on the choices you make in Steps 1, 2, 3, 4, and 5. Continuing the fair coin illustration:
a. Suppose in Step 1 you hypothesize that it is not a fair coin, that the chance of an H is only 0.45. Then the data would have you accept the hypothesis and claim that the coin was foul, because if p(H) = 0.45 then p(H) ≠ 0.5.
b. Suppose in Step 2 you decide that a test will be to look at the bottom side of a flipped coin to check that it had a T and an H on opposite sides. Then if the H-and-T coin was beveled and weighted to make the aerodynamics of a flip have a mere 10% chance of being an H, the test will be collecting irrelevant data. The test will find that in 100% of the flips the coin has both a T and an H. But having two sides does not mean that they are equally probable. Be sure that the hypothesis is relevant to the supposition.
c. Suppose in Step 3 you choose a normal distribution, not the binomial. Then the basis for any conclusion will not be grounded in the ideal truth about the events.
d. Suppose in Step 4 you choose a 5% confidence. Then the lower 47.5% and upper 52.5% CDF values of expected outcomes are both 10 successes, and any reasonably expected fair outcome other than x = 10 would lead you to reject the fair coin hypothesis.
e. Finally, in Step 5, if you wanted to show that the coin was foul, you could contrive an experimental protocol so that the flipping is not exactly unbiased. It may be a mechanical flipper designed to only make one flip during the path, with each flip starting with an H-up coin orientation. Be sure that the experimental process is fair.

Critically review your choices in Steps 1–5 to ensure that your choices are legitimate.

6.1.2 p-Value Assessment

The critical value procedure is one approach to hypothesis testing. There are alternate approaches to rejecting the hypothesis, to assessing the degree of violation or improbability of a data outcome if the hypothesis is true. The issue with the critical value approach is: If the statistic value is very near to, but does not exceed, the critical value, the statistical statement is “accept the hypothesis”. For instance, if the expected number of successes in flipping a fair coin is 1,000 (out of 2,000 flips), and the 99% confidence is chosen, the two critical values are 942 (or fewer) and 1,058 (or more) successes. If there are only 943 successes, the conclusion would be to accept the fair coin hypothesis because there is not enough evidence to reject the hypothesis at a 99% confidence, because 943 is not in the critical region. Instead of being directed to think the coin is fair, at least, the audience should be informed of how close the statistic is to the critical values. Although the statement is “accept the fair coin hypothesis”, the trial outcome was one event out of 2,000 away from being rejected. The trial outcome of 943 successes is very close to the critical value of 942 or fewer, which would lead to the claim “reject with a 99% confidence”. This closeness to the reject decision needs to be revealed. The p-value approach is one way.

The p-value is the probability that the data could have an outcome as (or more) extreme if the hypothesis is true. For instance, in the coin flipping situation of 943 successes out of 2,000 trials, the CDF associated with that is 0.005747…. In Excel use the BINOM.DIST(s, n, p, 1) function: BINOM.DIST(943, 2000, 0.5, 1) = 0.005747…. There is a 0.005747 probability of getting 943 successes or fewer. Since the fair coin hypothesis is a two-sided test that is rejected if there are either too few or too many successes, the p-value is twice that value, or ~0.0115. There is only a 1.15% chance that a fair coin could have such an extreme outcome. So, the qualification to the “accept the fair coin hypothesis” statement could be extended to include, “But, the p-value, the probability of a fair coin producing such a lopsided outcome, is 0.0115 – a 1.15% chance. It is fairly improbable that the coin is fair.”

For a one-sided t-test the p-value is T.DIST(T, ν, 1) if T < 0, or 1 − T.DIST(T, ν, 1) if T > 0. Since the t-distribution is symmetric, for either case
p-value_(1-sided, t-distribution) = 1 − T.DIST(|T|, ν, 1) (6.1)
For a two-sided t-test the p-value is
p-value_(2-sided, t-distribution) = 2[1 − T.DIST(|T|, ν, 1)] (6.2)
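A hedged sketch of Equations (6.1) and (6.2) in Python with scipy.stats (the T and ν values are hypothetical, chosen only for illustration):

    from scipy import stats

    T, nu = 2.31, 9                  # hypothetical experimental t-statistic and dof
    p_one = stats.t.sf(abs(T), nu)   # sf is 1 - CDF; Equation (6.1)
    p_two = 2 * p_one                # Equation (6.2)
    print(p_one, p_two)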
Note the absolute value of T is used for both. Here is a modified outline of the procedure for statistical hypothesis testing which uses p-values to determine the improbability of the hypothesis:

1. Same.
2. Same.
3. Same.
4. Same.
5. Same.
6. Same.
7. Same.
8. Use the chosen distribution and determine the probability (p-value) that the population could have generated such an extreme value of the sample statistic if the hypothesis is true.
9. If the p-value is improbable relative to the choice in Step 5, reject the hypothesis. If it is within the permissible range, accept the hypothesis.
10. Report the decision and p-value in a manner that clearly reveals the choices you made in Steps 1, 2, 3, 4, and 5, and in a manner that does not cause the audience to misinterpret the accept/reject conclusion.

Note: Again, as a caution, the conclusion will be dependent on the choices you make in Steps 1, 2, 3, 4, and 5.

The advantage of the p-value approach is that it reveals the magnitude of the deviation from the hypothesis. It conveys more than just the accept/reject decision. For instance, the critical value for a test statistic might be 12.34, and any larger value would reject the hypothesis. If the test statistic value is 12.33, nearly rejecting the hypothesis, but not, it leads to the “accept” decision. This black/white accept/reject dichotomy does not relay to the audience how close the decision is. The p-value would report that such an extreme value of the test statistic has a probability of 0.051, indicating it is close to the standard 95% confidence interval limit. Further, a test statistic of 12.35 gives the same “reject” decision as one that might have a value of 278.3. Reporting a “reject” decision does not reveal that the second value is not even close. However, its p-value might be 0.000001. Reporting the p-value indicates the strength of justification for rejecting (or accepting) the hypothesis.

6.1.3 Probability Ratio Method

For an equality hypothesis, such as H: μB = μA + k, which can be expressed as a null hypothesis H: μB − μA − k = 0, the test has a two-sided rejection region. Representing the means would be the averages, and the statistic would involve X̄B − X̄A − k, perhaps as a t-statistic scaled by the standard deviation of the difference of the averages, T = (X̄B − X̄A − k)/s_(X̄B − X̄A).
If the hypothesis is true, the ideal t-value is zero. Reject if the t-value is too extreme, either “+” or “−”. If the distribution is symmetric and the hypothesis is true, then the chance of getting a T > 0 is the same as a T < 0, 50%: P(T > 0) = 0.5 and P(T < 0) = 0.5. If, however, the hypothesis is not true, then there will be a higher probability of getting either “+” or “−” values. For example, if the data indicate X̄B − X̄A − k = +5, and that value better represents μB − μA − k than a value of 0, then the chance of getting a negative value from the experiment is lower than a 50% chance.

Figure 6.3 illustrates the two cases with the CDF of t-statistic distributions. The dashed curve represents the hypothesis μB − μA − k = 0. It is symmetric about the T-value of zero, and the CDF value at T = 0 is 0.5; half of the expected t-values from experiments should be >0 and half <0. If the experimental T > 0, and μB − μA = X̄B − X̄A, then P(T < 0) = 1 − T.DIST(T, ν, 1). If the experimental T < 0, and μB − μA = X̄B − X̄A, then P(T > 0) = 1 − T.DIST(−T, ν, 1). These can be generalized using the absolute value of the experimental T: if μB − μA = X̄B − X̄A, then P(T having the other sign) = 1 − T.DIST(|T|, ν, 1). The ratio of P(T having the other sign | μB − μA − k = 0) = 0.5 to P(T having the other sign | μB − μA = X̄B − X̄A) = 1 − T.DIST(|T|, ν, 1) for testing the null hypothesis of a symmetric distribution is the probability ratio.
Pr = 0.5/[1 − T.DIST(|T|, ν, 1)] (6.3)
If the experimental T-value is close to the hypothesized zero, then the CDF value will be close to 0.5, and the probability ratio will be around unity. However, if the experimental T-value is not close to the hypothesized zero, then the denominator will have a low value and the probability ratio will have a large value.

As illustrated in Figure 6.3, Pr = 0.5/(1 − 0.87) ≈ 3.8. The chance of getting a t-value of 1.2 if the hypothesis is true is about 2 × 0.13 = 0.26, which is far greater than the normal 0.05 criteria
FIGURE 6.3 Illustrating the probability of getting a t-value on the other side of zero.
for rejecting the hypothesis. So, a probability ratio of 3.8 is not adequate cause to reject. However, if the probability ratio is 20:1, it means that there is 20 times greater chance of getting the sign of T than if the null hypothesis were true. Perhaps 20:1 odds is strong enough evidence to reject. Not unexpectedly, the three criteria are related. For the two-sided t-test, the probability ratio is the reciprocal of the p-value.
p-value_(2-sided, t-distribution) = 1/Pr = [1 − T.DIST(|T|, ν, 1)]/0.5 = 2[1 − T.DIST(|T|, ν, 1)] (6.4)
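A small sketch of Equations (6.3) and (6.4), using the Figure 6.3 value T = 1.2 (the degrees of freedom, ν = 9, is a hypothetical choice for illustration):

    from scipy import stats

    T, nu = 1.2, 9
    tail = stats.t.sf(abs(T), nu)   # 1 - T.DIST(|T|, nu, 1), about 0.13
    Pr = 0.5 / tail                 # Equation (6.3), about 3.8
    print(Pr, 1.0 / Pr, 2 * tail)   # 1/Pr equals the two-sided p-value, Equation (6.4)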
The probability ratio is valid when considering the two-sided null hypothesis for a symmetric distribution. The Pr is not the traditional statistical metric, but the odds that it represents seem to be a more familiar metric to most people than either the critical value or the alpha level of significance.

6.1.4 What Distribution to Use?

The normal distribution describes most data, and more so the average of data, and is therefore the most widely used model for comparing the mean of continuous distributions. However, the normal distribution is not the appropriate choice for all datasets. Chapter 7 provides test procedures to determine whether or not the normal distribution is a valid model.

When comparing averages to an expected mean, or averages to averages, when sigma is known, the standard normal distribution is probably the correct choice. When sigma is estimated from the data, the t-distribution is probably the correct choice. When comparing a sample standard deviation to an expected sigma, the chi-squared distribution is probably the correct choice. When comparing two sample standard deviations, the F-distribution is probably the correct choice.
6.2 Types of Hypothesis Testing Errors

We discuss the sources of errors (not human mistakes, but naturally occurring deviations, vagaries, noise) in Chapter 8. As a result of experimental errors, the characteristics of a sample from the same population may vary from sample to sample. Although the sample is expected to be representative of the population, it may not always be. As the estimates of the population parameters may vary with the sample, it is possible that one of two types of hypothesis testing errors may be made when using sample data to evaluate the hypothesis.

Just because you have a hypothesis does not mean that it is true. We suppose that something is true and create a hypothesis about an expected feature of the test outcome. It might be true, but it might be false.

A Type 1 error occurs when a true hypothesis, H, is rejected. We use the Greek letter α to indicate the probability of committing a Type 1 (T-I) error, i.e., rejecting a hypothesis when
it is in fact true. The probability of not making a T-I error is (1 − α) and is the basis of the confidence intervals in Chapter 5. α is also called the level of significance, indicating the chance we are willing to take of committing a T-I error.

As an example, Figure 6.1 shows the CDF of the number of successes (flipping an H) for a fair coin, p(H) = 0.5 and n = 20 trials. There is a possibility of only getting 2 heads, or only 4, or even 15, but these are extremely rare outcomes. The probability of getting 6 or fewer wins (successes) or 13 or more (s ≤ 6 or s ≥ 13) is roughly 10%. In that example α ≈ 0.1. If the fair-coin acceptance region is from 7 to 12 heads inclusive (7 ≤ s ≤ 12), then there is about a 10% chance, α ≈ 0.1, of rejecting the fair-coin hypothesis when it is true. That is about a 10% chance of making a T-I error.

A Type 2 error occurs when we accept a hypothesis, H, when it is actually false. The Greek letter β represents the probability of committing a Type 2 (T-II) error. The power of a test is defined as (1 − β), which is the probability of rejecting a false hypothesis. The probability of committing a T-II error depends, partly, on the degree of wrongness in the false hypothesis. For example, if the hypothesis is that the coin is fair, but in reality p(H) = 0.499, the fair-coin hypothesis is false. But it would take hundreds of thousands of flips to differentiate the outcome from the hypothesized p(H) = 0.5. So, in all practicality, after a reasonable number of flips, the data would reveal 7 ≤ s ≤ 12, and the false hypothesis would be accepted. That is a T-II error. By contrast, if the coin has p(H) = 0.9, then in just a dozen flips the fair coin hypothesis would be rejected.

As an illustration, see Figure 6.4. The CDF of the fair coin, p(H) = 0.5, is to the right and the CDF of a trick coin, p(H) = 0.45, is to the left. The 90% fair-coin hypothesis acceptance region is a number of successes 7 ≤ s ≤ 12, roughly α = 0.1. However, the trick coin will also generate successes between 7 and 12, inclusive; the rejection area for 6 or fewer is about 13% for the trick coin, and for 13 or more it is about 1%. So the probability of rejecting the trick coin as being fair is about 14%, or of accepting the trick coin as a fair coin is about 86%. Here the chance of making a T-II error is 0.86, β = 0.86.

The probability of committing a T-II error depends on the choices of both α and n as well as the magnitude of the deviation from the hypothesis. In Figure 6.5 for the trick coin p(H) = 0.2 (left-most CDF), α = 0.1 and n = 20 are as before. With the larger deviation from the hypothesis, there is only a 9% chance that the trick coin will give a count outside the ≤6 or ≥14 rejection region. Here β = 0.09.

In Figure 6.6 for the trick coin p(H) = 0.45 (left-most CDF) and α = 0.1, as in Figure 6.4, but here n = 200. With the larger number of samples there is only a 42% chance that the trick coin
FIGURE 6.4 Comparing the binomial CDF with n = 20, and p(H) = 0.50 (right-most CDF) and p(H) = 0.45 (left-most CDF). The vertical dashed lines at 6 and 13 successes indicate the 90% confidence interval for the fair coin, but the horizontal lines are the corresponding CDF values for the trick coin.
FIGURE 6.5 Comparing the binomial CDF with n = 20, and p(H) = 0.5 (right-most CDF) and p(H) = 0.2 (left-most CDF). The vertical dashed lines at 6 and 14 successes indicate the 90% confidence interval for the fair coin, but the horizontal lines are the corresponding CDF values for the trick coin.
FIGURE 6.6 Comparing the binomial CDF with n = 200, and p(H) = 0.5 (right-most CDF) and p(H) = 0.45 (left-most CDF). The vertical dashed lines at 88 and 112 successes indicate the 90% confidence interval for the fair coin, but the horizontal lines are the corresponding CDF values for the trick coin.
FIGURE 6.7 Comparing the binomial CDF with n = 20, and p(H) = 0.5 (right-most CDF) and p(H) = 0.45 (left-most CDF). The vertical dashed lines at 9 and 11 successes indicate the 40% confidence interval for the fair coin, but the horizontal lines are the corresponding CDF values for the trick coin.
In Figure 6.6 the trick coin has p(H) = 0.45 (left-most CDF) and α = 0.1, as in Figure 6.4, but here n = 200. With the larger number of samples there is only a 42% chance that the trick coin will give a count in the ≤88 or ≥112 rejection region. Here β = 0.58, an improvement over β = 0.86 for the n = 20 case in Figure 6.4. As a final exploration of the impact of choices on the T-II error, in Figure 6.7 the trick coin again has p(H) = 0.45 (left-most CDF) and n = 20, as in Figure 6.4, but here α = 0.6. With the larger α, a greater chance of making a T-I error and a lower confidence of 40%, there is only a 28% chance that the trick coin will give a count inside the acceptance region. Here β = 0.28.
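The binomial arithmetic behind these α and β values is easy to script. Here is a minimal sketch, assuming Python with scipy (the chapter itself works with Excel functions); the acceptance region 7 ≤ s ≤ 12 and the coin probabilities are taken from the discussion above. Exact values depend on how the boundary counts are assigned, which is why the figures in the text are quoted approximately.

```python
# Sketch: T-I and T-II error probabilities for the coin-flip test above.
from scipy.stats import binom

n = 20
lo, hi = 7, 12                        # acceptance region: 7 <= s <= 12

# alpha: chance a fair coin (p = 0.5) falls outside the acceptance region
alpha = binom.cdf(lo - 1, n, 0.5) + binom.sf(hi, n, 0.5)
print(f"alpha = {alpha:.3f}")         # roughly 0.1, as in the text

# beta: chance a trick coin falls inside the acceptance region
for p_trick in (0.45, 0.2):
    beta = binom.cdf(hi, n, p_trick) - binom.cdf(lo - 1, n, p_trick)
    print(f"p(H) = {p_trick}: beta = {beta:.2f}")
```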
Further, the probability of committing a Type-II error is not the complement of the probability of committing a T-I error, i.e., β ≠ (1 − α). Consider two populations, X1 and X2, for which μ1 is really less than μ2. However, when the samples are collected and the data averaged, you find that X̄1 > X̄2. If the hypothesis is H: μ1 ≤ μ2, the data may cause us to reject H, committing a Type-I error. Conversely, the data might have been such that X̄1 ≈ X̄2. Depending on the variances involved, we might accept H: μ1 = μ2, thus making a T-II error. Unfortunately, no test of a null hypothesis has yet been devised that simultaneously minimizes both types of error. Increasing the number of samples can reduce both, but it increases experimental cost. Hypotheses are accepted or rejected based on the choice of α, the probability of a T-I error. Various considerations to guide that choice are given in Chapters 10 and 17. However, β, the probability of a T-II error, is also important. The value of β depends on the degree to which the hypothesis might be wrong, the number of data, and the alpha-value that sets the acceptance limits. Economic considerations for these aspects are also given in Chapter 10.
6.3 Two-Sided and One-Sided Tests

There are several basic choices for the hypothesis and its alternate. Using the population mean μ as the example of a continuum-valued statistic, the four choices could be represented as:
$$H_0: \mu = \mu_0 \quad \text{vs.} \quad H_A: \mu \ne \mu_0 \qquad (6.5)$$

$$H: \mu \le \mu_0 \quad \text{vs.} \quad H_A: \mu > \mu_0 \qquad (6.6)$$

$$H: \mu \ge \mu_0 \quad \text{vs.} \quad H_A: \mu < \mu_0 \qquad (6.7)$$

$$H: \mu \ne \mu_0 \quad \text{vs.} \quad H_A: \mu = \mu_0 \qquad (6.8)$$
The first hypothesis says that you expect the population mean to have a particular value, μ0. The alternate hypothesis in this case has two possibilities: HA1: μ < μ0 and HA2: μ > μ0. Because there are two options, evaluation of the null hypothesis given in Equation (6.5) requires a two-sided, or two-tailed, test. So does the hypothesis of Equation (6.8). On the other hand, the hypotheses of Equations (6.6) and (6.7) are one-sided: any statistic value to one side would lead to acceptance, and only the one extreme side would lead to rejection. Note: In this discussion, the statistic does not have to be the mean, as illustrated in Equations (6.5) to (6.8). It could be a variance, or a scaled statistic such as t, χ2, or F. There are several other hypothesis choices that are essentially equivalent to those four. These two are essentially equivalent to Equations (6.6) and (6.7):
$$H_0: \mu < \mu_0 \quad \text{vs.} \quad H_A: \mu \ge \mu_0 \qquad (6.9)$$

$$H_0: \mu > \mu_0 \quad \text{vs.} \quad H_A: \mu \le \mu_0 \qquad (6.10)$$
If the possible values for μ were discrete, perhaps counting integers, and μ0 = 5, then the hypothesis μ < μ0 could only permit the numbers 0 through 4, while the nearly
equivalent hypothesis μ ≤ μ0 could permit the numbers 0 through 4 and 5. These test procedures are equivalent, because the test for μ < 5 has the same rejection/acceptance areas as the test for μ ≤ 4. Although, mathematically, for continuum-valued variables the μ ≤ μ0 condition includes the μ0 point value and the μ < μ0 condition excludes the point value, the μ < μ0 condition permits being infinitesimally close to that point. However, in any practical application with continuum-valued variables, numerical truncation of the digital values will exceed that infinitesimal amount; further, error on the data values and the mismatch between the ideal distribution and the true sample-generating phenomena will be even larger. In theory, we can differentiate less-than from less-than-or-equal-to; in reality, we cannot.

There are four more possible hypotheses! Hypotheses could also use not-greater-than and not-less-than, but these are identical to less-than-or-equal-to and greater-than-or-equal-to. And, not-less-than-or-equal-to and not-greater-than-or-equal-to are equivalent to greater-than and less-than.

Equations (6.6) and (6.7) are complementary to each other. The method of rejecting/accepting the less-than hypothesis is the same as rejecting/accepting the greater-than hypothesis, except for using the rejection region to the right instead of to the left. These use the same procedures.

Equation (6.8), the not-equal hypothesis, is not practical. Except in mathematical concept, nothing is equal. We cannot make things equal. The can of juice indicates it is 8 oz. That is a nominal or target value. To have the contents be exactly 8 oz means that it cannot be over or under by one molecule. Nothing is truly equal. In reality, for continuum-valued variables, the hypothesis H: μ ≠ μ0 is effectively always true. Chance might make it appear true for discrete variables. So, essentially there are only two procedures that need to be considered: a two-sided test for Equation (6.5), and the identical (but right vs. left) one-sided test for Equations (6.6) and (6.7).

After selecting the significance level of a two-sided test, it is customary to divide the chance of committing a T-I error into halves and to assign one half to each of the alternate hypotheses. These halves define the critical regions for the test of the null hypothesis. The values of the test statistic defining the limit between each too-extreme region and the acceptance region are called the critical values, and the too-extreme regions are termed critical regions. If the value of the test statistic calculated from sample data falls in either critical region, we reject the hypothesis. It is possibly false, and we accept the alternate hypothesis as possibly true. For this reason, the critical regions are also termed rejection regions. In the case of a two-sided test, the area under the pdf curve in each rejection region is conventionally set to 100(α/2) percent of the total area under the curve. The CDF values of the critical regions are α/2 and (1 − α/2). All the remainder of the area is the acceptance region. If values of the test statistic fall in the acceptance region, we accept the null hypothesis as possibly true. Figure 6.1 shows the rejection regions (outside the dashed lines) and acceptance region (between the dashed lines) for a two-sided test of a hypothesis of the type in Equation (6.5), for the hypothesis μ = μ0.
Alternately, if the hypothesis is that of Equation (6.8), μ ≠ μ0, then the rejection region would be between the dashed lines, and the acceptance region would be the more extreme values beyond either line.

Figure 6.8 illustrates the one-sided rejection region associated with Equation (6.6) for roughly a 90% confidence. If the hypothesis is μ < μ0, then the acceptance region is below the CDF = 0.9 value (or, for the statistic value, to the left of the dashed line on the horizontal axis), and the rejection region is above and including (or to the right). If the hypothesis is μ ≤ μ0, then the acceptance region is below and inclusive of the CDF = 0.9 value.
FIGURE 6.8 Illustrating a one-sided test with acceptance region with lower CDF values and a statistic to the left. Representing a continuum-valued statistic which is unbounded on either side, such as the t-statistic.
FIGURE 6.9 Illustrating a one-sided test with acceptance region with higher CDF values and a statistic to the right. Representing a continuum-valued statistic which is bounded on one side such as the χ2 or F-statistic.
Again, the statistic does not have to be the value of the mean; it could be a count, average, variance, t, χ2, etc. The complement situation is used for the hypothesis μ > μ0. In Figure 6.9 the illustration is for the χ2 statistic, χ2 = vS²/σ0², and the hypothesis σ > σ0. The acceptance region is above the CDF = 0.1 value (or, for the statistic value, to the right of the dashed line on the horizontal axis), and the rejection region is below (or to the left) and inclusive of the critical value. Thinking about permissible values of μ or σ for the hypothesis will reveal whether the CDF for 90% confidence should be 0.9 or 0.1. In Figure 6.9 the hypothesis is σ > σ0. If the ratio S/σ0 is large, or very large, or infinity, then you accept that σ > σ0 is possibly true. In contrast, if the ratio S/σ0 is small, or the extreme of zero, then you have confidence in rejecting the σ > σ0 hypothesis. The rejection region for the hypothesis σ > σ0 is unusually small values, the low CDF region. Some use the mnemonic that the less-than or greater-than symbol in the alternate hypothesis points to the rejection region.

We use the terms “accept” and “reject” with regard to hypothesis testing throughout this and subsequent chapters to be consistent with the terminology you will find in practice. The actual meaning of “accept” is “there is insufficient evidence (from the sample) to confidently reject the hypothesis.” We stress that acceptance of either the hypothesis or its alternate does not imply that one is absolutely true and the other is absolutely false. Acceptance indicates only that the hypothesis could be true. The actual meaning of
“reject” is “there is sufficient evidence (from sample data) to confidently reject the hypothesis; we believe that the hypothesis is untenable, it is possibly false.” The determination of the critical region(s) proceeds as illustrated above whether the test statistic is t, χ2, F, or another. In a test using the average of a sample (a test about the mean), it is usually acceptable to use the normal distribution. Even if the data are not normally distributed, the average of more than several data values will be nearly normally distributed. If the variance value is known, use the standard normal statistic, z. The t-statistic and distribution are used when hypotheses on the mean must be tested but the population variance σ2 is not known. The χ2 and F statistics are used to test hypotheses about the variances of one or two populations, respectively.
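Throughout the chapter, critical values come from Excel's inverse-CDF functions (NORM.INV, T.INV, CHISQ.INV). For readers working in Python instead, here is a minimal equivalent sketch (an assumption; the text itself uses Excel) built on scipy's ppf, the percent-point (inverse CDF) function:

```python
# Sketch: scipy equivalents of the Excel inverse-CDF calls used in this chapter.
from scipy import stats

alpha = 0.05

# two-sided z critical values, cf. NORM.INV(alpha/2, 0, 1) and NORM.INV(1 - alpha/2, 0, 1)
print(stats.norm.ppf([alpha / 2, 1 - alpha / 2]))    # [-1.95996  1.95996]

# one-sided t critical value with v = 9, cf. T.INV(0.05, 9)
print(stats.t.ppf(alpha, df=9))                      # -1.833112...

# chi-squared critical values with v = 11, cf. CHISQ.INV(CDF, v)
print(stats.chi2.ppf([0.025, 0.975], df=11))         # [3.8157  21.9200]

# an F critical value for a variance-ratio test with (8, 8) degrees of freedom
print(stats.f.ppf(0.99, dfn=8, dfd=8))
```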
6.4 Tests about the Mean

The Z tests concerning the mean of a single population and those involving the means of two populations are strictly valid only when the corresponding populations are normally distributed and the population variance is known. If the variances of the populations are estimated by S2 (and if the population empirical CDF appears approximately normal), the t-test must be used for tests concerning the mean. We will now discuss some examples illustrating the concepts of testing hypotheses concerning the mean.

Example 6.1: Suppose a normally distributed population has a variance of 6. We don't know the population mean μ, but we expect its value to be μ = 2. By experiment, we collect a sample of 20 items for which X̄ = 3.1. We want to test the null hypothesis H0: μ = μ0 = 2 against the alternate hypothesis HA: μ ≠ 2. This test is two-sided because two rejection regions are defined by HA. Since the variance is known, we will use the standard normal distribution to test the hypothesis. We select α = 0.05 as the significance level for the test and, dividing the rejection probability into two equal parts, the critical CDF values are α/2 = 0.025 and 1 − α/2 = 0.975. The associated critical values of Z are zα/2 = −1.95996… and z1−α/2 = 1.95996…. In Excel these values can be obtained from the function NORM.INV(CDF, 0, 1). This can be stated as:
$$P(z_{\alpha/2} < Z < z_{1-\alpha/2}) = 1 - \alpha$$

$$P(-1.96 < Z < 1.96) = 0.95$$

where the rounded value 1.96 is used instead of 1.95996…. The symbolic representation means the 95% limits of the z-statistic value are between −1.96 and +1.96. The thumbnail sketch of the standard normal CDF shows these limits.
We next calculate the value of the data statistic, Z, based on the sample data:

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{3.1 - 2}{\sqrt{6/20}} = 2.0083\ldots$$
Because 2.008 > 1.96, the result is outside the 95% confidence limit on Z. The experimental data is too extreme to be believed as probably generated by a μ = 2 population. Accordingly: We reject H0, accept HA, and conclude with 95% confidence from the Z test that μ is probably not 2.

To support this with a p-value, first determine the CDF value that z = 2.0083 represents. The Excel function NORM.DIST(z, 0, 1, 1) returns 0.977694, meaning that the chance of getting z = 2.0083 or higher is 1 − 0.977694 = 0.022306. But since this is a two-sided test, the chance of such an extreme value is twice that. The p-value for the test statistic is 0.04461. There is only a 4.5% chance of the experimental outcome being so extreme if the hypothesis is true. The probability ratio is 0.5/0.022306 = 22.4. If the data represents the truth, having such an extreme “+” deviation on the statistic is 22.4 times more probable than if the null hypothesis were true. This p-value of 0.044 is marginally less probable than the 95% specification of 0.05. If one chose a 96% level of confidence, the test would accept the hypothesis.

We could have tested the hypothesis by calculating the 95% confidence interval on the population mean about the sample average using the inverse of the z-statistic formula. For this situation, σX̄ = σ/√n = 0.5477256, z = ±1.96 as above, and

$$P\left(\bar{X} + z_{\alpha/2}\sigma_{\bar{X}} < \mu < \bar{X} + z_{1-\alpha/2}\sigma_{\bar{X}}\right) = 1 - \alpha$$

$$P\left(3.1 - 1.96(0.5477256) < \mu < 3.1 + 1.96(0.5477256)\right) = 0.95$$

$$P(2.0264 < \mu < 4.1735) = 0.95$$
Again, the hypothesized population mean μ0 = 2 is not inside the 95% confidence interval (CI) about X̄, so we reject H0 and conclude that the population mean is probably not 2. The result is the same, as you should have expected.

Note: Rounding the several values in this example did not affect the conclusion. As a rule of thumb, when you round values, keep two more decimal digits than those that might have an impact on the result.

Example 6.2: A particular normal population has a variance of 5. Is it reasonable to expect that μ ≤ 10? To answer this question, a 16-member sample was taken from the population and was found to have a mean of 12.46. In this instance, μ0 = 10. For this example, let us select α = 0.02, representing the 98% confidence. The hypothesis is H: μ ≤ μ0 = 10. The corresponding alternate hypothesis is HA: μ > μ0 = 10. This is a one-sided test because only one rejection region is defined by HA. For α = 0.02 as the significance level, we want Z1−α = Z0.98 to define the critical region. The value can be found from the Excel function NORM.INV(0.98, 0, 1) = 2.053748…. The rejection region (critical region) is that with a CDF > 0.98, which is to the right of z = 2.053748….
The confidence interval for testing this hypothesis is one-sided. We will reject H if Z > z1−α. Equivalently, the confidence interval for μ is obtained by substituting Z = (X̄ − μ)/σX̄ into the confidence interval for Z. We will reject H if X̄ − z1−ασX̄ > 10.

From the data, we calculate σX̄ = √(5/16) = 0.559017 and Z = (X̄ − μ0)/σX̄ = 4.40058. We have Zdata = 4.40058 > Zcritical = 2.0538, or equivalently, 12.46 − 1.1481 > 10 for μ. As both Z and X̄ fall within their respective rejection regions, the hypothesis H: μ ≤ 10 is rejected by the Z test as possibly false at the 2% significance level.

But this does not relay whether the rejection was overwhelming or just close. The p-value is found by determining the CDF associated with the test statistic and converting that to the probability of that or a more extreme value: CDF = NORM.DIST(4.40058, 0, 1, 1) = 0.9999946. Since the rejection area is one-sided, the p-value is 1 − 0.9999946 = 5.4 × 10−6. The justification for rejecting the hypothesis is very strong!
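A quick numerical check of Examples 6.1 and 6.2 (a sketch, assuming Python with scipy rather than the Excel functions used in the text):

```python
# Sketch: the two known-variance z-tests of Examples 6.1 and 6.2.
from math import sqrt
from scipy.stats import norm

# Example 6.1: two-sided, H0: mu = 2, sigma^2 = 6, n = 20, xbar = 3.1
z = (3.1 - 2) / sqrt(6 / 20)
print(z, 2 * norm.sf(z))      # 2.0083..., two-sided p = 0.0446...

# Example 6.2: one-sided, H0: mu <= 10, sigma^2 = 5, n = 16, xbar = 12.46
z = (12.46 - 10) / sqrt(5 / 16)
print(z, norm.sf(z))          # 4.40058..., one-sided p ~ 5.4e-06
```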
Note: Neither Example 6.1 nor 6.2 used data values that had dimensional units. Most likely your data will have units, and X, X̄, μ0, and σ will all have the same dimensional units. The Z-statistic will be dimensionless.

Note: Examples 6.1 and 6.2 might represent the common business situation of qualifying a new supplier of a raw material, device, or test procedure, or a new operator. The mean and variance of the hypothesized population would represent the traditional treatment, and the question in Example 6.1 is, “Is the new treatment equivalent?” The data could represent any quality metric (product yield, completion time, manufacturing cost, etc.). Whether one test, based on one hypothesis on one of several important metrics, accepts or rejects, you should also test the other key performance metrics.

To test hypotheses on the population mean μ when the value of σ2 is unknown and estimated by the sample standard deviation, we must use the test statistic

$$T = \frac{\bar{X} - \mu_0}{S_{\bar{X}}} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$

with v = n − 1 degrees of freedom.
Example 6.3: In a quality test, painted surfaces are subjected to intense light and heat to accelerate the rate of appearance of paint defects. The specification requires the mean time to first appearance of a defect to be greater than 10.5 hrs, μ > 10.5 hrs. After the supplier of one component in the paint formulation changed, the company sampled 10 items and the quality test showed an average of 10.21 hrs. The standard deviation of the sample data was 0.74 hrs. The sample average is less than 10.5 hrs, but could the population mean still meet specification? Certainly, some samples will have lower values and some higher. Maybe this set of 10 samples just happened to come out with more of the low values. Can we claim that we are still on-specification?
The supposition is that we are still on-spec. If so, the hypothesis is that the new population mean is greater than 10.5 hrs: H: μ > 10.5 hrs. We'll use the conventional 95% confidence, meaning that α = 0.05. Since the sample provides the standard deviation, we'll use the t-statistic with v = n − 1 = 10 − 1 = 9 degrees of freedom. This is a one-sided test; we only reject the hypothesis if it is improbable that the population mean is above 10.5. The experimental statistic value is

$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \frac{10.21 - 10.5}{0.74/\sqrt{10}} = -1.2392709\ldots$$

The one-sided t-critical value can be found using the Excel T.INV function. If the sample average is too small, then its population mean could not be above the spec. So, we reject the hypothesis at the too-small, left extreme where the CDF = α:

$$t_{\alpha,v} = t_{0.05,9} = \text{T.INV}(0.05, 9) = -1.833112\ldots$$
We see that Tdata = −1.2392709… ≥ tcritical = −1.833112…. Since the experimental t-value is not beyond the 95% confidence limit, we accept the hypothesis that the population mean with the new supplier could still be above 10.5 hrs.

Although there is not enough evidence to reject the hypothesis at a 95% level of confidence, the new average of 10.21 hrs is cause for suspicion. At the 85% confidence level, the critical t-value is −1.0997…, and the 10-sample results would cause the hypothesis to be rejected. Alternately, if increased testing of 20 samples gave the same average and standard deviation, the experimental T-value would be −1.75259…, the t-critical would be −1.72913…, and the hypothesis would be rejected. So perhaps the action should not be to unconditionally approve the new supplier, but to perform more tests. Statistics should support rational decision-making, not be the decision.

The p-value associated with the experimental t = −1.2392709… and v = 9 is T.DIST(t, v, 1) = 0.12329…, indicating only a 12% chance that the sample could have come from a population with a mean above 10.5 hrs. It is possible, but there is only a 12% chance. Perhaps not enough of a possibility to make a blanket acceptance of the new supplier.

Example 6.4: A new vertical elutriator was calibrated upon receipt, before use in cotton dust sampling. The flow rates, in liters per minute, are shown below. Our concern is whether this elutriator complies with the standard flow rate range of 7.2 to 7.6 L/min as specified by 29 CFR 1910.1043.

7.66  7.43  7.32  7.34
7.79  7.55  7.40  7.67
The supposition is that the new device is compliant. This example will explore three statistical tests of hypotheses about the data. First, we'll determine whether the population mean of these calibration data (sample) might be 7.4 L/min, the midpoint of the allowed range.

1. We assume that the population is approximately normal with unknown variance.
2. The hypothesis is H0: μ = μ0 = 7.4 L/min and HA: μ ≠ 7.4 L/min.
3. As σ2 is unknown, T will be the test statistic.
4. Choose α = 0.05 (the 95% confidence). There are two rejection regions, for the extreme large and extreme small values of the possible mean. So, the CDF values of α/2 = 0.025 and 1 − α/2 = 0.975 will define the limits.
5. T is distributed with v = n − 1 = 7 degrees of freedom, and the values of t from the Excel function T.INV(CDF, v) are ±2.3646… for the two-tailed test involved.
6. For the sample data, X̄ = 7.52, SX = 0.173534…, and SX̄ = 0.0613537… (all in L/min); and T is calculated as

$$T = \frac{\bar{X} - \mu_0}{S_{\bar{X}}} = 1.9558\ldots$$
As −2.3646 < T = 1.9558 < 2.3646, we accept H0 as a result of the t-test and conclude with 95% confidence that μ could be 7.4 L/min.

However, the specification was not about the mean of the population being at the midpoint of the range; it was about whether the mean might exceed the range. What we really wanted to know was whether the allowed flow rate limits are likely to be exceeded. We should be confident that μ is not less than 7.2 L/min and that μ is not greater than 7.6 L/min. That range is ±0.2 L/min. For the second test, also statistically legitimate but incompatible with the situation, we'll see if the range for the population mean is less than ±0.2 L/min. The possible range on μ is found from

$$\bar{X} \pm t_{v,1-\alpha/2} S_{\bar{X}} = \bar{X} \pm 2.3646\ldots(0.0613537\ldots) \cong \bar{X} \pm 0.145 \text{ L/min}$$
The estimated 95% tolerance on the population mean is smaller than the number allowed (±0.2 L/min). However, that second claim also does not address the issue. With a possible 95% interval on the population mean being 7.52 ± 0.145 L/min, the upper 95% value is 7.665 L/min, which exceeds the 7.6 L/min limit. This third analysis leads to the conclusion: The elutriator is unacceptable because there is a significant probability that the flow rate exceeds the CFR limits; the upper 95% value on the mean, 7.665 L/min, is beyond the allowed 7.6 L/min.

In this example, we show the potential for the inappropriate use of statistics. We have demonstrated that it is possible to choose a statistical hypothesis that does not correspond to the engineering/business question that must be addressed. You must always be careful to select the proper hypothesis and to use the correct test for the evaluation of that hypothesis. Our primary concern in this situation was whether the elutriator complied with the 7.2 to 7.6 L/min range allowed by OSHA. Since the mean might be as great as 7.665 L/min, the answer must be “no”. The performance specification might not be met. Although the t-test of the mean told us that the elutriator flow rate was “acceptably close” to the tolerance midpoint, in this case that does not matter. The second analysis indicated that the 95% confidence interval on the mean is less than the specified range. Although that is true, it also does not address the question, “Does the probable flow rate of the elutriator match the allowed range?” The CDF-value associated with the mean violating the lower limit is 0.0004…, which indicates there is hardly any chance of that. However, the probability associated with the mean violating the upper limit is 0.18…, which is too large a possibility to accept the hypothesis of compliance.

Appropriate action must be determined. Here are some options: 1) Do more tests. More tests might shift the average so that the limits are not exceeded, so take more samples. 2) Redesign the device so that the average flow rate is a bit smaller. However, here are some other actions: 3) Choose to use the 80% confidence. This will permit acceptance. 4) Claim that the 7.79 sample value is an outlier that can be rejected. This leaves 7 samples and shifts the average lower; now the 7-sample data meets spec. (But those last two actions are just being gamey and tend to violate engineering ethics. Don't shape the test so that the outcome seems to support a desired supposition.)
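The one-sample t computations in Examples 6.3 and 6.4 can be checked the same way; a sketch, again assuming Python with scipy:

```python
# Sketch: the one-sample t-tests of Examples 6.3 and 6.4.
import numpy as np
from scipy import stats

# Example 6.3, from summary statistics: n = 10, xbar = 10.21, s = 0.74, mu0 = 10.5
t = (10.21 - 10.5) / (0.74 / np.sqrt(10))
print(t, stats.t.ppf(0.05, df=9))       # -1.2393 vs. critical -1.8331: accept
print(stats.t.cdf(t, df=9))             # one-sided p-value, ~0.123

# Example 6.4, from the raw elutriator data, two-sided H0: mu = 7.4 L/min
flow = [7.66, 7.43, 7.32, 7.34, 7.79, 7.55, 7.40, 7.67]
print(stats.ttest_1samp(flow, popmean=7.4))   # t ~ 1.956, two-sided p ~ 0.09
```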
6.5 Tests on the Difference of Two Means

Three cases must be considered when comparing two means: the variances may be 1) known, 2) unknown but presumed equal, or 3) unknown with no reason to presume them equal. The most common confidence intervals and corresponding tests are presented below.

6.5.1 Case 1 (σ1² and σ2² Known)

For random samples X1i and X2i from different populations with known variances σ1² and σ2², we formulate H0: μ1 = μ2 + k (or μ1 − μ2 = k) and HA: μ1 − μ2 ≠ k. The appropriate test statistic is

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - k}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \qquad (6.11)$$
where n1 and n2 are the sample sizes. If you are hypothesizing that the two populations that generated the sets of samples that resulted in the two averages, X̄1 and X̄2, have equal means, then k = 0. If you are hypothesizing that the means are equal, H: μ1 = μ2, or that they differ by the amount k, H: μ1 = μ2 + k, then use a two-sided test: reject if the data-based z-value is either less than zα/2 or greater than z1−α/2. If you are hypothesizing that one population has a larger or smaller mean, then use a one-sided test. If H: μ1 > μ2 + k, then reject if the data-based z-value is less than zα. If H: μ1 < μ2 + k, then reject if the data-based z-value is greater than z1−α. Since the z-distribution is symmetric, zα = −z1−α and zα/2 = −z1−α/2.

Example 6.5: In a beginner-level competition, female gymnasts perform two routines on each of four apparatus. At the end of a meet the all-around winners are based on the sum of their 8 scores. At the beginner level, an informal competition level, routines are often judged by two judges on a maximum basis of a score of 10, in increments of 0.1. The gymnast's score for a routine is the average of the two judges' scores. Judging follows rules, but the judges' opinions of elegance or amplitude might be a bit different; and more importantly, at the lower levels, there are so many deductions happening so rapidly that judges miss things. If the judges' scores of a routine differ by more than 1 point, they converse to normalize, and re-score. If the top two all-around scores, the sum of the 8 routine scores, are 56.85 and 55.25, the girl with 56.85 points gets the first-place blue ribbon and the girl with 55.25 points gets the second-place red ribbon. Can we claim that the one gymnast was better than the other at that meet, or is that difference within judging uncertainty (making the appropriate claim that the two girls had equivalent performance)?

If the 1-point allowable difference in judges' scores represents the range of a uniform distribution, then from Chapter 3, the standard deviation of an individual judge's score is 1/√12 = 0.28868 points, and the standard deviation of the routine score, the average of the two judges' scores, is 0.28868/√2 = 0.204124 points. Since the girls have similar scores, and scores from the same judges, we'll assume that this is a case of equal variances, known. Propagating uncertainty of a sum, the 8-score total has a standard deviation of 0.204124·√8 = 0.57735 points.
The Z-statistic value from the data is

$$Z = \frac{56.85 - 55.25}{0.57735\sqrt{1/1 + 1/1}} = 1.959592$$

For the hypothesis that the two scores are equal, at the 95% confidence level the two-sided critical value of z is ±1.95996, as illustrated in the sketch. Since the experimental z-value is within that range, we accept, at the 95% confidence, that the two scores come from the same population: the girls' performances were equivalent. If this is the case, then assignment of the blue and red ribbons might not be based on the girls' performance, but on the vagaries of sampling. Note: the data z-value is very near the reject value. Although the statement is “accept the null hypothesis,” the data is nearly at the reject value.
Alternately, one could ask if the 56.85 score represented a population of scores (the truth about one girl's performance) that was better than the other. For the hypothesis that the 56.85 represents a population with a higher mean than the 55.25, at a 95% confidence level, the one-sided z-critical is −1.64485, as illustrated in the sketch. Since the actual z-score of 1.959592 is not beyond that value, we can accept the hypothesis that the one girl outscored the other.
This may seem to be a contradiction. We have accepted both μ1 = μ2 and μ1 > μ2! But realize that accepting the μ1 = μ2 hypothesis does not mean it is true; it might be, but there is not enough evidence to reject it. Similarly, accepting the μ1 > μ2 hypothesis does not mean it is true; it might be, but there is not enough evidence to reject it. The degree to which the accept result is made can be represented by the p-values, the probability of the data z being that extreme or more. For H: μ1 = μ2 the p-value is 0.0500435, which is nearly at the reject level, and the probability ratio is 19.98, also a very suspicious value should the null hypothesis be true. For H: μ1 > μ2 the p-value is 0.025021, also nearly at the reject level. So, although the statistical decision was to accept the greater-than hypothesis, it is nearly rejectable. Alternately, one could hypothesize that the lower-scoring girl was actually the better, and appears worse because of the vagaries of judging: H: μ1 < μ2. At the 95% level the
z-critical is +1.644854. Since the experimental z-value of +1.959592 is beyond that value, we can reject the hypothesis. But, again, the action does not mean that the hypothesis is not true. It simply means there is less than a 5% chance of it being true.
So, we accept both H: μ1 = μ2 and H: μ1 > μ2. What action to take? If we choose action based on H: μ1 = μ2, we might give two first-place awards and elevate the third-highest gymnast to second place. I think this best represents the truth of the situation, but it is not practical. Where would we find gymnastic meet organizers who are also competent in statistical analysis? And think of how the youth will perceive adults who are admitting that judges and referees do not own the truth. It would create a chaos of parenting challenges! It would diminish the incentive to practice harder. So, the practical action is to pretend that although a score may not match what the gymnast actually did, it is what happened at the meet when judges are included in the outcome. And then, a 55.31 even beats a 55.30! Again, statistical analysis is not the decision of what to implement. Use it to guide and support action, but ground the action in all aspects of the situation.
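The Case 1 arithmetic of Example 6.5 in script form (a sketch, assuming scipy; the sigma value is the propagated judging uncertainty derived in the example):

```python
# Sketch: known-variance comparison of the two all-around scores (Example 6.5).
from math import sqrt
from scipy.stats import norm

x1, x2 = 56.85, 55.25      # the two all-around totals
sigma = 0.57735            # sigma of one total, from the uniform-judging argument
n1 = n2 = 1                # one observed total per gymnast

z = (x1 - x2) / (sigma * sqrt(1 / n1 + 1 / n2))
print(z)                   # 1.9596...
print(2 * norm.sf(z))      # two-sided p ~ 0.0500 (hypothesis of equal means)
print(norm.sf(z))          # one-sided p ~ 0.0250
```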
6.5.2 Case 2 (σ1² and σ2² Both Unknown but Presumed Equal)

In this situation, the sample variances are pooled. The result is a pooled estimate of the sample variance, SP², where

$$S_P^2 = \frac{\sum_{i=1}^{n_1}\left(X_{1i} - \bar{X}_1\right)^2 + \sum_{i=1}^{n_2}\left(X_{2i} - \bar{X}_2\right)^2}{n_1 + n_2 - 2} = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2} \qquad (6.12)$$
The corresponding test statistic is

$$T = \frac{(\bar{X}_1 - \bar{X}_2) - k}{S_P\sqrt{1/n_1 + 1/n_2}} \qquad (6.13)$$
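A minimal sketch of Equations (6.12) and (6.13), assuming Python with scipy and hypothetical data, cross-checked against scipy's built-in equal-variance test for the k = 0 case:

```python
# Sketch: pooled-variance (Case 2) t-statistic of Equations (6.12) and (6.13).
import numpy as np
from scipy import stats

def pooled_t(x1, x2, k=0.0):
    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * np.var(x1, ddof=1)
           + (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)            # Eq. (6.12)
    t = (np.mean(x1) - np.mean(x2) - k) / np.sqrt(sp2 * (1/n1 + 1/n2))  # Eq. (6.13)
    return t, n1 + n2 - 2                                               # statistic, dof

x1 = [10.2, 11.1, 9.8, 10.7, 10.4]      # hypothetical samples
x2 = [9.6, 10.1, 9.2, 10.0, 9.5]
print(pooled_t(x1, x2))
print(stats.ttest_ind(x1, x2, equal_var=True))   # same t when k = 0
```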
The statistic is described by the t distribution with v = n1 + n2 − 2 degrees of freedom. Again, with the equality condition μ1 − μ2 = k, if you are hypothesizing that the two populations that generated the sets of samples that resulted in the two averages, X̄1 and X̄2, have equal means, then k = 0. If you are hypothesizing that the means are equal, H: μ1 = μ2, or that they differ by the amount k, H: μ1 = μ2 + k, then use a two-sided test: reject if the data-based t-value is either less than tv,α/2 or greater than tv,1−α/2. If you are hypothesizing
that one population has a larger or smaller mean, then use a one-sided test. If H: μ1 > μ2 + k, then reject if the data-based t-value is less than tv,α. If H: μ1 < μ2 + k, then reject if the data-based t-value is greater than tv,1−α. Since the t-distribution is symmetric, tv,α = −tv,1−α and tv,α/2 = −tv,1−α/2.

Example 6.6: A certain heat exchanger that had been performing poorly was taken out of service and cleaned thoroughly. In order to test the effectiveness of the cleaning, measurements were made before and after to determine the heat-transfer coefficient. The results from ten use-then-clean cycles, in Btu/hr ft² °F, were as follows:

Run no.   Before   After
1         90.5     93.4
2         87.6     90.4
3         91.3     99.6
4         93.2     93.7
5         85.7     89.6
6         89.3     88.1
7         92.4     96.7
8         95.3     94.2
9         90.1     98.6
10        83.2     91.1
Did the cleaning of the heat exchanger significantly improve the heat-transfer coefficient? Note: It was not necessary to have equal numbers of observations for the “before” and “after” data.
1. Assume normal populations (before cleaning = 1, after cleaning = 2).
2. The hypothesis is that cleaning increased the heat transfer performance: H: μ1 − μ2 < 0 vs HA: μ1 − μ2 ≥ 0. For this test, k = 0.
3. The test statistic is T as given by Equation (6.13) with v = 10 + 10 − 2 = 18.
4. Choose α = 0.05, the conventional value.
5. The critical value of T is t18,0.95 = +1.73406. We will reject the hypothesis if the data T is greater than this value.
6. The sample averages are 89.86 and 93.54 and the standard deviations are 3.6028 and 3.8546 (all in Btu/hr ft² °F). The similarity of the standard deviations supports the Case 2 condition (variances are presumed equal).
7. Then SP = 3.73086 and T = −2.20558.
8. The thumbnail sketch illustrates the CDF of t and the critical value. As T < t18,0.95, T is not in the rejection region. The hypothesis (μ1 − μ2 < 0) is thus accepted as possibly true with 95% confidence.
The heat-transfer coefficient may have been improved by cleaning.

Alternately, the hypothesis could have been that the cleaning did not affect the heat transfer coefficient, H: μ1 = μ2. In this case, at the 95% confidence limits, the two critical t-values, tv=18,α/2=0.025 and tv=18,1−α/2=0.975, are ±2.10092. Since the T-value from the data, −2.20558, exceeds one limit, we reject the hypothesis that the before and after performances are equal. However, the data T-value is almost at the critical value, so you really should acknowledge that the no-cleaning-effect hypothesis was barely rejected, nearly accepted. The p-value for getting a value as extreme as ±2.20558 is 0.040653, very nearly the level of significance associated with the 95% confidence. If a 96.1% confidence had been chosen, just a bit more desired surety, the null hypothesis would have been accepted. Although the after-cleaning average heat transfer coefficient is better than before cleaning, the improvement is not statistically overwhelming. Moreover, what might be more important is the cost impact that cleaning has on production, not whether a test outcome is statistically significant or not.

Example 6.7: Supplier B is being compared to the currently used Supplier A. Since all product quality aspects are identical, the comparison metric will be production utility cost in $k/month. There are costs associated with the proposed change from A to B, which include product testing, operator training, records revision, and risk associated with the unexpected from something new. The company wants a two-year payback on the costs of any investment in manufacturing, which, in this A-to-B change, translates to an $8k/month necessary reduction in production cost. Employee X has been bragging about Supplier B and claims the switch is worth the costs. Employee Y says there might be some benefit of B, but the cost to change over is not worth the utility cost savings. So that you know, the true mean for A is $65k/month and for B is $55k/month, and the sigma for each is $3k/month. So, the savings from a switch from A to B ($10k/month) exceeds the $8k/month threshold. But the true mean is unknowable, so the company performs experiments to compare B to A. Here is the data:

Supplier A, $k/month: 64.39199, 64.04099, 65.04488, 64.30435, 68.34815, 66.03006, 69.88677, 66.42114, 67.22042, 66.27281
Supplier B, $k/month: 55.32395, 59.40266, 54.36507, 58.65597, 57.34935
The sample average, standard deviation, and count for the two treatments are
                               Supplier A    Supplier B
Average, $k/month              66.19616      57.0194
Standard deviation, $k/month   1.891179      2.144015
n                              10            5
The hypothesis is to switch if the benefit is greater than k = $8k/month. Employee X, who favors the switch, tests the hypothesis H: μB < μA − k (the utility cost of using B is lower), but Employee Y, who opposes the switch, tests the hypothesis H: μB > μA − k. For either test, the standard deviations appear to be equivalent, indicating a Case 2 t-test. The t-statistic for either hypothesis uses the same numerator, X̄B − (X̄A − k), the same degrees of freedom, v = 10 + 5 − 2 = 13, and the same pooled standard deviation of 1.97243 $k/month. The t-statistic is 1.089242.

Employee X uses the 99% confidence (to be very sure of the conclusion), which would reject H: μB < μA − k if the t-value were greater than the critical value of 2.65031. Since the experimental t of 1.089242 is in the acceptance region, Employee X triumphantly accepts the hypothesis and claims with 99% confidence, “We should switch to Supplier B. Told ya!” Employee Y, not to be outdone, uses the 99.9% confidence, which would reject H: μB > μA − k if the t-value were less than the critical value of −3.85198. Since the experimental t of 1.089242 is in the acceptance region, Employee Y triumphantly accepts the hypothesis and claims with 99.9% confidence, emphasizing his triumph with a drop-the-mic gesture, “Keep using Supplier A.”

Of course, the statistical term for “not reject” is “accept”, but “accept” does not mean the hypothesis is true. “Accept” just means that there is not enough information to confidently reject it. And greater confidence in the statistical test does not mean greater surety of the truth of the hypothesis; it means greater surety when rejecting it. Employees X and Y are both misusing statistics. Since this is a standard economic decision, Employees X and Y should be using the 95% interval, which has the one-sided t-critical of ±1.770933. Since the experimental t of 1.089242 is within the acceptance region, there is not enough evidence to confidently reject either hypothesis. Either hypothesis may be true. However, the p-value for H: μB < μA − k is 0.852, and the p-value for H: μB > μA − k is 0.148, suggesting that rejecting H: μB > μA − k may be more probable than rejecting H: μB < μA − k. Although not statistically definitive, Supplier B seems to be the better choice. If one wants to be more certain, do more trials. After more trials, a total of 15 trials each, the sample average, standard deviation, and count for the two treatments are:
                               Supplier A    Supplier B
Average, $k/month              65.35417      55.44148
Standard deviation, $k/month   2.609777      2.364961
n                              15            15
The Case 2 t-value is 2.103342, which exceeds the 95% critical value t28,0.95 = +1.701131 of X's hypothesis, H: μB < μA − k, and does not exceed the 95% critical value t28,0.05 = −1.701131 of Y's hypothesis, H: μB > μA − k. So, the hypothesis that Supplier B does not have enough economic advantage is confidently rejected. The p-value for the rejection was 0.0223: there was only a 2.23% chance that it was an erroneous decision. Accept the switch to B.
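Example 6.7 works entirely from summary statistics; a sketch of that calculation (assuming scipy), with the $8k/month offset folded into the numerator as in the text:

```python
# Sketch: Case 2 t-statistic from summary statistics (Example 6.7, first data set).
import numpy as np
from scipy import stats

mA, sA, nA = 66.19616, 1.891179, 10   # Supplier A: mean, std dev, count
mB, sB, nB = 57.0194, 2.144015, 5     # Supplier B
k = 8.0                               # required $k/month advantage

v = nA + nB - 2
sp = np.sqrt(((nA - 1) * sA**2 + (nB - 1) * sB**2) / v)
t = (mB - (mA - k)) / (sp * np.sqrt(1 / nA + 1 / nB))
print(t, v)                            # |t| ~ 1.0892, v = 13
print(stats.t.ppf(0.95, df=v))         # one-sided 95% critical value ~ 1.7709
```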
As you might suppose, the Case 2 t-test is the most general case. It is commonly used in the evaluation of the effects of maintenance, production improvements, operating changes, and raw material/parts source changes.

6.5.3 Case 3 (σ1² and σ2² Both Unknown and Presumed Unequal)

This case is often used when major changes in manufacturing methods have been made. A new type of raw material, replacement of a major piece of processing equipment with a
unit from a different source, and other factors all lead to the situation in which you have no reason to expect the unknown variances to be equal. This situation requires the use of a modified statistic, such as Satterthwaite's statistic Tf, which is approximately distributed as Student's t. The test statistic is calculated like Z in Case 1, as if S1² = σ1² and S2² = σ2². Use

$$T_f = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{S_1^2/n_1 + S_2^2/n_2}} \qquad (6.14)$$

to test hypotheses involving μ1 − μ2. As usual, n1 and n2 are the sample sizes and S1² and S2² the variances of the two samples involved. In Satterthwaite's method, the degrees of freedom cannot be calculated exactly but are approximated by

$$f = \frac{\left(S_1^2/n_1 + S_2^2/n_2\right)^2}{\left(S_1^2/n_1\right)^2/(n_1 - 1) + \left(S_2^2/n_2\right)^2/(n_2 - 1)} \qquad (6.15)$$
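A minimal sketch of Equations (6.14) and (6.15), assuming scipy and hypothetical data, cross-checked against scipy's unequal-variance test option (which computes the same statistic internally):

```python
# Sketch: Satterthwaite statistic (6.14) and approximate dof (6.15).
import numpy as np
from scipy import stats

def satterthwaite(x1, x2):
    n1, n2 = len(x1), len(x2)
    v1 = np.var(x1, ddof=1) / n1
    v2 = np.var(x2, ddof=1) / n2
    t = (np.mean(x1) - np.mean(x2)) / np.sqrt(v1 + v2)        # Eq. (6.14), mu1 - mu2 = 0
    f = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))  # Eq. (6.15)
    return t, f

x1 = [3.45, 3.64, 3.57, 3.62]           # hypothetical samples
x2 = [3.72, 4.03, 3.60, 4.01, 3.40]
print(satterthwaite(x1, x2))
print(stats.ttest_ind(x1, x2, equal_var=False))   # same t-statistic
```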
It seems that this is also termed Welch's method or the Welch–Satterthwaite method.

Example 6.8: In the manufacture of a synthetic fiber, the polymer material, still in the form of continuous monofilaments, is subjected to high temperatures under tension to improve its shrinkage properties. The shrinkage test results for fibers from the same source, treated at two different temperatures, are given below. Is the shrinkage after treatment at 140°C less than that after 120°C?

Percent Shrinkage
140°C: 3.45, 3.64, 3.57, 3.62, 3.56, 3.44, 3.60, 3.56, 3.49, 3.53, 3.43
120°C: 3.72, 4.03, 3.60, 4.01, 3.40, 3.76, 3.54, 3.96, 3.91, 3.67
1. Assume normal populations (1 = 140°C, 2 = 120°C) with unequal variances.
2. H: μ1 − μ2 < 0 vs HA: μ1 − μ2 ≥ 0.
3. Use Tf = −3.1553 and f = 10.93 ≅ 11 from Equations (6.14) and (6.15), respectively. Note: It is probably more conservative when calculating f to round up to the next higher integer, thus decreasing the acceptance region.
4. Choose α = 0.01.
5. The critical value of Tf is approximately t11,0.99 = 2.7181.
6. From the data, the means are X̄1 = 3.5355 and X̄2 = 3.760. The variances are S1² = 0.005427 and S2² = 0.045689. From these values, the calculated value of Tf is −3.1553.
7. As Tf = −3.1553 < t11,0.99 = 2.7181, we will accept (with 99% confidence) the hypothesis as possibly true based on the t-test.
We now believe that the shrinkage after treatment at 140°C is probably less than that after treatment at 120°C.

Example 6.9: The percent conversion data below were obtained with a spacecraft (zero-gravity) reactor for contaminant control. Two different catalysts (MnO2 and CuO) were used for the oxidation of organic materials.

MnO2: 55, 62, 64, 63, 58, 61, 60, 62, 64
CuO:  50, 57, 52, 55, 57, 54, 56, 51, 55
Because MnO2 is more expensive than CuO, it will be selected only if its efficiency is clearly superior to that of CuO. It has been decided that superiority can be adequately demonstrated if the conversion when using MnO2 is at least 4% higher than that attainable with CuO. A significance level of 0.01 is required. Should MnO2 be specified for the catalytic oxidizers?
1. Assume that the populations (1 = MnO2, 2 = CuO) are approximately normally distributed.
2. H0: μ1 − μ2 ≥ 4 vs. HA: μ1 − μ2 < 4.
3. The test statistic will be t with 9 + 9 − 2 = 16 degrees of freedom.
4. α = 0.01.
5. The critical region for t must be determined after the decision regarding Case 2 (equal variances) or Case 3 (unequal variances) is reached.
6. Fcalc = S1²/S2² = 8.75/6.6111 = 1.3235 is within the 99% CI described by F8,8; so, Case 2 (equal variances) will be used for the t-test.
7. The critical value is −t16,0.99 = −2.5835.
8. From Equation (6.13), T = ((61 − 54.111) − 4)/1.306441779 = 2.2113.

As T ≮ −t16,0.99, the calculated T value is in the acceptance region for the one-tailed t-test. MnO2 should be recommended as the catalyst with 99% confidence. The conversion with MnO2 is probably at least 4% higher than that when CuO is used.
6.5.4 An Interpretation of the Comparison of Means – A One-Sided Test

Consider two normally distributed variables as illustrated in Figure 6.10. Distribution A, on the left, represents a population with a mean of 2 and a sigma of 1. The other, distribution B, has a mean of 3 and a sigma of 2. It is possible for both to generate values of about 1 (within the region included between the vertical lines). Distribution B, with the higher mean, generally generates values higher than A, the one with the lower mean; but it also has a higher probability of generating values lower than the population with the lower mean. If you sample one time from each, you could get x = 3 from A and x = 1 from B, and mistake which is greater on average. The question is: “If you sample a value from each, what is the probability that one is greater than the other?” The solid vertical lines indicate an interval of Δx about the center of x = 1. In general, call the bin center c; here x = c = 1. The dashed vertical line indicates the midpoint of the interval.
FIGURE 6.10 Interpretation of the one-sided z comparison.
The probability of a sample from population A falling in the x = c ± Δx/2 interval, $p_A\left(c - \frac{\Delta x}{2} < x \le c + \frac{\Delta x}{2}\right)$, is the area under the pdf curve. This can be estimated in any number of ways. One simple approach is to use the rectangle rule with the midpoint pdf value, $A = \Delta x\,\mathrm{pdf}_A(c)$; a more accurate approach would be to use the difference in CDF values, $A = \mathrm{CDF}_A\left(c + \frac{\Delta x}{2}\right) - \mathrm{CDF}_A\left(c - \frac{\Delta x}{2}\right)$. The area under the pdfB(x) curve to the right of that x-value is the probability that population B could have generated a greater value, $p_B(x > c)$. This can easily be evaluated as $1 - \mathrm{CDF}_B(c)$.

The probability of 1) sampling from distribution A in the Δx interval AND 2) getting a greater sample value than the interval midpoint from B is the product of the two probabilities: $p_A\left(c - \frac{\Delta x}{2} < x \le c + \frac{\Delta x}{2}\right) p_B(x > c) = \Delta x\,\mathrm{pdf}_A(c)\left[1 - \mathrm{CDF}_B(c)\right]$.

The probability of population B giving a higher value than A for any possible c-value is the probability for c = 1 OR c = 2 OR c = 3 OR …. It is the sum of all probabilities for all possible x-values.
$$P(x_B > x_A) = \sum_{x=-4,\ \text{in } \Delta x \text{ increments}}^{12} \Delta x\,\mathrm{pdf}_A(x)\left[1 - \mathrm{CDF}_B(x)\right] \qquad (6.16a)$$

$$P(x_B > x_A) = \int_{-\infty}^{+\infty} \mathrm{pdf}_A(x)\left[1 - \mathrm{CDF}_B(x)\right]\,dx \qquad (6.16b)$$
In truth, the sum in Equation (6.16a) should go from x = −∞ to x = +∞, but as Figure 6.10 indicates, values are effectively zero beyond the −4 to +12 range. In the limit, as Δx → 0, the rectangle rule of integration becomes the integral of Equation (6.16b).
Over a wide range of mean and sigma values for populations A and B, Equation (6.16a) gives the same probability value as that from the Section 6.5 Case 1 method (σA² and σB² known) with one sample from each, calculating

$$Z = \frac{\mu_B - \mu_A}{\sqrt{\sigma_A^2 + \sigma_B^2}}$$

and making a one-sided comparison.
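The claimed equivalence is easy to verify numerically; a sketch, assuming scipy, for the two populations of Figure 6.10:

```python
# Sketch: Equation (6.16b) versus the one-sided z probability for Figure 6.10.
from math import sqrt
from scipy.stats import norm
from scipy.integrate import quad

muA, sigA = 2.0, 1.0          # population A
muB, sigB = 3.0, 2.0          # population B

# Integral of pdf_A(x) * [1 - CDF_B(x)] dx, Equation (6.16b)
integrand = lambda x: norm.pdf(x, muA, sigA) * (1.0 - norm.cdf(x, muB, sigB))
p_int, _ = quad(integrand, -10, 15)      # effectively -inf to +inf

# One-sided z comparison with one sample from each population
p_z = norm.cdf((muB - muA) / sqrt(sigA**2 + sigB**2))

print(p_int, p_z)             # both ~ 0.673
```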
6.6 Paired t-Test

When the data occur in ordered pairs (X1i, X2i), i = 1, …, n (with the same n for each treatment), and we wish to test the difference of the means by use of H0: μ1 − μ2 = 0, we can use the equivalent hypothesis H0: μD = 0, where Di ≡ X1i − X2i, i = 1, …, n. The alternate hypothesis HA: μ1 − μ2 ≠ 0 becomes HA: μD ≠ 0. This sort of paired observation happens when there is no presumption that all of the measurements are from the same population, and when n is the same for both samples. For instance, in testing Fertilizer 1 and Fertilizer 2, we may have adjacent plots of land across the country in differing climates and soil compositions; crop productivity will be affected by the soil and climate as well as the fertilizer. In another example, male Gymnasts A and B perform routines on parallel bars, rings, pommel horse, high bar, etc. Each apparatus has its own attributes requiring different abilities (balance, speed, strength, flexibility, etc.). The apparatus is the same for each gymnast, but the compatibility of the boys' skills is not identical on every apparatus. Again, individual scores do not all come from the same population. As a final example, you may have two designs for a product, and ask possible customers to rate each design. Some people will favor sleek over strong, blue over green, or utility over aesthetics; so each person's rating of any design will not come from the same population. In these cases, the average of all the scores might remain a best indication of the difference, but the standard deviation of all scores will not represent a single population. Accordingly, we'll use the paired differences as the data. The average of the differences will be the same as the difference of the averages, but the standard deviation of the differences will represent the population variation.

Here is a diversion example. When my grandkids come to visit, we like to measure their heights on the measuring stick. I hold the stick against the wall, the kid backs up to it, and I use a pencil, level on the top of their head, to mark their height on the stick. Perhaps my measurement technique has an error range of about 1/8 inch (″). Once, their heights were 52″, 49″, 48″, 40″, 35″, 33″, and 27″. The standard deviation of the heights is 9.4″, and if the heights all came from the same population the 95% range would be about 30″. But none of my measurements has an error so large. The grandkids' heights are not from the same population. In calculating a standard deviation, the data must come from the same population.

In paired data, it is not presumed that all values are from the same population, but it is assumed that the differences reflect the treatment effects. We assume that the differences represent a consistent population. Testing a hypothesis about the differences follows the same rules as for testing data against an individual mean. Use a t-statistic on the average difference and the sample standard deviation of the differences,
$$T = \frac{\bar{D} - \mu_D}{S_D/\sqrt{n}} \quad \text{with} \quad S_D^2 = \frac{\sum_{i=1}^{n}\left(D_i - \bar{D}\right)^2}{n - 1} \quad \text{and} \quad v = n - 1 \qquad (6.17)$$
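A minimal sketch of Equation (6.17), cross-checked against scipy's paired test (hypothetical data):

```python
# Sketch: paired t-statistic of Equation (6.17).
import numpy as np
from scipy import stats

x1 = np.array([11.5, 13.4, 14.0, 13.6, 11.6])   # hypothetical paired data
x2 = np.array([10.8, 10.8, 12.5, 12.1, 12.1])
d = x1 - x2                                     # D_i = X_1i - X_2i

t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))   # mu_D = 0 under H0
print(t, len(d) - 1)                               # statistic and v = n - 1
print(stats.ttest_rel(x1, x2))                     # same t, two-sided p-value
```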
Example 6.10: Reconsider Example 6.5. In a competition, female gymnasts perform two routines (a compulsory, C, and an optional, O) on each of four apparatus (uneven bars, balance beam, vault, and floor exercise). At the end of a meet the all-around winners are based on the sum of their 8 scores. If the top two all-around scores are 56.95 and 55.25, the girl with 56.95 points gets the first-place blue ribbon and the girl with 55.25 points gets the second-place red ribbon. Can we claim that the one gymnast performed better than the other, or is that difference within judging uncertainty? Here is the data. Gymnast A has the slightly higher average score. The hypothesis is that μA > μB.
           Gymnast A   Gymnast B
Bars C     7.45        7.05
Bars O     5.7         5.85
Beam C     7.85        7.15
Beam O     6.95        6.65
Vault C    6.65        6.8
Vault O    7.5         6.9
Floor C    7.95        7.7
Floor O    6.9         7.15
These are paired differences: the beam especially requires balance, the bars require strength, and the vault goes very fast. The events are very different, so the scores for any one gymnast do not come from the same population. So, use paired differences. The difference is defined as d = XA − XB; the average difference is 0.2125, the standard deviation of the differences is 0.36037, n = 8, and v = 7. The data T-value is 1.668133. At a 95% confidence, the t-critical is −1.89458. We reject the hypothesis that A beats B if the data T is below that critical value. Since it is not, we accept that A is better than B.

Note: If the question was, “Were the two gymnasts equivalent?”, the two-sided t-critical is ±2.36462, and the experimental T of 1.668133 is not outside of either limit. So, we accept the hypothesis that the population of differences could be zero; the girls could be equivalent, and the apparent difference may be because of judging vagaries. In this case we would accept either hypothesis, as well as the hypothesis that B could have been the better gymnast. Accordingly, it is useful to report p-values. If the two gymnasts are presumed equivalent, the p-value for a T-value of 1.668133 is 0.13922, and the p-value for A being the better gymnast is 0.06961. The first is marginally acceptable. The second is near the classic rejection criterion.

Example 6.11: A synthetic lipid was applied at a uniform rate to soil samples to determine its effectiveness in reducing evaporative water losses. All the soil samples were made up from the same batch of well-mixed materials. Twelve soil samples were available. Half were sprayed with a wetting agent prior to application of the lipid; the other 6 samples were not. The results obtained, in grams of water lost per square decimeter per minute for a particular set of temperature and humidity conditions, were as follows:
Sample                        1     2     3     4     5     6
Lipid (X1i)                   11.5  13.4  14.0  13.6  11.6  14.6
Wetting agent + lipid (X2i)   10.8  10.8  12.5  12.1  12.1  13.5
Difference (Di)               0.7   2.6   1.5   1.5   −0.5  1.1
Did the inclusion of the wetting agent significantly affect the water loss rate?
1. Assume that the population of differences is approximately normally distributed.
2. H0: μD = 0 vs HA: μD ≠ 0.
3. The test statistic is T with 6 − 1 = 5 degrees of freedom.
4. We select α = 0.02 as we are dealing with samples not totally within our control and want a strong confidence should we reject the hypothesis. (Since a significant stakeholder wants them to be equivalent, you need strong evidence to reject the hypothesis.)
5. The critical values of T occur at ±t5,0.99 = ±3.3649.
6. The values of D̄ and SD/√n are 1.15 g/dm² min (which seems to indicate a difference of about 10% of the nominal values) and 0.419325 g/dm² min, respectively. The resulting value of the experimental T is 2.7425.
7. As T is within the 98% CI for the t-test, we accept H0: μD = 0 and conclude:
The addition of the wetting agent was probably ineffective.

Let's say that you work for the supplier of the wetting agent, and you want to show that it probably is effective. You may think that all you would have to do is change the significance level to α = 0.05, so that the critical value of t for the two-tailed null hypothesis is 2.571. Then the calculated value of T would be in the rejection region, and you could conclude that the wetting agent probably does make a difference for the better. To guard against possibly unethical behavior, you should set the significance level first, independent of the test outcome you might desire, not after you see how the choice could influence the outcome. Alternately, to avoid an unqualified accept/reject conclusion, which could be a distortion of the reality for an audience, include a p-value in the statement. In this case, the p-value associated with the null hypothesis is 2(1 − T.DIST(2.7425, 5, 1)) = 0.04067…. There is only a 4% chance that the experimental difference between the two treatments could be so great if the means were the same. The probability ratio is 24.6, again providing high odds that the two treatments are not the same. Although not absolutely definitive, the data provides strong evidence that the treatments are different.

Example 6.12: Two gaskets were cut from each of eight different production runs of a common gasket sheet stock material. One gasket of each pair was randomly selected for use in dilute HCl service. The other gasket of each pair was for concentrated HCl service. All gaskets were subjected to accelerated life tests for their respective intended uses. From the data so obtained, the estimated values of the average service life in weeks are given below:

Run no.   Dilute HCl   Concentrated HCl
1         35           30
2         40           32
3         27           28
4         25           27
5         36           33
6         48           38
7         53           41
8         48           39
Use the data to test the hypothesis that the service life of the gaskets is independent of the HCl acid strength.
Use a paired t-test. From the data, D̄ = 5.5 and SD/√n = 1.82247869 (both in weeks), and T = 3.02. As t7,0.995 = 3.4995 for a two-sided test with α = 0.01, we do not reject H0: μD = 0 at the 99% confidence level, and conclude: Acid concentration may not affect the service life. However, a different conclusion would be reached using a 95% confidence: at α = 0.05, the t-critical values are ±2.3646, which is exceeded by the data T-value of +3.02. So, at the 95% confidence level (pretty sure, but not really-really sure) we would claim: Acid concentration may actually affect the expected life. The p-value for rejecting the null hypothesis is 0.019386, which should be reported to help a reader understand the test conclusion. There is less than a 2% chance that the average difference could be so large if the means were identical. The probability ratio is nearly 52. Very strong odds. It seems that there is fairly strong evidence to claim that acid concentration might affect gasket life.
Note: If all of the samples do come from the same population, and treatments A and B are being compared with nA = nB, then either the paired t-test on the data differences or the normal t-test could be used. However, the paired t-test has ν = nA − 1 = nB − 1 degrees of freedom, while the normal t-test has ν = nA + nB − 2. The numerator value in the two t-statistics is the same, since D̄ = X̄B − X̄A. Although the variance of the difference in the paired test is larger than the variance of either individual average, with nA = nB the denominator values of the t-statistics will be nearly the same, so the t-value for the paired test will be nearly the same as the t-value for the unpaired test. However, since the degrees of freedom for the unpaired test are larger, it will be more sensitive than the paired test. Only use the paired test if the paired samples come from populations that cannot be presumed to have the same mean.
6.7 Tests on a Single Variance

Any one of several hypotheses may be used to test a single population variance. These tests may be formulated in terms of the confidence interval on chi-squared, Equation (6.18), or the confidence interval on the unknown variance σ² of the population, Equation (6.19). These equations are for the null hypothesis H0: σ² = σ0² vs. HA: σ² ≠ σ0²:
P( χ²_{n−1, α/2} ≤ χ² ≤ χ²_{n−1, 1−α/2} ) = 1 − α  (6.18)
and

P( (n − 1)S²/χ²_{n−1, 1−α/2} ≤ σ² ≤ (n − 1)S²/χ²_{n−1, α/2} ) = 1 − α  (6.19)

The only restriction is that the sample was drawn from a normal population. In that case, χ² calculated by (n − 1)S²/σ0² has a chi-squared distribution with ν = n − 1 degrees of freedom.
χ² = (n − 1)S²/σ0²  (6.20)
Note: Although the population variance will likely have dimensional units, χ2 is dimensionless.
Note: χ² is a ratio of variances with one known; in contrast, the F-statistic is a ratio of variances with neither known.

The appropriate acceptance and rejection regions for testing hypotheses about a single variance are shown in Table 6.1.

TABLE 6.1 Testing Hypotheses Involving a Single Variance
Statistic: χ² = (n − 1)S²/σ0², ν = n − 1

Null hypothesis    Alternative hypothesis    Rejection region
H0: σ² = σ0²       HA: σ² ≠ σ0²              χ² > χ²_{n−1, 1−α/2} or χ² < χ²_{n−1, α/2}
H0: σ² ≤ σ0²       HA: σ² > σ0²              χ² > χ²_{n−1, 1−α}
H0: σ² ≥ σ0²       HA: σ² < σ0²              χ² < χ²_{n−1, α}
H0: σ² ≠ σ0²       HA: σ² = σ0²              χ²_{n−1, α/2} ≤ χ² ≤ χ²_{n−1, 1−α/2}
H0: σ² < σ0²       HA: σ² ≥ σ0²              χ² ≥ χ²_{n−1, 1−α}
H0: σ² > σ0²       HA: σ² ≤ σ0²              χ² ≤ χ²_{n−1, α}
Example 6.13: The long-term average variance of the heat transfer coefficient for the drying of plywood is 0.2 (Btu/hr ft² °F)². Recent quality control data indicate that the plywood is not bonding properly in the kiln. Has the variance of the heat transfer coefficient changed significantly? A larger variance would indicate more drying variability and poorer bonding in the undried sheets. Test values are below in Btu/hr ft² °F.

4.68 4.73 4.65 4.69
4.52 4.79 4.70 4.57
4.70 4.67 4.63 4.58
1. Assume the population is normally distributed.
2. H0: σ² = 0.2 vs HA: σ² ≠ 0.2.
3. The test statistic is χ² with 12 − 1 = 11 degrees of freedom.
4. Choose α = 0.05.
5. The critical regions are χ² > χ²_{11, 0.975} = 21.9200 and χ² < χ²_{11, 0.025} = 3.8157. In Excel these are calculated by the cell function CHISQ.INV(CDF, ν).
6. From the data, S² = 0.074767 and χ² = 4.112.
7. The thumbnail sketch is the distribution for χ²/ν, for which the critical values are 21.9200/11 = 1.9927… and 3.8157/11 = 0.34688….
8. As the calculated value of χ²/ν = 4.112/11 = 0.3738… is within the 95% confidence interval, we accept H0: σ² = 0.2 and conclude as a result of the χ² test:
We accept the null hypothesis that the heat transfer variance has not changed, at the 95% confidence level. The heat transfer variance may not be the source of poor plywood bonding.

However, notice that the calculated χ² = 4.112 is very close to the lower critical value of χ²_{11, 0.025} = 3.8157, suggesting that the heat transfer variance may have actually decreased. The CDF value of χ² = 4.112 is 0.033475. In Excel this is calculated by the cell function CHISQ.DIST(χ², ν, 1). The p-value associated with the null hypothesis H0: σ² = 0.2 is twice 0.033475 = 0.06695 (because it includes the equal probability of both extremes). Since 0.06695 is near the level of significance for the test, 0.05, one should report more information than just "accept the null hypothesis": The heat transfer variance may have changed, and actually may have decreased, not increased.

By considering a new hypothesis H: σ² > 0.2, the 95% one-sided critical value is χ²_{11, 0.05} = 4.5748. Since the data value of 4.112 is more extreme, we reject the hypothesis that the variance has increased, with 95% confidence. The p-value for the σ² > 0.2 hypothesis is 0.033475, again a value less than the conventional 0.05 level of significance, supporting rejection of the hypothesis.
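A similar computational cross-check of the single-variance test, using the Example 6.13 summary values (a Python sketch with scipy assumed):

from scipy import stats

n, S2, var0 = 12, 0.074767, 0.2                   # Example 6.13 summary values
chi2 = (n - 1) * S2 / var0                        # Equation (6.20): 4.112
lo, hi = stats.chi2.ppf([0.025, 0.975], n - 1)    # 3.8157 and 21.9200
p_two_sided = 2 * stats.chi2.cdf(chi2, n - 1)     # 2 * 0.033475 = 0.06695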
6.8 Tests Concerning Two Variances
The most common two-variance test is that of equality under the null hypothesis H0: σ1² = σ2² vs. HA: σ1² ≠ σ2². The null hypothesis is usually stated H0: σ1²/σ2² = 1 vs. HA: σ1²/σ2² ≠ 1. This particular test is of considerable importance when examining the variability of production lines, the effect of changing production parameters such as a raw material source, the efficiency of different machines, etc. For two independent normal populations X1,i, i = 1, …, n1, and X2,j, j = 1, …, n2,
F = (S1²/σ1²) / (S2²/σ2²) = (χ1²/(n1 − 1)) / (χ2²/(n2 − 1))  (6.21)
is the appropriate statistic for testing any hypothesis about two population variances. If the equality hypothesis is true, σ1² = σ2², and then F = S1²/S2² is distributed with ν1 = n1 − 1 and ν2 = n2 − 1 degrees of freedom, commonly referred to as the numerator and denominator degrees of freedom, respectively. The expected value of the data-calculated F-value is ~1. The corresponding confidence interval for F is
P( F_{n1−1, n2−1, α/2} ≤ F ≤ F_{n1−1, n2−1, 1−α/2} ) = 1 − α  (6.22)
Table 6.2 shows the appropriate hypotheses and their corresponding acceptance and critical regions for examining relations between two variances. The rejection regions for the calculated values of F follow the same pattern as those for χ² in Table 6.1.
TABLE 6.2 Testing the Equivalent Hypotheses Involving Two Variances
Statistic: F = S1²/S2², ν1 = n1 − 1, ν2 = n2 − 1

Null hypothesis    Alternative hypothesis    Rejection region
H0: σ1² = σ2²      HA: σ1² ≠ σ2²             F > F_{ν1, ν2, 1−α/2} or F < F_{ν1, ν2, α/2}
H0: σ1² ≤ σ2²      HA: σ1² > σ2²             F > F_{ν1, ν2, 1−α}
H0: σ1² ≥ σ2²      HA: σ1² < σ2²             F < F_{ν1, ν2, α}
Example 6.14: Pilot plant runs were made on two variations of a process to produce crude naphthalene. Product purities for the several runs are given below. In each series, all conditions were controlled in the normal manner, and there is no evidence from the log sheets of bad runs.

Product purity (% naphthalene)
Conditions A: 76.0, 77.5, 77.0, 75.5, 75.0
Conditions B: 80.0, 76.0, 80.5, 75.5, 78.5, 79.0, 78.5
The development engineer reports that, on the basis of these data, Conditions B give better purities, but uniformity is poorer at these conditions. Do you agree? To answer the question about product purity, you should use a one-tailed t-test of H0: μB > μA vs HA: μB ≤ μA. We address the claim that uniformity is worse for B by examining the ratio of variances.

1. Assume that both populations are normally distributed.
2. H: σB²/σA² > 1 vs. HA: σB²/σA² ≤ 1.
3. The test statistic will be F = SB²/SA² with ν1 = 7 − 1 = 6 and ν2 = 5 − 1 = 4.
4. Set α = 0.05.
5. The critical region is F ≤ F_{6,4,0.05} = 0.220572.
6. From the data, SB² = 3.571428, SA² = 1.075, and F = 3.3223.
7. The sketch illustrates the CDF of F with 6 and 4 degrees of freedom and the CDF = α = 0.05 critical region.
8. As F = 3.3223 ≰ 0.220572, we do not reject the H at the 95% confidence level. We therefore conclude with 95% confidence:
The variance at Conditions B could be worse than at Conditions A.
The same result is obtained if F = SA²/SB² (which necessitates the reversal of both inequalities in step 2 above).

Example 6.15: Going back to Example 6.8, let's test the equality of variances. From the data, F = S1²/S2² = 0.005427/0.045689 = 0.1188. The critical values of F are F_{9,9,0.005} = 0.15288 and F_{9,9,0.995} = 6.54109. The F calculated from sample data is in the 99% rejection region, and does not support H0: σ1² = σ2². Therefore, our use of the Case 3 t-test in Example 6.8 was correct.
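A sketch of the same two-variance check in Python (scipy assumed; sample sizes of 10 are implied by the 9,9 degrees of freedom quoted above):

from scipy import stats

S1_sq, S2_sq, n1, n2 = 0.005427, 0.045689, 10, 10   # from Example 6.8
F = S1_sq / S2_sq                                   # ~0.119
lo = stats.f.ppf(0.005, n1 - 1, n2 - 1)             # 0.15288
hi = stats.f.ppf(0.995, n1 - 1, n2 - 1)             # 6.54109
# F < lo: the data fall in the 99% rejection region for equal variances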
If the hypothesis is that σ1² differs from σ2² by the ratio k, then the presumption is σ1² = kσ2². Then F = (S1²/σ1²)/(S2²/σ2²) = (S1²/S2²)(σ2²/σ1²) = (S1²/S2²)(1/k), and if the hypothesis is true, F = (S1²/S2²)(1/k) = k(1/k) = 1. But this test requires a presumption of k.
6.9 Characterizing Experimental Distributions

Another frequent use of the χ² test is in comparing observed counts (frequencies) with the corresponding expected values if the population is distributed according to a particular theoretical distribution (normal, exponential, etc.). See Section 3.4.7. In this case, we use χ² as defined by

χ² = Σ_{i=1}^{k} (Oi − Ei)²/Ei  (6.23)
where the Oi are the observed frequencies (the count in each category), the Ei are the expected frequencies (counts), and k is the number of classes (categories) into which the data have been divided. Under certain conditions, this statistic is approximately distributed as χ2 as defined by Equation (3.62). The corresponding confidence limit on χ2 is
P( χ²_{ν, α/2} ≤ χ² ≤ χ²_{ν, 1−α/2} ) = 1 − α  (6.24)
If the value of χ² calculated from sample data falls within the interval described by Equation (6.24), we will accept the null hypothesis that the population probably is distributed as we expected. The use of this confidence interval is termed a goodness-of-fit test. One caution: Percentages can only be used as counts in this test if the sample size is exactly 100.

To reject the null hypothesis, H0: The distribution is as expected, the χ² value calculated from sample data must be unusually large. (If the experimental distribution perfectly fit the theoretical, then the perfect value of χ² would be zero.) Accordingly, this approach uses a one-sided test described by
P( χ² ≤ χ²_{ν, 1−α} ) = 1 − α  (6.25)
If χ² > χ²_{ν, 1−α}, this null hypothesis should be rejected. If χ² = 0, the distribution is exactly as expected (but this also might raise the flag of suspicion).
Note that the degrees of freedom are ν = k − 1 − p, where k is the number of classes and p is the number of population parameters which have been calculated from the sample data. If you are testing whether a distribution is normal, X̄ and S² might be estimated from sample data. As a result, ν = k − 1 − 2 = k − 3.

If some of the expected counts you calculate are "small", the results of the goodness-of-fit test may be incorrect. As a conservative rule of thumb, if all of the Ei ≥ 5, the test is valid. If Ei < 5, that ith class should be combined with a neighboring class. A less conservative, but also accepted, restriction is that each Ei must be >1 and no more than 20% of the Ei may be <5.

Example 6.16: […] >12 classifications are eliminated. Does the life expectancy of these components follow an exponential distribution with mean μ = 4? The cumulative exponential distribution is modeled by Equation (3.43) and is F(x) = CDF(x) = 1 − e^(−αx), where α = 1/μ. This α is the population parameter and has nothing to do with a level of significance. For this proposed distribution, α = 0.25, and the units are per 100 hrs. As with any cumulative distribution function, P(x1 < X ≤ x2) = F(x2) − F(x1). To obtain the expected frequency E2 for the second class, we need P(2 < X ≤ 4) = (1 − e^(−0.25·4)) − (1 − e^(−0.25·2)) = 0.63212 − 0.393469 = 0.23865 for a single value. For a sample of 100 items, the corresponding value of E2 is 23.865.

We have tabulated the expected values of the classes under the assumption that the population does follow the anticipated exponential distribution. Also tabulated are the corresponding contributions of each class to χ², the test statistic. If we select α = 0.05, the critical region is χ² ≥ χ²_{ν, 1−α} = χ²_{5, 0.95} = 11.0705. As the calculated χ² contributions sum to 6.8454, we accept that the exponential distribution with μ = 4 (100 hrs) may correctly model the life distribution, using a χ² goodness-of-fit test at the 95% level of confidence.
6.10 Contingency Tests A third use of the chi-squared test is the so-called contingency test. In it, sample members are classified into mutually exclusive classes such as yes/no, success/failure, and
on-grade/off-grade. This test is commonly used in quality control and materials testing and for attribute comparisons. Data for a contingency test are organized into rows and columns. The test is used to determine whether the two attributes (represented by the rows and columns) are independent or not. In the contingency test, if χ² ≥ χ²_{ν, 1−α}, where ν = (r − 1)(c − 1) and r and c are the number of rows and columns, respectively, then the samples are probably not from the same population. For this test, χ² is defined as
χ² = Σ_i Σ_j (Oij − Eij)²/Eij  (6.26)
Again, the expected values for each ij category must be >1, with no more than 20% of the Eij < 5, in order for this statistic to be distributed as χ². A more conservative restriction is that each Eij must be >5. In addition, if ν = 1, the absolute value of each (Oij − Eij) pair must be decreased by 0.5 before squaring.

Another use of the chi-squared test is for evaluation of the experimental or sampling procedure itself. It is possible, though improbable, that χ² could be very small. If that situation occurs, it suggests that the categories are unusually similar as a result of poor experimental technique, an improper experimental design, contrived data, etc. In this case, the null hypothesis could be H0: The experiment appears reasonable vs HA: The data have too little scatter and therefore are suspicious. For these hypotheses, the critical region is χ² ≤ χ²_{ν, α}, where χ² is as defined by Equation (6.26).

Example 6.17: Melt spinning is a method of producing a man-made fiber (also termed a synthetic fiber). In this manufacturing procedure, molten polymer is extruded through tiny holes (on the order of 0.002 in diameter) in a spinneret. Flow instabilities cause breaks in the filaments, leading to the production of bobbins containing insufficient fiber (short bobbins). The shape profile of the spinneret holes can affect the frequency of short bobbins, an important factor in plant productivity. The data below compare two new spinneret designs with the standard design now in use.
               Type A   Type B   Standard
Full bobbins      300      400        300
Short bobbins      70       40         60
Total             370      440        360
Is there a statistically significant difference in spinneret performance?
1. We are sure that the data are from dichotomous populations: The bobbins can be classified according to yarn weight as full or short.
2. H0: The samples are from the same population vs HA: The samples are probably from different populations.
3. The test statistic is χ².
4. Select α = 0.05.
5. For this data, ν = (c − 1)(r − 1) = (2 − 1)(3 − 1) = 2. The critical region is χ² ≥ χ²_{2, 0.95} = 5.9915.
6. Of the total of 1,170 bobbins, there are a total of 170 short bobbins. Accordingly, if all spinnerets have similar performance, we expect 170/1,170 = 0.145299, or about 15%, of bobbins in each category to be short and about 85% to be full. As there are 370 bobbins from type A, 440 from type B, and 360 from the standard design, the expected values Eij are
               Type A     Type B     Standard
Full bobbins   316.2393   376.0684   307.6923
Short bobbins   53.7607    63.9316    52.3076
7. As calculated from Equation (6.26), χ² = 10.282. The sketch illustrates the CDF of χ²/ν. With ν = 2, the critical value is 5.9915/2 = 2.9957 and the data value is 10.282/2 = 5.141.
8. As χ² > χ²_critical, we must reject H0 and conclude with 95% confidence that the samples are from different populations, i.e., that:

The spinnerets are probably different.

A quick look at the original data supports this conclusion: The standard and Type A spinnerets produce about 17% short bobbins, but the Type B spinneret produces only 10% short bobbins. But that quick look does not consider the variability, which the test does. However, the test only indicates that the three treatments are not identical. It does not indicate which is different, nor which is best. (It may be that the cost of Type B spinnerets exceeds the short/full benefit, or that Type B has other, untested deficiencies.)

Example 6.18: Reconsider the comparison of three spinneret designs in Example 6.17 but with different experimental results. The alternate test result is shown below. The question is, "Are the results unexpectedly identical?"
               Type A   Type B   Standard
Full bobbins      300      400        300
Short bobbins      58       81         60
In this new situation, χ² = 0.0623. As χ²_{2, 0.05} = 0.1026, we should reject the test results with 95% confidence, as the data do not appear to exhibit random scatter: The test results may be biased.
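The contingency arithmetic also scripts easily. A sketch for the Example 6.18 table (Python with scipy assumed; scipy applies no continuity correction here because ν = 2):

import numpy as np
from scipy import stats

obs = np.array([[300, 400, 300],    # full bobbins (Example 6.18)
                [ 58,  81,  60]])   # short bobbins
chi2, p, dof, expected = stats.chi2_contingency(obs)  # chi2 ~ 0.062, dof = 2
too_uniform = stats.chi2.cdf(chi2, dof) < 0.05        # True: suspiciously little scatter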
6.11 Testing Proportions

A proportion, P, is the number of successes, s (or the number in a classification), out of n trials. From experimental data, P = s/n. The binomial distribution usually would model
the distribution of the experimental P, given n and p, the probability of an individual trial giving a success. If n is large and P is not near an extreme of 0 or 1, then the z- or t-distribution is valid for hypothesis tests. But testing proportions should really use the binomial distribution. If the value of p0 is known (and n is large and p0 is not near the extremes of 0 or 1), then the Z-statistic is

Z = (P − p0)/√(p0 q0 / n) = (s − n p0)/√(n p0 q0)  (6.27)
The hypotheses involving single proportions are analogous to those involving a single mean. If you use the definition of Z in Equation (6.27) and replace μ and μ0 by p and p0, you will have the table you need for testing hypotheses about single proportions (Table 6.3). When faced with comparing the equality of two proportions, where P is estimated from n data, use

T = [(P1 − P2) − (p1 − p2)] / √( P1(1 − P1)/n1 + P2(1 − P2)/n2 )  (6.28)
The test of this hypothesis is carried out in exactly the same way as the Case 2 t-test. If the inequality of proportions must be tested, use Equation (6.28) after replacing that pooled error with Satterthwaite's estimate. Alternately, the χ² contingency test may be used.

TABLE 6.3 Testing a Single Proportion p
Statistic: Z = (P − p0)/√(p0(1 − p0)/n)

Null hypothesis    Alternative hypothesis    Rejection region
H0: p = p0         HA: p ≠ p0                Z > z_{1−α/2} or Z < −z_{1−α/2}
H0: p ≤ p0         HA: p > p0                Z > z_{1−α}
H0: p ≥ p0         HA: p < p0                Z < −z_{1−α}

Example 6.19: After n = 25 coin flip trials, the number of Heads is 10. The experimental proportion is 0.4. The hypothesized proportion is 0.5. Can we reject the fair coin hypothesis?

1. H0: p = 0.5. This is a test with a two-sided rejection region.
2. α = 0.05. Splitting the two rejection regions equally, the rejection CDF values are CDF ≤ 0.025 and CDF ≥ 0.975.
3. The data P = 10/25 = 0.4.
4. With n = 25 and the hypothesized p = 0.5, the CDF = 0.025 and CDF = 0.975 limits are s ≤ 8 and s ≥ 17. The sketch illustrates the binomial distribution and the critical values.
5. Since the experimental s = 10 is within the critical value limits, we cannot reject the fair coin hypothesis.

The data indicate that the coin might be fair (95% confidence level using the binomial distribution). The p-value for such an extreme experimental finding, should the hypothesis be true, is 0.21: not very close to the rejection value, raising only moderate suspicion.

Repeating the analysis using the standard normal approximation: Z from Equation (6.27) is Z = (P − p0)/√(p0 q0 / n) = (0.4 − 0.5)/√(0.5(1 − 0.5)/25) = −1. The rejection regions are Z ≤ −1.95996 and Z ≥ +1.95996. Since the experimental value is not within the rejection region, we accept the null hypothesis. The next sketch illustrates the standard normal distribution with the critical values. Notice the similarity of Z = −1 in the standard normal sketch to the binomial sketch above with s = 10.

Here the p-value is 0.158, compared to a p-value of 0.21 using the binomial distribution. Although the results are similar with the large n and p far from the extremes, the standard normal approximation is a convenience, not the true approach.

Example 6.20: The manufacturer claims a failure probability of 0.0001 on a particular product, based on quality control testing under normal use. 1,000,000 have been sold and there have been 437 failures reported by customers. Does the manufacturer speak the truth?

1. H0: p = 0.0001. This is a test with a two-sided rejection region.
2. α = 0.05. Splitting the two rejection regions equally, the rejection CDF values are CDF ≤ 0.025 and CDF ≥ 0.975.
3. The data P = 437/1,000,000 = 0.000437.
4. With n = 1,000,000 and the hypothesized p = 0.0001, the CDF = 0.025 and CDF = 0.975 limits for the binomial model are x ≤ 81 and x ≥ 120. The sketch illustrates the binomial distribution and the critical values.
5. Since the experimental x = 437 is beyond the critical value limits, we reject the hypothesis that the customer experience is compatible with the manufacturer's claim of a 0.0001 failure probability.
However, this does not mean that the manufacturer did not tell the truth. Soccer balls are designed to be kicked on the grass, but if the dog plays catch with one and bites a hole in it, the ball fails. If a mechanic leaves the channel-lock wrench out in the rain, the tool will rust, not work, and fail. If a parent lets the kids use the cell phone as a music blaster all day at the side of the pool, it will fail. Who knows what abuse a customer might create for a product! If products are not used as the manufacturer designed, the customer-reported failure rate may not be the manufacturer's issue. Here is a case of a legitimate statistical test which can be misinterpreted as providing evidence to support a false claim about a supposed cause-and-effect mechanism.
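A minimal sketch of the binomial critical limits used in Example 6.20 (Python with scipy assumed):

from scipy import stats

n, p0 = 1_000_000, 0.0001
lo = stats.binom.ppf(0.025, n, p0)   # ~81 failures
hi = stats.binom.ppf(0.975, n, p0)   # ~120 failures
x = 437                              # reported failures: far beyond hi -> reject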
6.12 Testing Probable Outcomes The probability of an event happening is a proportion, where p is the number of events (successes, s) per number of trials, n, as the number of trials approaches infinity. If you expect a probability to be 50%, and do one trial, you either get a success or a not-success, but with one trial you have no confidence to reject the hypothesis. In n = 12 flips of a fair coin, you expect s = 6 successes. If s = 5, or if s = 7 you would not reject the fair coin hypothesis, but if s = 1, you might very well claim foul. There is a probability that a fair contest would result in only 1 (or fewer) successes out of n = 12 trials when p = 0.5, given by the binomial distribution as Equation (3.11).
F(xi | n) = P(X ≤ xi | n) = Σ_{k=0}^{xi} C(n, k) p^k (1 − p)^(n−k)  (6.29)

where C(n, k) is the binomial coefficient.
With s = xi = 1, n = 12, and p = 0.5, F(1|12) = 0.00317…. There is only a 0.32% chance that a fair contest would return so few successes. If you are considering rejecting the fair coin hypothesis for either too few or too many successes, then there is a 0.64% chance of getting either s ≤ 1 or s ≥ 11 in n = 12 flips. You could claim the hypothesis is not true with a 99.36% confidence.
Example 6.21: There is a claim that a component from Supplier B frequently causes our manufactured product to fail (in 3 out of 10 instances), but components from Supplier A are nearly faultless. Fifteen samples each of product with components from A and from B were randomly sampled and subjected to accelerated testing. None of the 30 products failed. Can you reject or accept the hypothesis that components from B make a less reliable product (that they cause a probability of failure p = 0.3)?

Here, with products containing the B component, s = 0, p = 3/10 = 0.3, and n = 15. We will reject the one-sided hypothesis if s = 0 is improbably too few, given the hypothesis that p = 0.3. The cumulative binomial distribution reveals that
F(0|15) = Σ_{k=0}^{0} C(15, k) 0.3^k (1 − 0.3)^(15−k) = (1 − 0.3)^15 = 0.004747…
The p-value of the results is about 0.00475, resulting in a 99.5% confidence in rejecting the claim that B leads to product failures. Example 6.22: A medical test has both false positives and false negatives. If a person has the disease, the test will correctly indicate so in 95% of the cases, but this means there are 5% false negatives. Five percent of the tests on patients with the disease will not detect the disease. Alternately, if a person does not have the disease, the test will correctly identify that in 85% of the cases. This means that it will report 15% false positives. Fifteen percent of the tests on patients without the disease will indicate that the patient does have the disease. The doctor does not know whether the patient has the disease or not, and one test would not be definitive. Two tests could either affirm or contradict each other. So, the doctor prescribes 4 tests. The results are 1 positive indication and 3 negative indications. Can either hypothesis be rejected? If the patient has the disease, then s = 1, p = 0.95, and n = 4. We will reject the one-sided hypothesis if s = 1 is improbably too few. The cumulative binomial distribution reveals that
F(1|4) = Σ_{k=0}^{1} C(4, k) 0.95^k (1 − 0.95)^(4−k) = 0.000481…
The p-value of the results is about 0.00048, giving a 99.95% confidence in rejecting the hypothesis that the patient has the disease. If the patient does not have the disease, then s = 3, p = 0.85, and n = 4. We will reject the one-sided hypothesis if s = 3 is improbably too few. The cumulative binomial distribution reveals that
F(3|4) = Σ_{k=0}^{3} C(4, k) 0.85^k (1 − 0.85)^(4−k) = 1 − 0.85^4 = 0.4780…
The p-value of about 0.48 does not provide adequate justification for rejecting the hypothesis that the patient does not have the disease.
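The two cumulative binomial values in Example 6.22 can be verified directly (a Python sketch, scipy assumed):

from scipy import stats

# 1 positive out of 4 tests if the patient HAS the disease (p = 0.95 per test)
p_if_diseased = stats.binom.cdf(1, 4, 0.95)   # 0.000481...: reject "has the disease"
# 3 or fewer positives out of 4 if the patient does NOT (p = 0.85 per test)
p_if_healthy = stats.binom.cdf(3, 4, 0.85)    # 0.478...: cannot reject "does not have it"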
6.13 Takeaway

The hypothesis is about a population aspect, not about the sample. You have the sample data and can make definitive statements about the sample; but you do not have all the population data, so you cannot make definitive statements about the population.
We presume that the sample average, sample standard deviation, or sample proportion are best estimates of the population values. Be sure that this is reasonable. The precise statement of a statistical conclusion is often unintelligible to a non-statistician, and we usually restate the conclusions to generate the appropriate business engineering action. In doing so, we must take responsibility not to misrepresent the limits of the test. The hypothesis is about some data-based attribute of the population that would be expected to be manifest if the supposition is true. The hypothesized feature must be legitimate for the supposition. Statistics is not the decision. Use statistical analysis as one of many criteria to decide action. The statements will be dependent on the tests used, the level of significance for rejection, the hypothesis, the choice of the number of data, and number of bins in a histogram. All of these are user-choices. It is relatively easy to structure the test to get an outcome that you might want. Be careful to structure the test to be a legitimate indicator of the reality, not a contrived appearance of an affirmation of someone’s bias. We must test all features that are of importance. For example, both mean and variance of seatbelt breaking strength are important. Certainly, a treatment that gives greater strength is a benefit, but if the treatment also causes greater variability, the fraction of belts with breaking strength below some minimum acceptable value may make the effect of that treatment worse. Be sure that all important features are tested before making recommendations. Don’t test only one aspect of the treatment. The important decision is not the “accept” or “reject” of the hypothesis. What is important is the action, and that must be based on the context and the statistical outcome. Be aware of and sensitive to the context.
6.14 Exercises

1. Define these terms:
a. critical region
b. hypothesis
c. primary hypothesis
d. null hypothesis
e. alternate hypothesis
f. level of significance
g. critical values
h. Type 1 error
i. Type 2 error
j. Reject
k. Rejection region
l. Accept
m. Acceptance region
n. Two-sided test
o. One-sided test
p. p-value
2. Repeat Example 6.14 with two alternate hypotheses: 1) the variances are equal, and 2) A is more variable than B. If you choose one of these, and another person chooses the other, how would you settle the conflicting "accept" claims?
3. Choose any example from this chapter and change the value of alpha. Do the outcomes make sense? How would you choose an appropriate value of alpha?
4. Repeat the analysis of Example 6.21 or of Example 6.22 using either the normal z- or t-statistic. Comment on the new results and the validity of the z- or t-statistic.
5. Use Example 6.8 to test other possible hypotheses.
6. Show that the discussion at the end of Section 6.6 is true.
7 Nonparametric Hypothesis Tests
7.1 Introduction

Probability distributions are described by specific mathematical functions (Chapter 3) with a few parameters that adjust the location and scale of the curve. For the normal distribution, those parameters are μ and σ. For the Poisson distribution, the single parameter is λ. Parametric tests are hypothesis tests concerning the values of those parameters, such as mean, standard deviation, or probability. Parametric hypothesis tests, such as those in Chapter 6, are valid only if the distribution of the sampled data is similar to that of the population.

However, there are many other data attributes that can be meaningful. These include the number of runs in the data, the median value, the largest deviation, or the count of data values above or below a hypothesized value. These are not coefficient values in a mathematical model of the distribution. Further, hypothesis tests commonly involve the assumptions that the random variable is normally distributed and that it has a uniform variance throughout the test region. However, many situations are nonnormal (particle size distributions, queue incidence distributions, failure rate distributions) or have a nonuniform variance throughout the variable range (pH, composition). For such situations, hypothesis tests such as those in Chapter 6 are inappropriate.

Nonparametric tests are used when the population distribution is either not known or cannot be assumed. Because nonparametric tests are not based on an assumption about the population distribution, they are more widely applicable than parametric tests. However, because knowledge of the population distribution is not utilized, nonparametric tests are less powerful than parametric tests; they require more data to arrive at the same statistical conclusion with the same confidence level. The bottom line is as follows: If the population distribution is known, use the appropriate parametric tests because of their greater efficiency. Use nonparametric tests in any of these cases:

1. The population distribution cannot be assumed.
2. The population distribution type is being tested.
3. The within-treatment samples are known to come from different populations (perhaps the variance is not the same).
4. The data are ranked values.
5. The data are nominal or ordinal classifications (chair/table, on/off, preferred/not preferred, above/below, greater than/less than).
7.2 The Sign Test

The sign test is used to compare the median (not mean) values of either paired data or data and their corresponding estimates. We wish to emphasize that the parametric paired tests of Chapter 6 required treatments to be homogeneous and tested for differences of the means. This nonparametric test does not require homogeneous experimental units and compares median (not mean) differences.

Consider a crop experiment in which two seed preservation treatments are being compared to "determine" their effect on crop yield. Seeds from treatments A and B are planted on side-by-side 1-acre plots throughout the geographic region in which the crop is grown. In this way, both A and B will experience a variety of similar soil conditions and vagaries of climate. Consequently, median crop yields from A and B will be directly comparable over a wide range of conditions. There is no justification for assuming that the yield data will be normally distributed. In fact, the soil and climate treatment that seeds A and B experience in each planting are different. The data pairs do not come from the same population of uniform growing conditions. One treatment will be rated as better if it shows a larger median crop yield in a significant number of the adjacent plots. It is common practice to use paired observations to exclude the effect of uncontrolled variables.

In general, there will be a set of paired data, the measured and/or predicted test results of treatments A and B, {(A1, B1), (A2, B2), …, (Ai, Bi), …, (An, Bn)}. If treatments A and B are equivalent, then we expect Ai to be greater than Bi in half of the pairings. If the number of pairings in which Ai is greater than Bi is significantly more or less frequent than half, then the A and B treatments are probably not equivalent.

For this crop example, the data might be as shown in Table 7.1. The number of pairings, 10, is low but was chosen to keep the example simple. The third row in the table is the sign of the Ai − Bi data. If the treatments are equivalent, we expect 5 + and 5 − signs; however, there is only one + sign. If the treatments are equivalent, 1 + sign is an unexpectedly low number and suggests that the treatments are different. However, if A and B are equivalent, it is indeed possible to have only 1 + sign out of 10 pairings, just as it is possible to have only 1 Head out of 10 coin flips. What is the probability of one event occurring out of 10 possible events when the individual event probability is 50%? Using the binomial distribution, the probability for each composite event is listed in Table 7.2 as calculated from Equation (3.9).

TABLE 7.1 Crop Yield (Tons/Acre) for Treatments A and B

Location:         1    2    3    4    5    6    7    8    9   10
Treatment A:     2.1  1.8  2.5  2.6  1.1  1.7  2.2  1.5  1.2  2.1
Treatment B:     2.6  1.7  2.7  3.0  1.5  1.8  3.3  2.0  1.3  2.3
Sign of Ai − Bi:  −    +    −    −    −    −    −    −    −    −
TABLE 7.2 Binomial Probabilities for Composite Events When Individual Event Probability is 1/2

P( 0 | 10) = 0.0009765625
P( 1 | 10) = 0.0097656250
P( 2 | 10) = 0.0439453125
P( 3 | 10) = 0.1171875000
P( 4 | 10) = 0.2050781250
P( 5 | 10) = 0.2460937500
P( 6 | 10) = 0.2050781250
P( 7 | 10) = 0.1171875000
P( 8 | 10) = 0.0439453125
P( 9 | 10) = 0.0097656250
P(10 | 10) = 0.0009765625
From Table 7.2, one can read that the probability of 0 events or 10 events out of 10 possible events (all + or all − signs) is P(0 | 10) + P(10 | 10) = 0.001953125. Further, the probability of 1 event or 9 events out of 10 possible events (one + or one − sign) is P(1 | 10) + P(9 | 10) = 0.01953125. Consequently, the cumulative probability of 1 or 0 occurrences of either sign is 0.00195… + 0.01953… = 0.021484375. If treatments A and B are equivalent, there is only a 0.0214… chance that such a lopsided or even more distorted distribution would have occurred. We can reject the hypothesis that the medians (not means) of treatments A and B are equivalent at the (1 − 0.0214…) · 100 = 97.85…% confidence level. Normally, this statement would become, "we are 98% confident that B is better than A." But be aware of the personal selection of B and of the untested evaluation "better", when only one of many attributes is measured.

To clarify this caution, observe that the hypothesis this illustration tested was that the yields of treatments A and B are equivalent. The statistic generated was the number of + signs, and the hypothesis would be rejected if there were either improbably too few or too many + signs. Had the hypothesis been that treatment B is better than treatment A, the test would be quite different and would require more information. We would still choose the number of + signs as the statistic, and we would expect all of the signs to be +. If there are improbably too few + signs, we would reject the hypothesis. To do so, however, we would need the probability that Bi would be better than Ai. This situation requires knowledge of the distributions of Ai and Bi, information that we don't have. This nonparametric test illustration hypothesized that A and B are equivalent, hence the random chance that Ai is better than Bi is 50% regardless of the distribution. You must take care to follow the hypothesis test procedure precisely and to state the conclusion carefully.

As a helpful hint, you do not have to use Equation (3.9) to generate either the point binomial probabilities or the cumulative binomial probabilities. The Excel function is CDF = BINOM.DIST(s, n, p, 1). Alternately, Table A.1 provides critical values for the sign test.

As a final observation, the discrimination ability of the test increases with the number of paired observations. For example, to be 98% confident that A and B are different in a 10-pair sample, one must have 1/10 or fewer pairs with one type of sign. For a 40-pair sample, only 30% or fewer of the data must have the same sign.
To apply the sign test:
1. Obtain paired observations.
2. Hypothesize that the median values of the treatments are equivalent.
3. Choose α.
4. Determine the number of + signs, m, and − signs, n. Disregard tied data. Nominally m + n = N = the total number of data; with tied data, m + n < N.
5. Set R = min{n, m}.
6. Use F = CDF = 1 − α and N to determine rF from the binomial distribution. In Excel, rF = BINOM.INV(N, p = 0.5, F). Alternately, use Table A.1 to obtain the critical value of R.
7. If R ≤ rF, reject the hypothesis at the α level of significance.

Note that the median, not the mean, is tested. The median is the data value for which half of the observations are larger (better) and half are smaller (worse). The sign test does not require numerical values. One could use any classification system to obtain observed differences (like/don't like, window/door) that could also be classified as +/−.

Example 7.1: Automobile tire rubber compound formulations A and B have been laboratory-tested and appear equivalent. If in-use tests confirm their equivalence, your company will switch to formulation B. As a first in-use test, 6 cars in your company fleet were fitted with formulation A tires on the left side and formulation B tires on the right side. The tires were switched sides every month so that there was no preferential roadside treatment. The tread lifetime (miles) of each of the 24 tires was measured and recorded as follows:
              Car 1    Car 2    Car 3    Car 4    Car 5    Car 6
Front tire A  73,080   69,117   84,307   78,692   75,453   71,439
Front tire B  75,325   71,970   83,076   84,417   73,208   75,003
Rear tire A   77,215   74,744   66,090   82,422   78,161   78,214
Rear tire B   79,039   79,556   68,446   80,913   87,912   76,500
Are the medians of treatments A and B equivalent? The A and B treatments are paired by car and by location, so there are 12 pairings. The hypothesis is that the median tread lifetime (miles) of formulation A is equivalent to that of formulation B. Choose α = 0.05. Using Ai − Bi to determine the paired mileage difference, the number of "+" differences is n = 4. The number of "−" differences is m = 8, making the total number of nonzero differences N = 12 and R = 4. Assuming p = 0.5 (the tires are equivalent), at F = 1 − α = 0.95 and N = 12, rF = 3. We find R = 4 ≰ rF = 3.
Accordingly, using a sign test based on the tread life (miles) of 12 tire pairs, we cannot reject the hypothesis that the median tread lives of formulations A and B are equivalent at the 95% confidence level. However, the count is very near the reject criterion. One would normally simply state that, based on tire wear, the two formulations are not confidently different, but that there is suspicion that B is better.
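A sketch of the sign-test arithmetic for Example 7.1 (Python with scipy assumed; the text's exact critical value rF came from Table A.1):

from scipy import stats

N, R = 12, 4                                   # 12 nonzero differences, 4 "+" signs
p_two_sided = 2 * stats.binom.cdf(R, N, 0.5)   # 0.388: cannot reject equivalence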
When observations have relative numerical values, the sign test is primitive and inefficient when compared to a chi-squared contingency or ANOVA test. However, the sign test is applicable to any dichotomous classification and forms the conceptual basis of the signed-rank test that follows. If formulations A and B gave different performances in Example 7.1, the differences were not great enough to be shown by the sign test. However, note that formulation B was better more often than A: when B was better, it was markedly better; and when B was worse, it was slightly worse. The data suggest that tire B is better, but the sign test does not utilize enough information to make such a statement. Signed-rank tests, however, are more powerful.
7.3 Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test uses the relative magnitudes of the "+" and "−" deviations as additional information to include in a sign type of test. Consequently, it is a more efficient test than the sign test (it requires fewer points at the same level of significance, or is more discriminating with the same number of points). However, while the Wilcoxon signed-rank test is a nonparametric test, it carries the additional restrictions that the treatment differences can be ranked and that the two populations are symmetric: The chance of a sample being z units greater than the median is identical to the chance of a sample being z units less than the median. To apply the Wilcoxon signed-rank test:
1. Obtain N paired observations.
2. Check that the data are symmetrically distributed.
3. Hypothesize that the treatment medians are equivalent.
4. Choose α.
5. Rank the absolute values of the paired observation differences in ascending order. Ranks go from 1 to N and include pairs with zero difference. In case of pairs with identical differences, assign each the average rank value.
6. Sum the ranks of the paired differences that had positive differences, S.
7. Use F = 1 − α and N to determine s_{α/2,N} and s_{1−α/2,N} from Table A.2.
8. If S ≤ s_{α/2,N} or if S ≥ s_{1−α/2,N}, reject the hypothesis.

Example 7.2: Use the Wilcoxon signed-rank test to analyze the data in Example 7.1. The following table lists the tread lifetimes of the 12 pairs of tires, the differences, the ranks based on absolute values, the ranks of positive differences, and the sum of the ranks of positive differences. The hypothesis (A and B have equivalent median tread life) and level of significance (α = 0.05) are the same as those in Example 7.1. Visually, the mileage data of each treatment appear symmetric about its respective median.
Pair   Formulation A       Formulation B       Difference   Rank of          Rank of
       tread life (miles)  tread life (miles)  (miles)      absolute value   + difference
1      73,080              75,325              −2,245        5.5
2      69,117              71,970              −2,853        8
3      84,307              83,076              +1,231        1               1
4      78,692              84,417              −5,725       11
5      75,453              73,208              +2,245        5.5             5.5
6      71,439              75,003              −3,564        9
7      77,215              79,039              −1,824        4
8      74,744              79,556              −4,812       10
9      66,090              68,446              −2,356        7
10     82,422              80,913              +1,509        2               2
11     78,161              87,912              −9,751       12
12     78,214              76,500              +1,714        3               3
                                                                     Σ = 11.5 = S
Note that the tied absolute values of the differences for pairs 1 and 5 share the 5th and 6th rank values of 5.5. From Table A.2, using F = 1 − α = 0.95 and N = 12, s0.025,12 = 14 and s0.975,12 = 64. We find that S = 11.5 < s0.025,12 = 14. Accordingly, using the Wilcoxon signed-rank test based on the tread life (miles) of 12 tire pairs, we reject the hypothesis that formulations A and B have equivalent median performance at the 95% confidence level. One would normally simply state: Based on tire wear, the medians of formulations A and B are probably different.

Note that the test does presume that the mileage data of each treatment are symmetric about the respective median. This is just visually supported. Even though each of the two tread life columns seems symmetric about its respective median, the difference column is not symmetric.
In this example, the Wilcoxon signed-rank test recognized that there was probably a difference in formulations A and B. By contrast, the sign test ignored the relative magnitude of the paired differences and could not confidently report any formulation effect. The Wilcoxon signed-rank test is preferred but is valid only if the data are symmetrically distributed. You can use the technique of Sections 7.6 or 7.7 to test for distribution symmetry.
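For comparison, scipy's signed-rank routine applied to the same tire data (an illustration; the text's procedure uses Table A.2):

import numpy as np
from scipy import stats

A = np.array([73080, 69117, 84307, 78692, 75453, 71439,
              77215, 74744, 66090, 82422, 78161, 78214])
B = np.array([75325, 71970, 83076, 84417, 73208, 75003,
              79039, 79556, 68446, 80913, 87912, 76500])
res = stats.wilcoxon(A, B)   # res.statistic = 11.5, the smaller signed-rank sum
# res.pvalue < 0.05 -> reject equivalent medians, agreeing with the table lookup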
7.4 Modification to the Sign and Signed-Rank Tests Treatment A may be better than treatment B. If the differences can be numerically quantified, you can hypothesize a median difference and adjust the Ai − Bi difference by C (units or %). Then use either the sign or Wilcoxon signed-rank test to test the hypothesis that the adjusted medians are equivalent. You could carry out this procedure for a variety of C values and find the hypothesis rejection region and thereby the confidence interval on C. Example 7.3: Is the median tread life (miles) of formulation B in Example 7.1 better than that of formulation A by 2,200 miles?
Recalculating the differences in Example 7.1 as Ai − Bi + 2200 gives the following data:
Pair   Adjusted difference (miles)   Rank of absolute value   Rank of + differences
1      −45                            1
2      −653                           4
3      +3,431                         7                        7
4      −3,525                         8
5      +4,445                        11                       11
6      −1,364                         5
7      −376                           3
8      −2,612                         6
9      −156                           2
10     +3,709                         9                        9
11     −7,551                        12
12     +3,914                        10                       10
                                                        Σ = 37 = S
There is no evidence that the tire mileage data are not symmetric; therefore, the Wilcoxon signed-rank test can be used. The hypothesis is that the median value of the 2,200-mile-adjusted data is 0. Choose α = 0.05. There are N = 12 data pairs, and the sum of the positive ranks is S = 37. From Table A.2, we obtain s0.025,12 = 14 and s0.975,12 = 64. We find S = 37 ≰ s0.025,12 = 14 and S = 37 ≱ s0.975,12 = 64. Accordingly, using the Wilcoxon signed-rank test based on the tread life (miles) of 12 tire pairs, we cannot reject the hypothesis that the median tread life of formulation B is 2,200 miles greater than that of formulation A at the 95% confidence level. Or more simply: the median life of formulation B could be 2,200 miles greater than that of A.
Note: This does not mean that B is 2,200 miles greater than A. If you hypothesized that the difference was 1,500 miles, you would conclude that it might be true. A difference of 2,500 miles might also be true. Someone will make a decision on the findings. Be sure to present the results in a manner that does not mislead a reader.
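Using the Example 7.2 differences, the 2,200-mile shift of Example 7.3 is a one-line variation (a Python sketch, scipy assumed):

import numpy as np
from scipy import stats

d = np.array([-2245, -2853, 1231, -5725, 2245, -3564,
              -1824, -4812, -2356, 1509, -9751, 1714]) + 2200
res = stats.wilcoxon(d)   # statistic = 37, between the critical values 14 and 64
# res.pvalue is large -> cannot reject a 2,200-mile median advantage for B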
7.5 Runs Test

If there is a dichotomous population with p = 0.5, then one expects half of the events to be of one kind and half of the other kind, such as A/B, heads/tails, +/−, …. In addition, if each event is independent of previous events, then each kind of event should be randomly scattered throughout the sequence of trials. For example, sequentially measured process data may be observed to increase, I, or decrease, D, with respect to the previous measurement. If a process is operating steadily, then measured data should reflect random and independent events, and one would expect Is and Ds to occur randomly throughout a sequence, each with probability p = 0.5. If, however, the sequence were IIIIIDDDIIIIDDDDIIIDDDIII, you would suspect that the system was oscillating under the influence of some driving force (perhaps an improperly tuned controller) and not at steady conditions. In
addition, if the sequence were DDIIIIDIDIIDIIIIIIIII, you would suspect that some external event is causing the process variable to increase regularly and that the process is not at steady conditions. A run is a sequence of identical dichotomous events. The eight runs in the following series are underscored.
HH T H TT HHHHH T H TT
If there are either too few or too many runs in a dichotomous population with p = 0.5, you would suspect that the events are not independent of all previous events. Too few runs indicate long periods where some influence persists. Too many runs, as if one value being high makes the next low, and vice versa, suggest that some "force" is shaping sequential data.

The runs test is useful in statistical process control (see Chapter 21). It is also useful as a test for independence of sequential data (Example 7.4), and for identification of systematic nonlinear differences between data and models (Chapter 19). The runs test will require at least 12 observations for you to be able to detect or "see" improbably too-few runs, and 20 or more to "see" improbably too-many runs. To apply the runs test to detect an improbably too-few number of runs (persistence in the data):

1. Collect N sequential data from a dichotomous population with an expected p = 0.5.
2. Hypothesize that the events are not clustered, but that one type of event is randomly distributed throughout the sequence.
3. Choose F = 1 − α.
4. Count m and n, the number of each type of event; m is the number of the fewer-occurring events. Count U, the number of runs.
5. Use α, m, and n to interpolate u_{α,m,n} from Table A.3a. If N > 20 observations, assume m = n and use Table A.3b.
6. If U ≤ u_{α,m,n}, reject the hypothesis.

You may occasionally encounter a third type of event: a tie or neutral rating. For example, there are also 8 runs in this sequence:
H0 T H TT HH0HH T H T0
A tie extends the prior run. Occasionally, one may test for too many runs, or oscillating behavior; then if U ≥ u1 − α,m,n, reject the null hypothesis. Or if one is testing for either too few or too many runs, use the two-sided test and reject if either U ≤ uα/2,m,n or U ≥ u1 − α/2,m,n. Example 7.4: The country of Zard has two political parties, Nationalists and Patriots, which supply most of the elected officials. The following sequence represents the controlling party over the past 15 years.
NNNNPNNPPPNPPPP
Is the controlling party an independent random variable? Or does control by one party extend its control?
The population is dichotomous. Since the numbers of Ns, m, and Ps, n, are 7 and 8, respectively, there is no reason to suspect that the event probability is other than 0.5. Consequently, we can apply the runs test to test for independence. There are U = 6 runs in the sequence. We will choose a 0.05 level of significance. From Table A.3a, we find that u0.05,7,8 = 4, and that U = 6 ≰ u0.05,7,8 = 4. Consequently, using the runs test on a 15-year sequence of controlling party, we cannot reject the hypothesis that the party in power has no influence that extends its control, at the 95% confidence level. More simply, one would state: there is insufficient information to conclude that once a Zard political party gets in power, it stays there for a while.
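Counting runs is trivial to script; a sketch for Example 7.4 (Python):

from itertools import groupby

seq = "NNNNPNNPPPNPPPP"              # controlling party, 15 years
U = sum(1 for _ in groupby(seq))     # groups of identical events: 6 runs
m, n = seq.count("N"), seq.count("P")  # 7 and 8
# compare U = 6 with u(0.05, 7, 8) = 4 from Table A.3a: cannot reject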
7.6 Chi-Squared Goodness-of-Fit Test

In Section 6.9, we showed how the chi-squared distribution can be used to compare measured frequencies (or the number of data of a particular classification) with the expected frequencies (or numbers). The same test can be used to compare a measured probability distribution to a hypothesized distribution, perhaps to test whether the measured data are normally distributed. To apply the chi-squared goodness-of-fit test:

1. Collect N data points.
2. Partition the range of the data into n cells, with no fewer than one observation in any cell and no more than 20% of the cells with fewer than five observations. Preferentially, have at least 5 observations in each cell.
3. Count Oi, the number of observations in each ith cell.
4. Hypothesize that deviations of the data histogram from a particular probability density function are small. You could use the data X̄ and S, for instance, to parameterize a particular normal probability density function.
5. Choose α.
6. Calculate the degrees of freedom, ν = n − 1 − k, where k is the number of parameters in the hypothesized distribution that were determined from the data. If the data had been used to parameterize a Gaussian distribution, k would be 2.
7. Compute the chi-squared statistic

χ² = Σ_{i=1}^{n} (Ei − Oi)²/Ei

where Ei is the expected number of observations in the ith cell. For the special case where ν = 1, some sources advise a correction for continuity by reducing the absolute value of each nonzero value of (Ei − Oi) by 0.5.

8. If χ² ≥ χ²_{ν, 1−α}, reject the hypothesis that the data fit the model.

Our guide for the number of bins, n, is that it should be about the square root of the number of observations, N. Use equal bin intervals to be able to "see" the pdf. Alternately, choose bin intervals so that each bin has the same number of expected observations, to have the strongest test results.
The restriction on the number of observations per cell stated in Step 2 is standard accepted practice but is a subjective rule based on experience with the chi-squared test. The chi-squared distribution is approximately equal to the statistic as calculated in Step 7 when Ei is not too small. Some statisticians suggest choosing partitions such that no cells have fewer than five observations in order to ensure the valid use of the χ² distribution.

If the observed distribution is different from the hypothesized distribution, then (Ei − Oi) will be large and the calculated χ² statistic will be large. If it is improbably too large, it indicates that the two distributions are probably different. However, an improbably too-small χ² is also cause for concern. One expects some deviations between the observed and hypothesized distributions. Too small a χ² could indicate contrived data or poor experimental technique. However, because most χ² tests only look for unexpectedly large differences, a single-tailed test based on χ²_{ν, 1−α} is indicated in Step 8.

Example 7.5: Liquid is continuously flowing at 1 m³/min through a 4 m³ volume in a well-stirred tank. The liquid volume remains constant. One thousand tracer particles are instantaneously dumped into the tank, rapidly become uniformly dispersed, and begin flowing out of the tank with the liquid. As some of the particles leave, the rest become uniformly dispersed, and the probability of a particle being near the exit and caught by the exit fluid decreases. As time progresses, the rate at which particles leave diminishes. Initially, many particles exit rapidly and have a low residence time in the tank. Later, some particles still remain in the tank and have a longer residence time. The particle residence time distribution is an important mixing and chemical reaction design criterion. The distribution has been measured for this tank by screening the tank effluent for successive 1-minute periods and counting the number of particles. The data follow. Does the measured distribution match the expected exponential distribution?

Period:            1    2    3    4    5    6    7    8
Measured number:  203  185  132  110   85   65   57   40
The expected distribution is exponential with τ = F/V = (1 m³/min)/(4 m³) = 0.25 min⁻¹. No distribution parameter was calculated from the measured frequency data; consequently, k = 0 and ν = 8 − 1 − 0 = 7. As measured, the data are partitioned into eight cells, none of which has fewer than five particles. Choose α = 0.05. With τ = 0.25 min⁻¹, the expected numbers of particles leaving in each minute of operation are obtained from Equation (3.43).

Period:            1    2    3    4    5    6    7    8
Expected number:  221  172  135  104   81   64   49   39
Ei − Oi:           18  −13    3   −6   −4   −1   −8   −1

χ² = Σ_{i=1}^{8} (Ei − Oi)²/Ei
   = 18²/221 + (−13)²/172 + 3²/135 + (−6)²/104 + (−4)²/81 + (−1)²/64 + (−8)²/49 + (−1)²/39
   = 4.4063613…
The critical chi-squared value could be obtained from many sources. In Excel, χ²_{7, 0.95} = CHISQ.INV(0.95, 7) = 14.0671. We find χ² = 4.4… ≱ χ²_{7, 0.95} = 14.0671. Accordingly, using the χ² goodness-of-fit test on the 0- to 8-minute tank data, we cannot reject the hypothesis that the tank has a residence time distribution equivalent to the expected exponential distribution with τ = F/V, at the 95% confidence level. There is insufficient evidence to say that the mixer design is bad. Simply, the mixer appears to work properly.
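The entire goodness-of-fit computation of Example 7.5 takes only a few lines (a Python sketch, numpy and scipy assumed):

import numpy as np
from scipy import stats

observed = np.array([203, 185, 132, 110, 85, 65, 57, 40])
edges = np.arange(0, 9)                        # 1-minute class boundaries
expected = 1000 * np.diff(1 - np.exp(-0.25 * edges))   # 221, 172, 135, ...
chi2 = np.sum((expected - observed) ** 2 / expected)
# ~4.2 with unrounded expected counts (4.406 with the rounded values above)
crit = stats.chi2.ppf(0.95, 7)                 # 14.0671 -> cannot reject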
7.7 Kolmogorov-Smirnov Goodness-of-Fit Test

Although the χ² test is a parametric test based on normally distributed data, in practice its efficacy is not undermined by other distributions, and its use as a goodness-of-fit test has become standard practice. However, the chi-squared test results are sometimes questioned because the experimenter can bias the results with the selection of the sample partition locations. In addition, ten or more cells (of five or more observations each) are preferred for the chi-squared goodness-of-fit test in order to discriminate between distributions. One may not always have the luxury of 10 × 5 = 50 observations. The Kolmogorov-Smirnov test avoids experimenter bias and can be effectively applied to five, but preferably ten or more, observations.

The Kolmogorov-Smirnov test is applied to the cumulative distribution of the data to determine whether it is different from a specified cumulative distribution function, F(X), where X is a random variable. To apply the Kolmogorov-Smirnov test:

1. Specify the cumulative distribution function, F(X).
2. Hypothesize that the observed distribution is equivalent to the specified distribution.
3. Choose F = 1 − α.
4. Sample n times and list the observed values of X in ascending order: x1, x2, …, xi, xi+1, ….
5. Compute the empirical cumulative function Fn(X) = k/n, where k is the number of samples with a value less than or equal to x. Fn(X) is a step function that changes value at each xi and holds that value for xi ≤ X < xi+1.
6. Compute D, the maximum absolute value of Fn(X) − F(X) over the entire range of X, not simply at observed xi values.
7. Obtain d_{F,n} from Table A.4.
8. If D ≥ d_{F,n}, reject the hypothesis.

Example 7.6: Computer chips are made by repeatedly applying films and then removing unwanted micron-sized areas by a chemical etch. This process leaves "wires" and "insulators" and eventually builds transistor-type components. In the 1980s, about 300 computer chips were simultaneously built on a 4 inch disk of single-crystal silicon. In an effort to understand the manufacturing process for making computer chips, an engineer has developed a phenomenological model that attempts to predict the probability distribution of the number of bad chips made on a 325-chip silicon wafer during an etch step. If the prediction is equivalent to the measured distribution, then a
162
Applied Engineering Statistics
phenomenological understanding of the process can be claimed, and the engineer will know how to improve the process. Empirical data are obtained by randomly sampling and inspecting the half-built computer chips for evidence of bad etching. The number of etch-damaged chips found on 30 wafers follows. Wafer No. of bad chips
Wafer:             1   2   3   4   5   6   7   8   9   10
No. of bad chips:  6   11  5   7   6   13  10  15  13  14

Wafer:             11  12  13  14  15  16  17  18  19  20
No. of bad chips:  19  6   15  15  18  13  14  14  13  5

Wafer:             21  22  23  24  25  26  27  28  29  30
No. of bad chips:  16  14  12  6   13  13  7   20  12  9
The cumulative probability function that was predicted is

X = No. of bad chips:  0   2     4     6     8     10    12    14    16    18    20
F(X):                  0   0.02  0.05  0.11  0.20  0.30  0.43  0.76  0.90  0.94  0.96
Apply the Kolmogorov-Smirnov test to determine whether the theoretical and empirical distributions are equivalent.

The random variable X is the number of bad chips. Values of xi are those measured. In ascending order, the xi values are 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 18, 19, and 20, representing x1 through x14. For x4 (9 bad chips), there are k = 9 wafers with 9 or fewer bad chips. Consequently, the Fn(xi) values are listed below:

xi:      5      6      7      9      10     11     12     13     14     15     16     18     19     20
k:       2      6      8      9      10     11     13     19     23     26     27     28     29     30
Fn(xi):  0.067  0.200  0.267  0.300  0.333  0.367  0.433  0.633  0.767  0.867  0.900  0.933  0.967  1.000
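A sketch of steps 5 and 6 for this example, assuming NumPy is available for the linear interpolation of F(X) at odd X:

```python
import numpy as np

counts = [6, 11, 5, 7, 6, 13, 10, 15, 13, 14, 19, 6, 15, 15, 18,
          13, 14, 14, 13, 5, 16, 14, 12, 6, 13, 13, 7, 20, 12, 9]
n = len(counts)

X_pred = np.arange(0, 21, 2)   # predicted F(X) is tabulated at even X
F_pred = [0, 0.02, 0.05, 0.11, 0.20, 0.30, 0.43, 0.76, 0.90, 0.94, 0.96]

D = 0.0
for X in range(21):  # beyond X = 20, |Fn - F| < 0.04 and cannot set the max
    F = float(np.interp(X, X_pred, F_pred))   # interpolates F at odd X
    Fn = sum(c <= X for c in counts) / n      # empirical step function
    D = max(D, abs(Fn - F))

print(round(D, 3))  # 0.112, occurring at X = 7
```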
Although xi had only 14 values from 5 to 20, X may have 326 values from 0 to 325. The absolute value of the difference between Fn(X) and F(X) must be checked at all possible values of X. The values are listed below, where F(X) at odd values of X has been obtained by linear interpolation.

X:               0      1      2      3      4      5      6      7      8      9      10
F(X):            0      0.01   0.02   0.035  0.05   0.08   0.11   0.155  0.20   0.25   0.30
Fn(X):           0      0      0      0      0      0.067  0.200  0.267  0.267  0.300  0.333
|Fn(X) − F(X)|:  0      0.01   0.02   0.035  0.05   0.013  0.090  0.112  0.067  0.050  0.033

X:               11     12     13     14     15     16     17     18     19     20     21–326
F(X):            0.365  0.43   0.595  0.76   0.83   0.90   0.92   0.94   0.95   0.96   >0.96
Fn(X):           0.367  0.433  0.633  0.767  0.867  0.900  0.900  0.933  0.967  1.00   1.00
|Fn(X) − F(X)|:  0.002  0.003  0.038  0.007  0.037  0.00   0.02   0.007  0.017  0.040  <0.04

… X hypothesis. Then if not rejected, even if they are equivalent, you can claim, "Our testing indicates that we accept Y > X". The outcome of the statistics will likely be shaped by your choice of the hypothesis. Don't let your bias or agenda misrepresent the reality within the context. At least, support an "accept/reject" conclusion with the associated p-value. Perhaps, also reveal the results with alternate hypothesis choices.

The hypothesis might be that the treatments X and Y are economically equivalent. As an example, consider that the treatments are raw material suppliers for a process. The economics would include many factors, such as price, receiving packaging, waste, possible processing speed, yield, and many more. X may be better in some attributes and worse in others. The hypothesis would not be on one attribute, but on the holistic economic impact.

If the current supplier is X, and Y is being qualified by tests, then there will be an additional cost if Y is accepted. This would represent the management of change (MoC), which would include the revision of operating documents to acknowledge the change, the additional burden of tracking which raw material is in which products, etc. It also considers the cost of the trials. So, the hypothesis should not be based simply on processing economics; it should include the MoC costs. In order to accept Y, the economic benefit of Y, ĒY, should exceed that of X by the MoC. If the MoC is a one-time cost, and the economic benefits of X and Y are annual benefits (rates), then the two need to be normalized. The MoC could be normalized by a target pay-back time (PBT). Then the t-statistic for comparing X and Y would be
$$T = \frac{\bar{E}_Y - \left(\bar{E}_X + \dfrac{MoC}{PBT}\right)}{s_{\text{pooled on }\bar{E}}} \tag{10.1}$$
Be sure that all the numerator and denominator terms have the same units. Further, it may be difficult to economically quantify some important factors. One such factor could be the peace of mind that having an alternate supplier brings: robustness to a supply chain failure if there is a single supplier. Another might be the potential leveraging of relations with supplier Y that are related to other initiatives, maybe even a possible merger. If there are such concerns, apply a value to the sum of all of them, an equivalent value, Ēequivalent, added to the nominal Ē. Then structure the T statistic as
$$T = \frac{(\bar{E}_Y + \bar{E}_{\text{equivalent}}) - \left(\bar{E}_X + \dfrac{MoC}{PBT}\right)}{s_{\text{pooled on }\bar{E}}} \tag{10.2}$$
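A numerical sketch of Equation (10.2); every dollar figure below is hypothetical, chosen only to illustrate the arithmetic and the unit normalization:

```python
# Hypothetical inputs: annual benefits in $/yr, a one-time MoC cost in $,
# a target pay-back time in yr, and the pooled standard deviation in $/yr.
E_Y, E_X = 125_000.0, 100_000.0   # mean annual benefits of Y and of X
E_equivalent = 5_000.0            # value placed on hard-to-quantify factors
MoC, PBT = 40_000.0, 4.0
s_pooled = 9_000.0

T = (E_Y + E_equivalent - (E_X + MoC / PBT)) / s_pooled
print(round(T, 2))  # 2.22 -> compare to the t-critical value for your alpha
```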
Again, be sure that the terms are dimensionally consistent.

10.2.2 Truncating or Rounding

Conventionally, the rule is to carry two more digits than are justified by uncertainty, then to round the answer to report only significant digits. But humans can choose to truncate rather than round, or can choose the number of digits that are significant. If the statistic is close to the critical value, then these human choices can make a reject outcome become accept, or vice versa. For instance, if the critical value is 2.48452 and the experimental value is 2.4199, you would not reject. But if you claim there are only two significant digits and truncate both to 2.4, the statistic appears to equal the critical value, and you could show the audience that the conclusion is to reject.

10.2.3 The Conclusion

The conventional conclusion is to either accept or reject. But this does not reveal how close the situation was to the alternate conclusion. So, be sure to indicate to your audience how close the claim is. Report a p-value, or report the level of confidence that would generate the alternate conclusion.

10.2.4 Data Collection

Design the experiment to represent the population, all conditions, not just a subset of environmental factors, equipment condition, operators, etc. Randomize experimental order to minimize correlation. Consider that you have historical operating data on raw material from supplier X, and before the process equipment is to be taken down for maintenance you run tests on material from supplier Y. If the equipment needs maintenance, then the equipment condition may have a larger impact on manufacturing statistics than the difference between material from the two suppliers. Consider that when operators are on the day shift, and in the presence of process engineers, maintenance personnel, and plant management, they may pay closer attention to manufacturing conditions than when they are on the evening and night shifts. If you choose to run tests of a new treatment during the day shifts (so that it can be properly supervised) or on the evening shifts (so that the trials don't interfere with daytime activities), then the tests will not represent the same conditions as all production.

10.2.5 Data Preprocessing

Grouping data into bins is commonly performed to see the histogram, or to use chi-squared tests. But if the bins are too wide, all the data fit into one, and the detail is not visible. Alternately, if there are too many bins, perhaps as many as there are data, the histogram with just 0, 1, or 2 counts in each bin just reveals noise. The user must choose bin centers and bin widths. However, in doing so, the user can make it appear that treatment X is different from treatment Y. As a political parallel, gerrymandering is the practice of manipulating political boundaries so that the number of delegates resulting from votes will favor the control of a political party. If you wish, you can pretend that you have a good reason for your choice of the selection of bins, when the real reason is to bias the outcome of a statistical test. But don't.
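A small illustration of how the bin choice changes what the audience sees, a sketch using NumPy with arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=60)  # 60 arbitrary observations

# The same data summarized with three user-chosen bin counts
for bins in (3, 8, 30):
    counts, _ = np.histogram(x, bins=bins)
    print(bins, counts)
# 3 bins hide the shape; 30 bins on 60 points mostly display noise.
```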
10.2.6 Data Post Processing

Elimination of outliers should be a permissible activity. If, after the data analysis, it is realized that a data point, or sequence of data, had been corrupted and does not represent the population, then it should not be included in statistics that seek to describe the population. Like allocating data to bins in preprocessing, the user can choose not to seek outliers if the data suit a personal bias, or use any number of justifications to select outliers if doing so makes the remaining data support a bias. Don't be so tempted. There are several accepted statistical practices for detecting outliers, but these are associated with the requirement that the action be reported, and that some legitimate mechanism be identified to justify the data rejection. See Chapter 15 for one criterion, Chauvenet's.

10.2.7 Choice of Distribution and Test

If the distribution of the data is normal, then a t-test would be appropriate for assessing the difference between means. But if the statistic is a proportion, or if the results are Poisson or binomially distributed, the normal approximation may not be valid. In non-normal cases, if the value of the statistic is near an extreme value or generated from a low N, then the t-test would not be appropriate. Similarly, F-tests should be used on variances, and nonparametric tests should be used on medians and runs. Choose the test that is appropriate to the distribution of the data. Of course, if you do not like a particular outcome from a test, you might be tempted to see if an alternate test returns the decision you might favor. But, please don't shape your test to support your desired result. So, defend your choice of a test as correct for the distribution of the data.

10.2.8 Choice of N

If you want to accept the hypothesis, use a small number of trials to generate the statistic. In such a case, there will be so little certainty about the statistic value that you will not be able to confidently reject the hypothesis. Alternately, if you want to reject the null hypothesis, keep collecting data. Eventually, the statistic will be beyond the critical value. If two treatments are identical, and α = 0.05, then in 5% of the testing the statistic will be beyond the critical value. Keep collecting data until this happens, then you can claim reject with seeming statistical validity. Classically in statistics, the number of data is defined by choices of α (the Type I error, the probability of rejecting the hypothesis when it is true) and β (the Type II error, the probability of accepting the hypothesis when it is false, which also depends on a specified difference between the rejectable and true statistic). Of course, even in the idealized mathematics and truth of statistics, N, alpha, and beta are human choices. Be sure you can defend the legitimacy of your choices within the situation context.

10.2.9 Parametric or Nonparametric Test

Both test types are legitimate. If you want to accept a hypothesis, and a parametric test rejects it, then you may be tempted to try a nonparametric test. Parametric tests are grounded in the coefficient values (parameters) of a particular distribution. For example, the parameters of a normal distribution are the mean and variance. Predicated on a particular data distribution, parametric tests are more sensitive than nonparametric tests.
Parametric tests can reject hypotheses with fewer data. If the parametric test is appropriate, don't discard it because you want an alternate outcome.

10.2.10 Level of Confidence, Level of Significance, T-I Error, Alpha

The four terms Level of Confidence, Level of Significance, T-I Error Probability, and Alpha are equivalent in hypothesis testing. Three of the terms, Level of Significance, T-I Error Probability, and Alpha, are identical. They range between 0 and 1, and level of confidence is 100(1 − α). If you are using the classic dichotomous hypothesis testing, you use α to define the critical value of the statistic, the boundary between the reject and the accept results.

For instance, consider that the test is two-sided, using a t-statistic to compare Treatments X and Y. The null hypothesis is that X = Y, and we will reject if X is confidently either greater or less than Y. If there are n = 10 replicates, and the experimental value is T = 2.2153, then a choice of α = 0.05 (the 95% confidence) has a t-critical value of t(ν, α/2) = 2.2622 and would not reject the equivalence hypothesis. The claim would be to accept that X and Y are equivalent. The report might be, "I can state with a 95% confidence that X and Y are equivalent." However, at the 90% confidence, with α = 0.1, the t-critical value is t(ν, α/2) = 1.8331, and the claim would be to reject the hypothesis that X and Y are equivalent. Now the report might be, "I can state with a 90% confidence that X and Y are different."

Both 90% and 95% confidences are fairly strong, and they could be used to influence an audience to accept either claim. If you want product Y to be accepted as equivalent to X, then use the 95% confidence. If you want the company to reject product Y and favor X, then use the 90% confidence. The data processor's choice can distort the situation. So, reporting a p-value is strongly recommended to let the audience know how close the accept/reject decision is. In the above case the p-value is ≈ 0.054, which has a corresponding reject confidence of 94.6%. The interpretation is: If X and Y are equivalent, and the experimental results are normal, the T-statistic value should be about zero. There is only a 5.4% chance that the experimental T-value could be as extremely nonzero as the 2.2153 value (or a higher value) obtained.

The 5.4% value can be placed in perspective: If X and Y are not equivalent, if their means difference, μX − μY = X̄X − X̄Y, is 2.2153 s/√10, then the T = 2.2153 result would be expected. (Recall that T = (X̄X − X̄Y)/(s/√N).) There is a 50% chance that the experiment would generate data with that, or a higher, value. So, if one hypothesis has a 5% chance of such a high or higher value, and the other hypothesis has a 50% chance, the 10:1 ratio of probabilities (odds) favors the not-equivalent case. We should reject the equivalent hypothesis.

Note: The 10:1 odds of that example are like flipping a fair coin 4 times and losing each time, four in a row. It is certainly a possible outcome, but not expected. Whether the odds are 9:1 or 11:1, it seems the better decision is to reject the null hypothesis. Whether alpha is 7% or 5% or smaller, the odds indicate a reject decision. There is no certainty, just probable cause. Classic statistical testing uses values in the range of 90% to 99% confidence levels, which means alpha is in the 0.01 to 0.1 range. A question is: what is the right value?
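A quick check of these numbers, a sketch assuming SciPy:

```python
from scipy import stats

T, n = 2.2153, 10
df = n - 1

print(stats.t.ppf(0.975, df))  # 2.2622, two-sided critical t for alpha = 0.05
print(stats.t.ppf(0.95, df))   # 1.8331, two-sided critical t for alpha = 0.10
print(2 * stats.t.sf(T, df))   # ~0.054, the two-sided p-value for T
```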
Suppose, however, the p-value was 0.28 (meaning that if the two treatments were equivalent, there is a 28% chance that the experiment could generate data that extreme). With the same consideration as above: If X and Y are not equivalent, if their means difference is μX − μY = X̄X − X̄Y = Ts/√N, then the T result (whatever value it is) would be expected. There is a 50% chance that the experiment would generate data with that, or a higher, value. So, if one hypothesis has a 28% chance of such a high or higher value, and the other
hypothesis has a 50% chance, the ~2:1 ratio of probabilities slightly favors the not-equivalent case, but it is not unexpected. To put this into perspective, this is the same as flipping a fair coin once and not winning. But somewhere between the p-values of ≈ 0.28 and ≈ 0.05 is the "could be" region where no decision can be confidently made. One option is to continue experiments, to collect more data to have greater precision in the comparison. But that may not be practicable.

The question remains: What are the right values of α and β for testing a hypothesis? We would offer that you consider the consequences of rejecting the hypothesis if it is true, and the consequences of accepting a minimally wrong condition, weighted by the undesirability of the time and cost penalty of an excessively large number of trials. This exercise seems to be a nonlinear optimization which is dependent on quantifying several categories of concerns. So, as a general rule, perhaps aim for the conventional α ≈ 0.05 for economic-only considerations, α ≈ 0.02 for quality issues, and much lower values where issues of very high concern (health, safety, life, loss prevention, etc.) are involved. If you deviate from the historical norm, clearly defend why with logic grounded in the application context.

10.2.11 One-Sided or Two-Sided Test

For the same level of confidence, or the complement level of significance, alpha, the two-sided test places half the rejection area on each side. The one-sided test places all of the rejection area on the same side. This means that the critical value of the statistic is nearer to centrality for the one-sided test and farther (into the more extreme region) for the two-sided test.

As an example, for degrees of freedom ν = 10 and level of significance α = 0.05, the one-sided t-critical is t(critical) = 1.8125. The two-sided t-critical is t(critical) = 2.2281. If you choose the two-sided critical value, a larger difference between two treatments will be required to reject the hypothesis. Similarly, for degrees of freedom ν = 10 and level of significance α = 0.05, the one-sided chi-squared critical to reject a large difference is χ²(critical) = 18.307. The two-sided upper value is χ²(critical) = 20.4832. Again, if you choose the two-sided critical value, a larger difference between two treatments will be required to reject the hypothesis.

But you don't have a choice about using a two-sided or a one-sided test. If the hypothesis is that two treatments are equal, then if one is either too large or too small you would reject the equal hypothesis. You must use the two-sided test. By contrast, if the hypothesis is that one treatment is better than the other, then you are interested in rejecting the hypothesis if the data indicate it is not better. Here you need to use the one-sided test. There may be a temptation to use the one-sided or two-sided critical value that supports a decision you might want to have supported. Don't make that choice. Let the hypothesis determine the rejection region.

10.2.12 Choosing the Quantity for Comparison

There are probably multiple criteria for judging whether one treatment is better than another. And the criteria have disparate dimensional units and impact. Here is one author's example: Should I drive on path A or path B? My son's house was across town, about 20 minutes away. I could drive there, first going east, then north (Treatment A). Or first going north then east (Treatment B). Path A is more direct, but it has more traffic.
Path B has more open roads but many turns and hills. So, initially we
often took alternate paths and found that the travel time of Treatment A (Choice A, Path A) is less on average, but not always. Treatment B is a mile longer, meaning it costs more in gasoline. Treatment B goes through the country, and the isolation from population, traffic lights, traffic, and service facilities is more aesthetically pleasing. Alternately, B is less available to rescue if needed, and more prone to hitting a wild animal. Although the traffic and aesthetics of Treatment A are less desired, the security is higher, but so is the risk of being involved in a "fender-bender".

We could measure only travel time for the two treatments, and then do a t-test to see if they are different. But this discounts the several other attributes that are important. This example discussed a few:

• Travel time, minutes.
• Travel cost, $ associated with gas, tire wear, etc.
• Security, an emotional rating associated with the convenience, or lack, of help if needed. This might be ranked or valuated on a 0 to 10 basis.
• Aesthetics, an emotional preference for open-land beauty contrasting the undesirability of traffic congestion and old buildings on crumbling parking lots. Again, this might be ranked or valuated on a 0 to 10 basis.

In manufacturing, the disparate issues might be product yield, customer-perceived aesthetics, political capital, processing cost, quality variability, manufacturing flexibility, environmental risk, or many others. Whatever your application, there are likely to be many aspects of desirability or undesirability of the treatment. One treatment might be better for one criterion, but worse for another criterion. Don't make a comparison solely on one of the many criteria that are probably important. If you choose only one criterion, then you might bias the management decision.

One approach is to measure, compare, and report treatment characteristics relative to each of the several criteria. If one treatment is better in one aspect, but worse in another, you could report the p-values for each aspect, and let the audience choose the treatment. This is a multi-objective approach, reporting the metric for each attribute separately, and letting the decision-maker balance the several objectives and make the choice.

Alternately, one could combine the several metrics. However, they will often have different dimensional units and impact relevance, so they cannot simply be added. In the travel example above, the several metrics for the path options were travel time, travel cost, aesthetic appeal, and safety. The metric values for each criterion, vi, could be multiplied for a total evaluation value.
$$E_{\text{treatment}} = \prod_i v_{i,\text{treatment}} \tag{10.3}$$
Then the units on the combined evaluation for each treatment, E(treatment A) and E(treatment B), would be the same. However, this makes the importance of one value proportional to the value of the others. An additive combination assesses each value as remaining independent of the others. However, some sort of scaling is needed to be able to add terms with disparate units. A Lagrange-type multiplier approach is commonly used.
$$E_{\text{treatment}} = \sum_i \pm\,\lambda_i v_{i,\text{treatment}} \tag{10.4}$$
But this means that someone must supply λi values that provide both unit scaling and impact scaling. Our favored approach to determining the lambda weighting is to use equal concern scaling from an ideal.
$$E_{\text{treatment}} = \sum_i \pm\,\frac{v_{i,\text{treatment}} - v_{i,\text{ideal}}}{EC_i} \tag{10.5}$$
The ideal value, v(i, ideal), represents the ideally desired outcome. In the travel path option, for travel time it would be zero minutes. For travel cost it would be zero $. In aesthetic beauty it would be a perfect 10. In the equation, v(i, ideal) is actually not needed: in the difference between treatments it is added and subtracted out.
$$E_{\text{treatment change}} = E_{\text{treatment B}} - E_{\text{treatment A}} = \sum_i \pm\,\frac{v_{i,B} - v_{i,A}}{EC_i} \tag{10.6}$$
But often, acknowledging the ideal helps assign the equal concern scaling values. However, just considering the sum of scaled performance values without the ideal is also fully functional.
$$E_{\text{treatment}} = \sum_i \pm\,\frac{v_{i,\text{treatment}}}{EC_i} \tag{10.7}$$
The ± sign is required. The impact of some terms (such as travel time) is undesirable, and some (such as trip aesthetics) are desirable. To make the treatment evaluation, E, represent desirability, use a "+" sign for all terms which represent desirability, and a "−" sign for all representing undesirability. The equal concern scaling factors, ECi, represent the relative subjective concern and importance for each term. The EC values have the same dimensional units as the respective numerators, so each of the terms in the sum is dimensionless. To assign these:

1. First, choose a deviation for one item that is cause for concern (or joy). For instance, a 5-minute lengthening of the trip time, or a 3 lb/sec improvement in throughput, or a 0.5% increase in interest. Choose a value that has some meaningful concern but is not excessive. This is one of the EC values.
2. Consider all of the concern (and joy) issues that such a deviation might invoke. Perhaps someone is waiting and needs to wait 5 minutes more. Perhaps the round trip costs you 10 minutes from getting back to watching the game or returning to housework. Some considerations may have little to no impact. If the person waiting deserves it, then adding 5 minutes to their wait has little negative impact. If the person waiting is going to do you a favor, then making them wait has a substantial impact. Consider all aspects of the deviation, the EC factor, and feel the level of concern (or joy) associated with that deviation.
3. One-by-one, for each remaining term in the sum, choose a value of the EC factor that raises the same level of concern (or joy) as that chosen in Step 1.
Note: The EC factors are context-dependent. As an example: During a period of sold-out capacity, getting 1% more production might be very important. On another day, when sales are only 75% of capacity, the 1% throughput increase has no impact.
Example 10.1: Treatments A and B represent processing temperatures. Treatment A is the current operating temperature. Treatment B, the higher temperature being explored, speeds up production rate by 5 units/day, but costs $1/unit more, and increases product quality variability by 0.8 kg/L. The increase in production is moderately desirable and is used as a basis for assigning equal concern factors: EC(productivity) = 5 units/day. It has been decided that a cost change of $3/unit has the same level of concern: EC(cost) = $3/unit. It has also been decided that a change in variability of 0.2 kg/L has the same level of impact: EC(variability) = 0.2 kg/L. From Equation (10.6):
$$E_{A\to B} = E_B - E_A = \sum \pm\,\frac{v_{i,B} - v_{i,A}}{EC_i} = +\frac{5\ \text{u/day}}{5\ \text{u/day}} - \frac{\$1/\text{u}}{\$3/\text{u}} - \frac{0.8\ \text{kg/L}}{0.2\ \text{kg/L}} = -3.33$$
Even though the productivity increase is beneficial, the concern over loss in quality makes that quality term dominate the total performance assessment of the treatment difference. The switch to the higher temperature is not desirable.

Note: In another year, the perceived loss of quality might not be of much concern, and if the equal concern over variability is EC(variability) = 2 kg/L, then E(A to B) = +0.267, and a change would be justified.
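A sketch of the Equation (10.6) arithmetic with the Example 10.1 numbers; the signs mark each term as desirable (+) or undesirable (−):

```python
def treatment_change(deltas, concerns, signs):
    """deltas[i] = v_iB - v_iA; concerns[i] = EC_i, in the same units."""
    return sum(s * d / ec for s, d, ec in zip(signs, deltas, concerns))

E_AtoB = treatment_change(deltas=[5, 1, 0.8],     # units/day, $/unit, kg/L
                          concerns=[5, 3, 0.2],   # the chosen EC values
                          signs=[+1, -1, -1])
print(round(E_AtoB, 2))  # -3.33: the variability term dominates
```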
Equation (10.7) represents an overall performance, quality, or attribute evaluation for the treatment. After N trials it will have a variability, which can be calculated as the standard deviation of the N E(treatment) values. Alternately, you might only have one trial, one value for each element in Equation (10.7). If you have experience of the variation of the individual factors, you can estimate the variation on E(treatment) by propagation of variance.

$$\sigma_E = \sqrt{\sum \left(\frac{\sigma_{v_i}}{EC_i}\right)^2} \tag{10.8}$$
Then the standard error of the average treatment evaluation would be

$$\sigma_{\bar{E}} = \sqrt{\sum \left(\frac{\sigma_{v_i}}{EC_i}\right)^2 / N} \tag{10.9}$$
And a t-statistic used to compare averages (assuming the variation is equivalent, Case 2) is
T=
EB - EA
s 2EB s 2EA + NB NA
(10.10)
If you have multiple trials from which to calculate the standard deviation of E(treatment), it would be good to compare that with the estimate from Equation (10.8).

10.2.13 Use the Mean or the Probability of an Extreme Value?

Figure 10.1 illustrates the CDF of the distribution of observation values from two treatments. Looking at the 50th percentile values, you can see that the mean of B (about 4 on
FIGURE 10.1 CDF of sample values from two populations.
the x-axis) is greater than the mean of the other, A (about 3). Since greater is desirable, Treatment B might be chosen. However, the left tail of the two distributions indicates that B is more likely to be below the threshold of disaster, a value of 2, as illustrated in this figure. Perhaps this indicates a profitability index, or a safety index. Treatment B has a 25% chance of being a failure, while A only has a 1% chance of being a failure. If the penalty for a failure is equivalent to a 5 on the x-axis, then the risk difference of 1.2 (= 0.25 × 5 − 0.01 × 5) is larger than the possible benefit of choosing B over A. Treatment B may be better on average, but if Treatment A is adequate, the security associated with that choice may be more important than the benefit. Consider this example: The thrill of cliff-jumping into the water far exceeds the thrill of wading into the water from the shore. But for most people, the thrill is not worth the possible risk. Often, we should be using data to reveal the possibility of an undesired outcome, not just comparing averages to assess treatments.

The undesirability, or penalty, might not be a fixed value, but might increase with deviation into the undesired region. As well, one might look at the glorious possibility of getting a 10 from Treatment B when the best that Treatment A will provide is a 4. A gambler might choose B over A.

This can be subjectively quantified. You need a model of the distributions, and a sense of concern or joy about the x-values. Normally, risk is associated with a cost of an event. But there are many more undesirable aspects, associated with personal and corporate reputation; the concern can be qualitatively mapped to the x-value. Concern could be rated on a scale of 0 to 10, from no concern to the ultimate most undesirable set of outcomes that could happen. Similarly, joy over exceeding expectations can be qualitatively mapped w.r.t. the x-value. Joy could also be rated on a scale of 0 to 10, from indifference over the outcome to the ultimate most desirable set of outcomes that could happen. Joy should be equivalent to Concern. Figure 10.2 illustrates a possible mapping of Joy and Concern w.r.t. the x-value. Here, an x-value of 0 creates a concern level of 2. Equivalent to a concern value of 2 is a Joy value of 2, which corresponds to an x-value of about 7. Note: This illustrates both some level of Joy and some level of Concern at the x-value of 2. The shapes do not need to be continuum-valued, and they will be relative to a particular context. Perhaps get several people with a comprehensive view of the situation to collectively judge the trend.

The probability of getting an x-value within a particular range is the CDF difference. If within that x-range, the subjective estimates from Figure 10.2 can be used to get the
FIGURE 10.2 Level of Joy and Concern w.r.t. X.
corresponding Concern and Joy values; the product of probability times (Joy − Concern), summed over all of the Δx values, will represent the overall rating for each treatment.
$$E_T = \sum \left[\text{CDF}(x + \Delta x) - \text{CDF}(x)\right]\left[\text{Joy}\!\left(x + \frac{\Delta x}{2}\right) - \text{Concern}\!\left(x + \frac{\Delta x}{2}\right)\right] \tag{10.11}$$
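A sketch of Equation (10.11); the distributions and the Joy/Concern mappings below are hypothetical stand-ins for Figures 10.1 and 10.2, assuming NumPy and SciPy:

```python
import numpy as np
from scipy import stats

def rate_treatment(cdf, joy, concern, x_lo=-2.0, x_hi=12.0, dx=0.1):
    total = 0.0
    for x in np.arange(x_lo, x_hi, dx):
        prob = cdf(x + dx) - cdf(x)       # CDF difference for this slice
        mid = x + dx / 2.0
        total += prob * (joy(mid) - concern(mid))
    return total

def joy(x):                               # hypothetical 0-10 mapping
    return float(np.clip(2.0 * (x - 5.0), 0.0, 10.0))

def concern(x):                           # hypothetical 0-10 mapping
    return float(np.clip(2.0 * (2.0 - x), 0.0, 10.0))

for name, dist in (("A", stats.norm(3, 0.5)), ("B", stats.norm(4, 1.5))):
    print(name, round(rate_treatment(dist.cdf, joy, concern), 3))
# The probability-weighted balance of Joy and Concern rates each treatment.
```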
10.2.14 Correlation vs Causation

Just because there is a correlation does not mean there is a cause-and-effect relation. A relatable example is that gray hair on humans is strongly correlated to face wrinkles. This correlation does not mean that gray hair causes wrinkles. There is a strong correlation between sunrise and people awakening, but this does not mean that people awakening causes the sun to rise. Often, however, the presence of a correlation is used to defend a flawed mechanistic viewpoint, to defend an illogical claim, or to justify action that is driven by a hidden agenda (self-serving, or politically or emotionally based). A 99% confidence in the correlation does not mean that a faulty supposed mechanism is confidently true.

In this big-data machine-learning era, we have the computing tools to analyze relations and trends in vast quantities of data. I think that this is very useful in detecting the possibility of fraud or spam, and in suggesting preferences: "If you like that music, you might also like these suggestions." However, these findings are correlations between variables, and although they might help discover cause-and-effect mechanisms, they do not represent the mechanism. And often they discover two effects of the same cause. Flowers blooming do not make it become a warm season.

If there is a mechanistic cause-and-effect relation, if there is a phenomenological pathway between one event and the outcome, then you should be able to describe each relation in the sequence and test each internal variable to see if the relation is true. You can measure each internal variable, and also block it to see if the outcome does not happen. For example: Turning the key in the ignition does not start the car, although there might be a 99.9% correlation that it does. Turning the key closes an electrical circuit that energizes the starter motor, engages it with the engine, and closes the choke. It requires gasoline in readiness, energy in the battery, mechanical engagement of the starter motor gears, and many other factors. If you are going to claim causation, be able to defend each cause-and-effect mechanism in the sequence. There may be very confident statistical evidence for a correlation. Don't use it to claim causation. The statistical hypothesis is not the mechanism.
10.2.15 Intuition

Balance cost, time, certainty, and perfection with your understanding of sufficiency. The greater the number of trials you perform, the more certain you can be about the outcome distribution and a decision. The more rigor you use in modeling, the more confident you can be in the model. But for any analysis, there is a diminishing return of the benefit with respect to the amount of time and effort invested. Use your intuition to make such a decision, but also validate your decision with the viewpoint of your customers/stakeholders. Intuitive choices could include: Should you linearize or not? What is the appropriate power series order, network complexity, etc.? What number of trials is appropriate? What level of significance is appropriate? Are variances uniform over the range?

10.2.16 A Possible Method to Determine Values of α, β, and N

As a possible approach to quantifying values of α, β, and N for testing a hypothesis, consider this situation: You are currently using Treatment X, and want to know if Treatment Y could be an alternate. Perhaps X is a raw material, from a single supplier, and you want to qualify Y for security of supply, for leveraging of price, etc. You have decided on a key metric to evaluate the performance of X and Y, which may be a composite of several critical aspects; and from extensive operating experience you know values for μX and σX for this key metric, and that the variation in the metric is normally distributed.

You would accept Y if μY ≥ μX. Trials with Y, of course, will reveal Ȳ, not μY. So, if Ȳ indicates that μY could be ≥ μX, accept Y. Assume that σY = σX; then if μY ≥ μX, the lower limit on Ȳ would be calculated as Ȳcrit = μX − zα σX/√N, where N is the number of trials. Accept Y if Ȳ ≥ Ȳcrit = μX − zα σX/√N. You need values for two factors here, α and N. In Excel, Ȳcrit = NORM.INV(α, μX, σX/√N).

Another issue is that you would like to reject Y if μY ≤ μX − Δ, where you choose the value of Δ. However, if μY = μX − Δ, because of experimental vagaries there is a chance that Ȳ ≥ Ȳcrit. You would like enough trials so that there is a small chance of this happening. β is this chance, the probability that Ȳ ≥ Ȳcrit given that μY = μX − Δ. You need values for two factors here, β and Δ. You choose Δ. In Excel, β = 1 − NORM.DIST(Ȳcrit, μX − Δ, σX/√N, 1).

Increasing the number of trials makes the uncertainty on Ȳ smaller, permitting desirably smaller values for α (the chance of rejecting Y if Y is good) and β (the chance of accepting Y if Y is bad). But large N creates a concern associated with the cost and duration of the trials. The method offered here seeks to determine values of α, β, and N that minimize all the concerns. Table 10.1 reveals possible issues and associated level of concern (LoC).

Each level of concern stated in Table 10.1 is a qualitative number. It is similar to ratings that humans give to products, movies, restaurants, vacations, bosses, employees, and the attractiveness of each other. It represents how important all the diverse aspects are. It is qualitative, and it would change with the context. The values in the table are not a universal truth. These values represent the authors' general experience. Use values relevant to your application.

The concern for an extensive number of trials also needs to be included. In general, impatience, and undesirability, rises as the square of the magnitude of something.
TABLE 10.1 Outcomes and Level of Concern

Truth: μY ≥ μX
• Action: Reject Y — Wrong action. Concerns include you still need to find an alternate to X, wasted trials, and you missed a good thing. On a 0–10 basis the level of concern, c1, might be a 4. The probability of this happening is α.
• Action: Accept Y — Correct action. You are happy to have an alternate but need to still do the work to manage the change of implementing Y in production, and there is always the nagging possibility that something unforeseen will cause an upset. On a 0–10 basis the level of concern, c2, might be a 1. The probability of this happening is (1 − α).

Truth: μY ≤ μX − Δ
• Action: Reject Y — Correct action. However, you still do not have an alternate, and have invested in trials. Finding and testing a new alternate is in the future. On a 0–10 basis the level of concern, c3, might be a 3. The probability of this happening is (1 − β).
• Action: Accept Y — Wrong action. You have accepted a bad treatment. Certainly, it will be revealed after substantial use, and everyone associated will be embarrassed. Trial expense was wasted, Y will be rejected, and you still need to find an alternate to X. It will show up on your annual appraisal. On a 0–10 basis the level of concern, c4, might be an 8. The probability of this happening is β.
And assigning the concern of 20 trials as equivalent to a level-of-concern value of 3, the level of concern of N trials would be 3(N/20)². In total, the sum of all the concern factors would be 4α + 1(1 − α) + 3(1 − β) + 8β + 3(N/20)². Expressing it generically, with c symbols representing the coefficients:

$$C_{\text{total}} = c_1\alpha + c_2(1-\alpha) + c_3(1-\beta) + c_4\beta + c_5\left(\frac{N}{c_6}\right)^2 \tag{10.12}$$
The objective is to minimize this term. It appears that the decision variables are α, β, and N, but since β depends on α, σ, Δ, N, and μX, there are really only two adjustable parameters, α and N. The optimization statement is

$$\min_{\{\alpha,\,N\}}\; J = c_1\alpha + c_2(1-\alpha) + c_3(1-\beta) + c_4\beta + c_5\left(\frac{N}{c_6}\right)^2 \tag{10.13}$$
S.T.:
• User-specified LoC values for each aspect
• μX, σX, and the normal distribution, from historical data
• Δ as chosen by the user, to represent a deviation that should be detected
• Ȳcrit = NORM.INV(α, μX, σX/√N)
• β = 1 − NORM.DIST(Ȳcrit, μX − Δ, σX/√N, 1)

From the LoC factors in Table 10.1, the solution is α = 0.13, N = 7, and β = 0.065. If, as an example, the concern for excessive trials was much lower, with N = 20 = c6 trials equivalent to a c5 of 1, the solution is α = 0.066, N = 11, and β = 0.035. Finally, as another example,
wherein the concerns for not making a mistake dominate the concern for excessive trials, the solution is α = 0.004, N = 35, and β = 0.0006.

The procedure to determine appropriate α, β, and N values is somewhat complex. It requires the user to provide ratings for the six LoC factors; and the mixed integer–continuum optimization may not be within the skillset of the person designing the experiment. But satisfyingly, the procedure returns conventional values (with conventional economic LoC values), and it extrapolates to traditional levels of significance where life-related concerns dominate.

10.2.17 The Hypothesis is not the Supposition

The statistical hypothesis is not the supposed cause-and-effect mechanism. The hypothesis and associated attribute are expected observable outcomes if the mechanism is as supposed. It is what you are considering as a data comparison to test a supposed mechanism.

Maybe a person wants to claim that Supplier A of a component in your product is the reason that customers return your product as being defective, and that person wants to claim that Supplier B is better for your company. So, they count returns and find that 5,317 have the A component and only 684 have the B component. Uncertainty associated with the count is about 10%, implying that σA = 0.1 × 5317/2.5 = 141 and σB = 0.1 × 684/2.5 = 27. A test on the numbers will clearly reject the hypothesis that A = B. However, if 97,689 products were made with the component from A, and 13,472 products were made with a component from B, then a more reasonable test would be on the portion of products returned. Here the portions of returned products are pA = 5317/97689 = 0.05443 and pB = 684/13472 = 0.05077, and the equivalence hypothesis cannot be rejected. Be sure that the choice of the hypothesis and the attribute (statistic) being tested is a legitimate test of the claim about a mechanism.

10.2.18 Seek to Reject, Not to Support

I hope that you enjoy the fallacious claim in this section, and its message. "The Theory of Positional Invariance" was an article (R. Russell Rhinehart, Develop Your Potential Series in CONTROL magazine, Vol. 33, No. 11, November 2020, p. 41) and is reproduced here with kind permission of the editor.

The "Theory of Positional Invariance" states that regardless of the observer's viewpoint the object retains its properties. There are many examples: Whether observed from the north or south poles, the moon has the same mass and craters and rotational speed. Although the moon does appear upside down to one observer, the viewer orientation does not change the properties of the object. Whether you look at a person from the top or back or front, it is still that person with the same color eyes and personality. I asked my grandchildren, "What's my name?" Their first thought was, "Oh, no! It is happening to him." Then they said very tentatively, "Pop." "Good," I said, then turned around and asked, "Now, what is my name?" One said, "It's still Pop." The other called to their grandmother, "DeeGee, something's wrong. We need help in here."

A theory starts with corroborating observations, acquires the rule, then a sophisticated-sounding name to help validate it. Regardless of the observer's viewpoint, an object retains its properties: "Positional Invariance". Applying the principle, observe that except for the 45° rotation, the × and the + symbols are the same; so the theory claims that 2 + 2 = 2 × 2. There you are! Let's try with some other numbers:
3 + 1.5 = 3 × 1.5, and (−4) + 0.8 = (−4) × 0.8, and 1 + 2 + 3 = 1 × 2 × 3. But what about complex numbers!? Here are some demonstrations: [(−1) + 2i] + [0.75 − 0.25i] = [(−1) + 2i] × [0.75 − 0.25i], and [(−1) + √2 i] + [2/3 − (√2/6) i] = [(−1) + √2 i] × [2/3 − (√2/6) i]. There are an infinite number of corroborating examples. I chose these examples to show that it works with negative numbers, fractions, and irrational numbers, but I kept the numbers convenient for your affirmation of the truth of the Theory of Positional Invariance. The theory is intuitively logical, has a sophisticated name, and is confirmed by data with an infinite number of cases. So, the claim must be true. I use this truth to support my claim that we should not be wasting time and mental effort by having students memorize both addition and multiplication facts. Addition is all that is needed. (One can have fun?)

Just because there is some corroborating evidence and some intuitive basis for a fancily packaged claim does not mean that the claim is true. Don't blindly accept either the technical folklore of your community or your preferred explanation. Don't seek evidence to support the claim. Seek evidence that could refute it. Data cannot prove. Data can only disprove. So, critically shape trials and examples to see if you can disprove the claim.
10.3 Takeaway

The fundamentals of statistics, the equations and mathematical science underlying the techniques, as presented in Chapters 1 through 9, are somewhat important. But the choices you make are even more important if the analysis and decision are to be rational and legitimate. Ensure the basis of the test (what quantity is tested, what level of significance is used, what statistic is used) matches the situation. Understand how to explain the "Accept–Reject" dichotomy, p-values, confidence/significance, and the relation of the statistical method to the decision or action. Differentiate causation and correlation.

"There are two reasons for everything: A good reason, and the real reason." That is one version of a century-old adage. "Statistics don't lie, but liars use statistics." That is one version of another century-old adage. Don't give your audience a reason to accuse you of either age-old fraud. Be transparent with your choices and conclusions, their limits, and their influence on decisions.
10.4 Exercises
1. Practice inventing seemingly proper and justifiable choices for statistical analysis to lead to improper outcomes (not to become proficient at deception, but to more easily detect it). Choose any of the topics presented in this chapter.
2. Listen to what is said or written in the media and critically evaluate how the claim might be false, even if couched as true.
3. Consider a situation that you are involved in and assign equal levels of concern and joy for possible outcomes.
4. Create an exercise that could be an example to show how Equation (10.13) works.
Section 3
Applications of Probability and Statistical Fundamentals
11 Risk
11.1 Introduction

Risk is a long-term average of the undesirable effect of possible events. Related to engineering business decisions based on economics, the concept of risk can be defined as the expected frequency of an undesired event times the cost consequence of the event:

$$R = fc \tag{11.1}$$
Example 11.1: The expected frequency is 0.01 events per year (once in a hundred years), and the cost associated with the event is $10,000 per event. What is the risk? Using Equation (11.1)
$$R = 0.01 \left[\frac{\text{E}}{\text{yr}}\right] \times 10000 \left[\frac{\$}{\text{E}}\right] = 100 \left[\frac{\$}{\text{yr}}\right]$$
This $100/yr would be considered as an average annual cost associated with the enterprise and treated like an expense in the economic profitability analysis. For instance, if the expected life of the enterprise is 20 years, then, primitively, the cumulative risk is 100 [$/yr] × 20 [yr] = 2,000 [$].
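The same arithmetic in code form, with the Example 11.1 values:

```python
f = 0.01          # expected frequency, events/yr (once in a hundred years)
c = 10_000.0      # cost consequence, $/event
R = f * c         # Equation (11.1): risk, $/yr
print(R, R * 20)  # 100.0 $/yr, and 2000.0 $ over a 20-year life
```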
Risk is not what will happen. One cannot forecast the future. Also, risk is not the worst that could happen. Risk is the expectation, the average possibility, over many years or uses.

Risk can be used in design considerations. If the risk is $2,000 over a 20-year horizon, and the one-time design cost to prevent the event is $1,000, then the $1,000 investment up front makes sense over the life of the enterprise (in a simple pay-back-time analysis not considering the time-value of money). Perhaps an alternate option for prevention of the event is training, which is estimated to cost $500/yr; then over the 20-year period, the prevention costs more than the consequence, and investment in that method of prevention would not be a good business decision.

Risk, with the units of $/yr, can be considered as a possible annual expense and included in economic profitability indices such as pay-back time (PBT), long-term return on assets (LTROA), net present value (NPV), discounted cash flow rate of return (DCFRR), or whatever metric your organization prefers. In doing so, you will gain an appreciation for the impact of the risk on the viability of the venture and insight on solutions to reduce the risk.
Alternately, risk can be considered as the probable cost associated with an undesired event over a period. It is the probability of an event happening within some interval times the financial magnitude of the penalty.

$$\text{Risk} = P(\text{event in an interval}) \times \text{Cost}(\text{of the event}) \tag{11.2}$$
Example 11.2: Past data indicate that in a normal year, the probability of a substantial leak from any pump seal is 0.0001, the probability of it being considered an environmental transgression is 0.2, and the fine for a "spill" plus the cost of land reclamation plus the value of the lost material all sum to $50,000. What is the risk?

Note that incurring a cost requires both that there is a leak and that it is caught. The AND conjunction directs that the probability of the joint event is the product of the individual probabilities. Using Equation (11.2):
$$\text{Risk} = 0.0001 \times 0.2 \times \$50{,}000 = \$1$$
In the example, the probability is on a per-year basis. Often risk is defined by replacing frequency in Equation (11.1) with probability. The probability of 1 event occurring in 100 years is 0.01 events per year. The period does not have to be a year; the frequency could be on a per-day basis. Alternately, the probability could be on a per-event, per-state, or per-item basis. For example: 1 bag in 50 will have a leak, 1 sales contact per 20 will lead to a sale, 1 in 300 times that a light is switched ON it will pop and fail. If using frequency as the estimate of the likelihood of an event, then the number of undesired events can easily be scaled to an alternate time period. For example, production of 100 units per year, with an expected defect rate of 1 per 50 units, means an expected frequency of 2 defects per year.

There are alternate definitions of risk, and diverse preferences for nonbusiness applications or catastrophic situations such as in healthcare, national security, or personal finance. Some alternate definitions include: Risk = Likelihood of an event times Cost, Risk = Hazard times Vulnerability, or Risk = Danger divided by Resilience. This chapter will only use Equations (11.1) and (11.2) and their variations.

Note several features in the simple leaking pump seal example, Example 11.2. First, the cost of the event is not just from one aspect of the event; it must consider all associated costs. Here there were three. However, it is easy to imagine that many more aspects should be considered. Second, the probability was for one pump in one year. Likely, there will be several pumps, increasing the probability that more than one will fail. Also, the probability for the pump seal to fail was for a one-year interval, but likely the plant will have a longer expected operating life. So, there is a higher chance of a failure over the plant lifetime. Third, if there was a prior event, then the second environmental event will have a higher penalty. Fourth, the fine only happens if you are caught. There is a probability of the leak, and a probability of being caught. The fine is incurred if you have a leak and get caught. Since P(A and B) = P(A) P(B|A), the 0.0001 and 0.2 are multiplied to get the compound probability of the event incurring a cost. Fifth, any forecast of the probability of events over a multi-year future and the associated financial impact of such an event will have substantial uncertainty. Finally, this presented an economic valuation of an undesired event, but the violation could concern either ethics or things we should do. Ethics would direct that you do not let
a leak happen. Good manners say that you should respond when someone acknowledges you. If you can evaluate the equivalent economic value of the penalty for such events, then you could include that in the risk calculation.
11.2 Estimating the Financial Penalty

The event must be explicitly identified. It cannot be a vague statement like "something bad might happen", nor can it be a nonspecific statement, like the undesired event is "a pipeline leak". It needs to be identifiable and mechanistically specific enough to be able to assess both the likelihood of occurrence and the costs of the consequence. Expanding on the pipeline leak example, specific statements might be: "a pipeline leak of 1 gal per hour of crude oil lasting for 5 days because a welded seam fails", "a pipeline leak of 100 gal of potable water because of an illegal tap into the line to access material", "a pipeline leak of 50 gal of gasoline because of a pump seal failure". Each of those cases has its individual probability and individual cost of consequences. Of course, risk analysis extends well beyond pipelines and includes legal terms in a contract, driving a car, rules in raising children, operation of a chemical facility during a hurricane, hiking a woodland trail, etc. But, in any case, the event needs to be explicitly and specifically defined.

Three sources of cost associated with the toxic material spill were included in Example 11.2, but there are many more. For instance, if the event happens, then there will be adverse company image issues in the public domain. This may lead to a loss of sales to sensitive customers or to backlash protests by environmentalists that cause insurance premiums on the facility to rise. An event will likely lead to a change in maintenance procedures on pumps, which will have associated costs with increased inspections and the management of change associated with revised training, scheduling, and document updating. When you are considering the cost of an undesirable event, be sure to consider all aspects.

Continuing discussion of the financial impact: Also consider the personal impact of those in the company that get blamed for the event. The corporate folks will blame the Plant Manager, who will blame the Maintenance and Operations Supervisors. The event will show up on their annual appraisals and will impact their hopes for promotion, or salary increase, or profit sharing. These managers probably also have families, with schoolchildren who will be harassed by classmates because their parent was responsible for polluting their environment. (A good corporate citizen of the local community should add value, not do harm.) The perceived personal impact to these folks might be much larger than the $50,000 cost to the company. How much more will it cost the company in donations to the local schools, library, and community centers to restore community approval of the employees?

The total financial impact of the undesired event is context-dependent. If, for instance, there have been several recent undesired events associated with the company, then the magnitude of the fine may be larger, the loss of allegiance by the public larger, and the personal impact on employees larger. It is not easy to forecast all the consequences that might happen, or what the costs of each might be. There is substantial uncertainty on such an estimate. Often the cost associated with an undesired event will be a reasonable "guestimate". The penalty for an event is context-dependent. People need to be fully aware of the context to be able to estimate the cost.
11.3 Frequency or Probability?

A common formula for risk uses the probability of an event, Equation (11.2), R = pc, which is equivalent to Equation (11.1) for low-probability events. Here is why: The Poisson distribution gives the probability of n events occurring in an interval (of space or time) in which the average expected is λ events per time/space unit considered. (This Poisson model also presumes independence of the events, and stationary λ.) The Poisson point probability is p(n) = λⁿe^(−λ)/n!. If λ ≪ 1, then p(1) ≈ λ and p(n > 1) ≈ 0. So, for infrequently expected events, the probability of a single occurrence in the time interval is effectively the same as the expected frequency, which makes the f and p interchangeable in the risk formula.

Frequency could have a value of 2 events per year. However, if frequency in Equation (11.1) is replaced with probability, the value of 2 is not possible. Probability of an event can only range from 0 to 1. Instead of frequency, one could use the probable number of events in the period, the expected number, not the probability of an event. Alternately, one could use the expected long-term average, λ, and the Poisson distribution to give the probability of n = 1, n = 2, n = 3, etc. number of events. This is developed in the next section.

11.3.1 Estimating Event Probability – Independent Events

In the introductory Example 11.2, note that the probability of 0.0001 is for one particular pump failing, and failing in a one-year period. Such a value must come from past experience with similar pumps, similar service, and under similar maintenance programs. It also has substantial uncertainty. Further, the plant will likely have more than one pump in the do-not-let-this-leak service, and likely plans to operate for more than one year. So, we need to use probability to project what might happen with k number of pumps over an n-year horizon.

The Poisson distribution models the number of events, x, expected in an interval if the average event rate per interval, λ, is known and events are independent.

$$P(x) = \frac{\lambda^x e^{-\lambda}}{x!} \tag{11.3}$$
This reveals that the probability of one failure of the one pump in a given year is
$$P(1) = \frac{\lambda^1 e^{-\lambda}}{1!} = \frac{0.0001\,e^{-0.0001}}{1} = 0.000099990\ldots \tag{11.4}$$
which is a bit less than (but nearly equal to) the λ value. There is also a chance that the same pump might fail twice in the year. That probability is
P(2) = λ²e^(−λ)/2! ≅ 5 × 10⁻⁹    (11.5)
If a pump fails a second time, then what is the second financial penalty? Probably larger than the first. This calculation presumes independence of the second event. If maintenance procedures create a common cause for failure, then, given a first event, the second might have a higher probability. Alternately, if maintenance on the failed item replaces worn-out
parts with new parts, then a second failure might be much less likely. The Poisson model's assumption that events are independent may not be true. Also, in an n-year consideration there is the probability of a pump failing in the first year, or the second, or the third, …. For simple analysis, consider that the probability of a second failure in the same year, P(2) ≅ 5 × 10⁻⁹, is negligible, and use the p = 0.0001 value. If there are k pumps and an n-year operation, then there are nk total possibilities for a failure. The binomial distribution reveals the probability of x events when there are nk trials and the probability of an event is p.
P(x|nk) = [(nk)!/(x!(nk − x)!)] p^x (1 − p)^(nk−x)    (11.6)
So, if there are k = 10 pumps in the critical process lines and an n = 10 year consideration, then nk = 100, and from Equation (11.6) with p = 0.0001:

P(0|100) = 0.99004933…
P(1|100) ≅ 0.01
P(2|100) ≅ 5 × 10⁻⁵    (11.7)
P(3|100) ≅ 1.6 × 10⁻⁷

The probability of one event is effectively nk = 100 times that of a single expected event per year per pump. And the probability of a second event is about 10,000 times greater than that from a single pump in a single year.

A more proper analysis would be to use the Poisson distribution. If the failure rate for one pump in one year is λ = 0.0001, then the expected rate for ten pumps in a ten-year interval is nkλ = 10 × 10 × 0.0001 = 0.01, and the probabilities for the first few failure counts are

P(0|100 pump-years) = λ⁰e^(−λ)/0! = 0.9900498…
P(1|100) = λ¹e^(−λ)/1! = 0.00990049…    (11.8)
P(2|100) = λ²e^(−λ)/2! = 0.000049502…
P(3|100) ≅ 1.65008 × 10⁻⁷

Note the equivalence to the binomial approximation of Equation (11.7) when the probability of a second pump failure is small.

After these considerations on both the financial penalty for an event and the probability of an event, return to Equation (11.2) with a 10-year estimate: 1) the event probability is 0.01, 2) the probability of it being considered an environmental transgression is 0.2, and 3) all aspects of the financial penalty have been re-estimated as $500,000. Then the risk is Risk = 0.01 × 0.2 × $500,000 = $1,000, which may be large enough to justify taking action to prevent the event, or to contain the material in case there is an event; or it still may be small enough to accept the possible future penalty. Consider: We go outside even though there is the possibility of a mosquito bite.
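These binomial and Poisson values are easy to verify numerically. The following is a minimal sketch (not from the book), assuming Python with SciPy is available:

```python
# Compare the binomial, Equation (11.6), and Poisson, Equation (11.3),
# probabilities for k = 10 pumps over n = 10 years (100 pump-years).
from scipy.stats import binom, poisson

p = 0.0001        # probability that one pump fails in one year
nk = 10 * 10      # pumps times years
lam = nk * p      # Poisson rate for the 100 pump-years

for x in range(4):
    print(f"x = {x}: binomial {binom.pmf(x, nk, p):.6g},",
          f"Poisson {poisson.pmf(x, lam):.6g}")
# Both give P(0) ≈ 0.9900, P(1) ≈ 0.0099, P(2) ≈ 5e-5, P(3) ≈ 1.6e-7,
# matching Equations (11.7) and (11.8).
```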
The probable number of events is often termed the expected frequency. For a continuously operating process, it is the probable number of events per specified time interval and per unit of the item. For instance, considering the "per unit" aspect, a pipeline with two welded seams will likely have fewer seam failures than another pipeline with 5,000 welded seams operating for the same duration. And, considering the time interval, the probability of a car accident is low in any one-minute interval but may be high when considering a 60-year driving period for one person; or, again on a per-item basis, a fleet of 5,000 delivery vans in a one-minute interval will have a higher expected number of undesired events than a single van in a one-minute interval. Values for such data come from past experience with similar processes, equipment, and operating conditions, but past data need to be adjusted for the forecast situation being considered.

If risk is calculated with probability, there is the probability of one event, the probability of two events, of three events, etc. in any use interval. If there are three events, then the total penalty is the sum of that for the first, the second, and the third. So, the equation needs to be modified to sum the risk associated with all event counts:

R = p(1)c₁ + p(2)c₂ + p(3)c₃ + ⋯    (11.9a)

R = Σ_{n=1}^{∞} cₙ p(n)    (11.9b)
Here the costs are explicitly shown as different values. The cost consequence of a second event in a period would likely be greater than that of the first, and that of a third may be greater yet. Consider speeding tickets: The first penalty may just be a fine, but the third arrest in a year may lead to license suspension. Consider insults in public: The first might be dismissed as poor judgment by the other, but the third is evidence of a hostile agenda and needs responsive action. In Equation (11.9a) the term p(3) means the probability of only the third event, not the cumulative probability of the first and the second and the third. To have the third, the first two must have happened.

A problem with Equation (11.1) is that the evaluation of risk scales linearly with the number of events. In it, a frequency of 10, an expected occurrence of 10 undesired events in a year, is 10 times more costly than a single event. Consider: A customer may forgive a single transgression of product or service expectations, but 10 transgressions in a year will create a reputation that could lead to losing many customers. So, the penalty should increase for repeated transgressions, and Equation (11.9) permits that.

Example 11.3: What is the risk over a 10-year period, which is the forecast life of a product? The frequency of an event is 0.001 per year per unit. There are 5 units in use. The cost of the first event is $20,000, of the second event $50,000, and of the third event $150,000. If there are 4 events, the plant is shut down. We'll assume that all events are independent and are modeled by the Poisson distribution.

With 5 units and a frequency of 0.001 events per year per unit, λ = 5 × 0.001 × 10 = 0.05 events per 10-year horizon. The probability of a 1st, 2nd, 3rd, … event can be obtained from the Excel cell function POISSON.DIST(x, λ, 0) and is
P(1) = λ¹e^(−λ)/1! = 0.047561…
P(2) = λ²e^(−λ)/2! = 0.001189…

P(3) = λ³e^(−λ)/3! = 1.9817… × 10⁻⁵

P(4) = λ⁴e^(−λ)/4! = 2.477… × 10⁻⁷
From Equation (11.9) and using rounded probability values,
R ≅ 0.0476 × ($20k) + 0.00119 × ($50k) + 1.9817 × 10⁻⁵ × ($150k) ≅ $1,015
The chance of a complete shutdown of operations would be the chance of more than 3 events, about 2.5 × 10⁻⁷, less than 1 in a million.
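A minimal sketch of the Example 11.3 arithmetic, using SciPy's Poisson pmf as the analog of the Excel POISSON.DIST cell function (the use of SciPy is an assumption of this sketch; any Poisson routine would do):

```python
# Risk over a 10-year horizon per Equation (11.9), Example 11.3 values.
from scipy.stats import poisson

lam = 5 * 0.001 * 10                 # 5 units x 0.001 events/yr/unit x 10 yr
costs = {1: 20_000, 2: 50_000, 3: 150_000}

risk = sum(poisson.pmf(n, lam) * c for n, c in costs.items())
print(f"risk: ${risk:,.0f}")                       # ~ $1,014 ($1,015 with rounded p values)
print(f"P(more than 3 events): {poisson.sf(3, lam):.2g}")   # ~ 2.5e-7
```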
11.3.2 Estimating Event Probability – Common Cause Events
Often events are not independent but are influenced by a common cause. As a common cause example, a fleet of cars may be composed of the same make and model, and each may have the same driver's blind spot. Then, if that attribute causes one driver to have an accident, the same attribute will likely cause many to have a similar accident. If one pipeline weld fails because of incompatibility of material, the welding practices used, or pressure shocks in the line during operation, then many welds will likely fail by the same common cause mechanism. If one employee quits because of the style of a boss, it is likely that others will also quit.

Values for event frequency or probability can be obtained from human experience, but they should be adjusted for the specific application, which requires subjective judgment. And mechanisms for compounding probability for common cause mechanisms also require human judgment. Again, although the equation seems simple, the evaluation of the probability of the event is subjective.

Equation (11.2) is simple and often good enough, but it is not applicable if the frequency of the event is greater than about 0.05 events per interval, nor does it admit that a second or third event would have a larger penalty. Equation (11.9) permits a greater penalty for repeated instances, and conditional probability techniques can be used to estimate the likelihood of a second, third, etc. event due to common cause mechanisms. However, it is a step more complicated to use. Again, however, either the Poisson or the binomial model is for independent events, and common cause may make the second much more probable if there is a first. Further, over an n-year horizon there may be an interval with a lapse in maintenance increasing the probability of an event, or, because of external events, an increase in attention that reduces the probability. Again, there is much uncertainty, and human judgment is needed to determine appropriate probability models and to temper results.

11.3.3 Intermittent or Continuous Use
The use may not be continuous, but discrete, or intermittent. Staplers only run out of staples when they are used. Baseball bats only break when they are used. In such cases of intermittent use, the probability of an undesired event is not on a per-time basis but on a
per instance-of-use basis. The per instance-of-use basis can be converted to a per-year basis (or any time interval) by the number of expected uses per year.

11.3.4 Catastrophic Events
An event may have a 1 in 1,000 (relatively low) chance, but it may have a $1,000,000 (very high) penalty. Normalized on a per-year risk basis it is $1,000 per year, which may have little impact on the economic analysis. If, however, the business after-tax profit is $500,000 per year, a one-time $1,000,000 event would wholly consume two years of profit and could bankrupt the enterprise. So the risk of Equation (11.1) or (11.9), representing an expected average loss as a business expense, does not wholly represent the reality of a catastrophic loss.

There are also events that cannot be valued. A value cannot always be placed on the cost consequence of the event. Consider 1) How much is a life worth? Would you simply assign the value as the person's earning potential over their expected remaining life? How does one place an equivalent monetary value on grief to loved ones, impact on children, etc.? Consider 2) How much is a reputation worth? If your enterprise goes bankrupt, is the loss only the value of equity in the enterprise? What are the consequences to personal initiative, future credibility of the leader, loss of friends? Consider 3) How much is unethical practice worth? Some people might say that it is unethical to permit an event with a risk value of $100/yr just because prevention costs $500/yr and you think it is a poor business decision to lose $400/yr. Their opinion for you might be, "No undesired events should be permitted." Consider 4) environmental damage and its impact on endangered species.

In some situations, the basic equation does not have meaning, because either 1) emotional or ethical issues are controlling, and a cost cannot be assigned to the consequence of the event, or 2) the event has one-time catastrophic implications that cannot be normalized on a per-year basis. For high-consequence events, where even one occurrence is catastrophic, traditional risk analysis is not an acceptable tool. Equations (11.1) through (11.9) represent a value-to-value analysis of cost and possible savings for a traditional business economic analysis. For some events, however, there may be no acceptable risk probability, no acceptable frequency. When human life is at risk, a common practice is to make the next generation of facilities or cars have half the probability of causing a death as the current best state of the industry.
p_new = (1/2) p_state-of-the-art    (11.10)
Acknowledging the issues related to the risk equations and catastrophic events, in a LinkedIn discussion Steve Cutchen states, "Risk is a function of the combination of probability and consequence, which may or may not be strictly mathematical, and is impacted by various outside factors." I'll add that the valuation of probability (or frequency) and of consequence has considerable uncertainty, so any assessment of risk is somewhat uncertain, requiring substantial human judgment.
11.4 Estimating the Penalty from Multiple Possible Events
The above analysis considered only one event, such as a pump seal leak. But there are many possible ways that material can spill – from a leaky valve, from a corrosion-hole in
a tank, from a tow-motor hitting and breaking a line, from tank filling, from dropping a barrel during transport, etc. And there are many, many possible undesirable events – a fire, an injury, an overpressure gas release, a squirrel shorting out the transformer, etc. Determine each risk as above. If all the events are independent, then the combined risk is the sum of all risks.
11.5 Using Risk in Comparing Treatments
Risk can be reduced by prevention, such as greater maintenance attention, which would reduce the probability of an event. Greater maintenance attention would be a different treatment, and it would have its own cost. There are many types of risk, and all have preventive approaches. Flu vaccination is a treatment to prevent (reduce the probability of) getting the disease. Proper exercise and diet are a treatment to reduce the probability of ill-health effects on the body. Training with regard to safety, compliance, ethics, etc. reduces the probability of undesirable events.

Alternately, risk can be reduced by containment. Here one might permit the event to happen but provide some sort of treatment to prevent it from incurring a penalty. Perhaps place catch trays under a seal, or place pumps, pipes, valves, and tanks in a catch basin. If the risk is a fire, have a fire-extinguishing system ready. If the risk is embezzlement, have an independent auditing system in place to catch it at its beginning and eliminate the player. If the risk is associated with motorcycle accidents on a highway, impose a helmet law to reduce the severity of an accident.

Risk can also be reduced by alternate designs. Instead of conventional direct-drive pumps that need a seal, use magnetically coupled pumps. Instead of using a raw material or making an intermediate product that is toxic, use an alternative approach to making the final product. Relocate manufacturing to a place that does not penalize the possible spill. The solution might be building a financial reserve to handle the costs if they occur, or influencing legislation to deflect accountability, or getting insurance, or having contracts with stand-by assistance sources. Each of those solutions to a problem is a treatment. Usually, the alternate treatment has an economic cost associated with it. When looking at alternate treatments, include risk.

Analysis of risk does not provide a solution as to how to minimize risk, but it reveals mechanisms and quantifies aspects, and thereby permits designers to improve a design. The solution depends on the situation and the creativity of the designer.
11.6 Uncertainty
There is uncertainty associated with each term in the risk equation. The expected frequency will depend on equipment age, the operator's physical ability, the operator's experience, maintenance practices, the supplier of replacement parts, process operating intensity, etc. And the consequences have similar uncertainty associated with predicting the future situation and context. Both the frequency and the cost will change in time. The evaluator must assign a probable frequency and cost impact that represent the future for both. One can use
past data, but they must be translated to the new application and anticipated future context. These are necessarily subjectively assigned values. If a person can estimate the frequency and cost, then one should also be able to estimate the possible range of the frequency and cost estimates (or provide a subjective feel for the 95% confidence intervals). Then there are several ways to use that. One method would be to use the worst-case scenario for the risk: the maximum of both the estimated frequency and cost. This would be a conservative approach, assigning a worst-case situation, but it would not ground the impact in standard economic decision-making. Alternately, one could use a Monte Carlo approach to simulate 1,000 (or many) combinations of the product and use an average. The average, if the uncertainty on both variables is modeled as uniform and symmetric, is simply the product of the nominal values. It is not too difficult to show that the average of Equation (11.1) becomes:
R̄ = (1/N) Σ_{i=1}^{N} [f_nominal + f_range(r₁,ᵢ − 0.5)][c_nominal + c_range(r₂,ᵢ − 0.5)] = f_nominal c_nominal    (11.11)

Here f_range = f_max − f_min, and r₁,ᵢ is a uniformly distributed random number on the 0 to 1 interval. But if the uncertainty is not symmetric or not from a uniform distribution, the average will not be so trivially equivalent. If many Monte Carlo simulations are performed, one could use the average, the median, the 75th percentile worst case, or whatever. But, in any case, the estimate of the uncertainty and the distribution is subjective. One needs to be careful not to zealously overestimate the frequency or cost if seeking to minimize risk, or underestimate the values if seeking to not let risk impact a project or design decision. The uncertainty associated with choosing frequency or penalty values, or their ranges or distributions, leaves us with the situation in which a best "guestimate" is all that is rationally justified. An estimate of frequency and penalty values that is grounded in past experience must be adjusted to match changes in the new situation.
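As a minimal Monte Carlo sketch of Equation (11.11), with hypothetical nominal values and ranges (the specific numbers below are illustrative assumptions, not from the text):

```python
# Monte Carlo average of R = f*c with uniform, symmetric uncertainty on both.
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
f_nom, f_range = 0.01, 0.008        # hypothetical frequency: nominal and range
c_nom, c_range = 500_000, 200_000   # hypothetical cost: nominal and range

f = f_nom + f_range * (rng.random(N) - 0.5)   # uniform on f_nom +/- f_range/2
c = c_nom + c_range * (rng.random(N) - 0.5)
R = f * c

print(f"mean R = {R.mean():,.0f} (nominal product = {f_nom * c_nom:,.0f})")
print(f"median = {np.median(R):,.0f}, 75th percentile = {np.percentile(R, 75):,.0f}")
```

With symmetric uniform uncertainty the mean converges to the nominal product, as Equation (11.11) states, while the percentiles reveal the spread a worst-case analysis would use.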
11.7 Detectability
Sometimes risk is only a liability if the event is detected. For instance, if the product container is supposed to contain 30 lbs of material, but the filler machine occasionally places only 28 lbs in a sack, and a customer detects it, then the customer might return the product. At a minimum this will cost the supplier product handling, return shipping, and clerical time, and possibly future purchases. Similarly, your product is not supposed to contain a particular impurity that contaminates a customer's product; if the impurity is detected, you might have to buy all of the customer's contaminated product, etc. This could be very expensive. However, the customer may never realize the defect, and then there is no cost consequence to the supplier. So, Equations (11.1) and (11.9) could be modified to contain a detectability probability, d.
R = fcd (11.12)
R = p(1)c₁d₁ + p(2)c₂d₂ + ⋯    (11.13)
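A short sketch extending the Example 11.3 sum with detectability factors per Equation (11.13); the d values here are hypothetical, for illustration only:

```python
# Detectability-weighted risk, Equation (11.13); d values are hypothetical.
from scipy.stats import poisson

lam = 0.05
events = [(1, 20_000, 0.6), (2, 50_000, 0.8), (3, 150_000, 0.9)]  # (n, cost, d)

risk = sum(poisson.pmf(n, lam) * c * d for n, c, d in events)
print(f"detectability-weighted risk: ${risk:,.0f}")
```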
11.8 Achieving Zero Risk
Ideally, we prevent any undesired events from happening, but the reality is that undesired events cannot be avoided. The only way to not have an accident is to not do anything. Don't manufacture anything, because there might be a forklift accident. Don't prepare food to eat, because it might contain a microbe. Don't go for a walk, because you might get a mosquito bite. Don't shower, because you might slip and fall. But not doing anything adds no value to human aspirations. We only add value to life by doing; and in doing, one must balance the benefit of the activity against its undesired consequences. Zero risk is neither possible nor desirable. The objective is to balance the anticipated benefit against the potential downside.
11.9 Takeaway
The calculation of risk should be grounded in data on probability; but substantial interpretation is needed to estimate the financial penalty of the event, and also to propagate the single-event probability to parallel cases and multi-year considerations. Although risk could be presented as an exercise in simple probability rules, that would misdirect the reader: It is more engineering craft than mathematical science. A useful standard is AS/NZS ISO 31000:2009 (ISO Guide 73:2002). Risk is important: Consider it in design and in operational procedures.
11.10 Exercises
1. Person A is driving a rare antique car, and the probability of an accident is the same as that of Person B, who is driving a 7-year-old conventional car. Which has greater risk?
2. How would you assess the value of: Losing your wallet with $15 in it? Losing your car keys? Missing a day of work because of illness?
3. How would you assess jamming your finger and needing to wear a cast on your left hand? On your right hand? How would you value the risk for a professional athlete or a surgeon?
4. Derive Equation (11.9b) from (11.9a).
5. Prove Equation (11.11). Expand the sum, and argue that the average of the deviation terms is zero.
6. Use the guide in Section 11.3 to derive the equivalence of Equations (11.1) and (11.2).
7. Use Example 11.3 to determine the 99% error on the probable risk, based on the implied uncertainty of each of the four "givens" in the example statement.
8. Draw a timeline of 10 years and consider that there is an average rate of 0.3 events per year. Indicate when the independent and random events might occur on your timeline. Explain your locations.
12 Analysis of Variance
12.1 Introduction Analysis of variance (AOV or ANOVA) is a powerful technique for comparing the means of treatments. The test statistic is F. Analysis of variance is often used as a screening technique to determine whether there is any probable qualitative relationship between the treatments before additional effort and resources are spent in an attempt to develop a quantitative relationship. The terms factor and treatment are used interchangeably to refer to the independent variable. One advantage of ANOVA over other statistical methods to reveal impact is that the treatments do not have to be represented by continuum-valued numbers (those with numerical values representing physical quantities). Although ANOVA can use treatments with such numerical values, ANOVA can also test the effects of treatments that are category or class variables.
12.2 One-Way ANOVA
Consider the data array in Table 12.1. Every entry is subscripted by row and column. By convention, the columns represent treatments, and the rows represent replicate observations. The subscripts i and j refer to the rows and columns, respectively; then Yi,j denotes the value of the ith observation of the jth treatment. There are j = 1, 2, …, J treatments. The number of observations for each treatment does not have to be the same. The notation Ij represents the number of replicate observations in the jth treatment. The observations are replicates, meaning that samples in each column represent observations from a population expected to have the same distribution (mean and sigma, if the population is normal). A treatment can consist of levels of a single variable (e.g., different temperatures) or different types of a discrete variable (e.g., different types of closures, fasteners, respirators). In either case, only the treatment has different values or classifications. All other conditions between columns should be the same. In a one-way ANOVA, the number of rows represents the number of replicate observations per treatment. Although it is convenient for every treatment to have the same number of observations, that situation is not required.

There are a total of N = Σ_{j=1}^{J} Ij observations in the table. Be aware that the table uses both uppercase and lowercase I and J.
TABLE 12.1
Data Array for a One-Way ANOVA

Row#↓  Column#→   1        2        …    j        …    J
1                 Y1,1     Y1,2     …    Y1,j     …    Y1,J
2                 Y2,1     Y2,2     …    Y2,j     …    Y2,J
·                 ·        ·             ·             ·
i                 Yi,1     Yi,2     …    Yi,j     …    Yi,J
·                 ·        ·             ·             ·
                  YI1,1    YI2,2    …    YIj,j    …    YIJ,J
12.2.1 One-Way ANOVA Method
The assumption is that the treatments are all identical; then we expect that the variance associated with all observations should be identical. The supposition is that there is no difference between the means of the treatments. If true, the expected reveal in the data is that the variance, as calculated in any number of ways, should be the same, and the variance ratio will be unity, σ₁²/σ₂² = 1, or alternately, σ₁²/σ₂² − 1 = 0 = (σ₁² − σ₂²)/σ₂². The hypothesis is that F = (σ₁² − σ₂²)/σ₂² = 0. The statistic will be an F-ratio of variances, but not a ratio of the same construct as that of Chapter 6. The ANOVA F-ratio should be near to zero if the hypothesis is true, and the ratio will be large if not true. The rejection will be one-sided, only testing whether the F-statistic is too large.

If all treatments have the same mean and variance, if there is no difference in the effect of the treatments on the observation, then the variance measured in any number of ways will be the same. All N of the data can be used to calculate estimates of the mean and variance.

X̄ = (Σ_{j=1}^{J} Ij)⁻¹ Σ_{j=1}^{J} Σ_{i=1}^{Ij} Yi,j    (12.1)

s² = (Σ_{j=1}^{J} Ij − 1)⁻¹ Σ_{j=1}^{J} Σ_{i=1}^{Ij} (Yi,j − X̄)² = SSDTotal/(N − 1) = MST    (12.2)

The term SSDTotal means the total sum of squared deviations (for all data), SSDTotal = Σ_{j=1}^{J} Σ_{i=1}^{Ij} (Yi,j − X̄)², and MST means the mean-squared deviation, the average (per degree of freedom).
Alternately, we could calculate the mean and variance of each column, associated with each treatment.

X̄j = (1/Ij) Σ_{i=1}^{Ij} Yi,j    (12.3)

sj² = (1/(Ij − 1)) Σ_{i=1}^{Ij} (Yi,j − X̄j)² = SSDj/(Ij − 1)    (12.4)

Here SSDj means the sum of squared deviations for the jth treatment. The several sj² values could be pooled to have a collective estimate of the variance of all data.
sp² = [Σ_{j=1}^{J} (Ij − 1)sj²] / [Σ_{j=1}^{J} (Ij − 1)] = SSDw/(N − J) = MSw    (12.5)

Here SSDw means the sum of squared deviations within all treatments. And MSw means the mean squared deviation within, the average (per degree of freedom). If there is no treatment difference, then the pooled variance should ideally have the same value as the overall variance, and the difference should be zero. Alternately, if there is an effect of the treatments, then MST will be larger than MSw due to the treatment making one or more columns of values lower or higher than the others. The difference in the sums of squared deviations is one possible measure of the impact of the treatments.
SSDb = SSDT − SSDw    (12.6)
If there is no difference between treatments, then SSDb should be zero. The degrees of freedom (DoF) for SSDb is the difference in DoF for SSDT and SSDw.

νb = (N − 1) − (N − J) = J − 1    (12.7)
So, the per DoF impact of the treatments is the variance due to the treatments, or between treatments, scaled by the DoF.
sb² = SSDb/(J − 1) = MSb    (12.8)
The F-statistic is
F = MSb/MSw    (12.9)
with DoF for numerator and denominator of νb = J − 1 and νw = N − J. The F-value is ideally zero. So, reject the hypothesis that all treatments are equivalent if F is too large.
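A minimal sketch (not from the book) of Equations (12.1) through (12.9) in Python, assuming SciPy is available for the F-distribution tail probability:

```python
# One-way ANOVA from the sums of squared deviations, Equations (12.1)-(12.9).
import numpy as np
from scipy.stats import f as f_dist

def one_way_anova(columns):
    """columns: one list/array of replicate observations per treatment."""
    data = np.concatenate([np.asarray(c, dtype=float) for c in columns])
    N, J = data.size, len(columns)
    x_bar = data.mean()                                    # Eq. (12.1)
    ssd_total = ((data - x_bar) ** 2).sum()                # Eq. (12.2)
    ssd_w = sum(((np.asarray(c) - np.mean(c)) ** 2).sum()  # Eqs. (12.3)-(12.5)
                for c in columns)
    ssd_b = ssd_total - ssd_w                              # Eq. (12.6)
    F = (ssd_b / (J - 1)) / (ssd_w / (N - J))              # Eqs. (12.7)-(12.9)
    p_value = f_dist.sf(F, J - 1, N - J)                   # one-sided rejection
    return F, p_value
```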
Example 12.1: Data in the following table represent the outcome of four treatments. The experiments were controlled to reasonably exclude other possible influences on the outcome. The observations are replicates. Each column represents samples from the same population.

Treatment Data
n    T1       T2       T3       T4
1    30.51    31.66    28.95    29.82
2    29.63    32.55    30.57    29.96
3    31.46    30.71    30.8     31.29
4    30.42    30.87    30.91    30.21
5    30.87    31.31             29.88
6             31.66             28.32
7                               29.03
8                               30.46
9                               29.38
One-way ANOVA can be used to detect if the treatments have differing impact on the observations. The following table is the output from the Excel Data Analysis Add-In "ANOVA: Single Factor" with a 0.05 level of significance.

ANOVA: Single Factor

SUMMARY
Groups      Count    Sum       Average     Variance
Column 1    5        152.89    30.578      0.44787
Column 2    6        188.76    31.46       0.44024
Column 3    4        121.23    30.3075     0.839092
Column 4    9        268.35    29.81667    0.726675

ANOVA
Source of variation    SS          df    MS          F           P-value     F crit
Between groups         9.886041    3     3.295347    5.348133    0.007199    3.098391
Within groups          12.32336    20    0.616168
Total                  22.2094     23
The top section of the analysis table reveals the data count, average and variance values. The lower part of the analysis table indicates the sum of squares, DoF, MS, and F-values as discussed above. The F-value of the data, 5.348…, exceeds the F-critical value of 3.098…. Accordingly, the null hypothesis (that the treatments have no effect on the observation means) should be rejected. The p-value is 0.007…. There is only a 0.7% chance that treatments that have zero impact could generate data that has such an extreme F-value. Note: The test does not indicate whether one was worse or better than the other three, or that two were different from two. It just indicates that not all treatments were equivalent.
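The Excel output can be checked with SciPy's one-way ANOVA routine; a minimal sketch:

```python
# Reproduce the Example 12.1 F and p values with scipy.stats.f_oneway.
from scipy.stats import f_oneway

T1 = [30.51, 29.63, 31.46, 30.42, 30.87]
T2 = [31.66, 32.55, 30.71, 30.87, 31.31, 31.66]
T3 = [28.95, 30.57, 30.8, 30.91]
T4 = [29.82, 29.96, 31.29, 30.21, 29.88, 28.32, 29.03, 30.46, 29.38]

F, p = f_oneway(T1, T2, T3, T4)
print(f"F = {F:.4f}, p = {p:.6f}")   # F = 5.3481, p = 0.0072, as in the table
```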
Example 12.2: Samples of steel from four different batches were analyzed for carbon content. The results are shown below for quadruplicate determinations by the same analyst. Are the carbon contents (given in weight percent) of these batches the same? What are the 99% confidence limits on the average carbon content of each batch?

Percent carbon in steel batches
A       B       C       D
0.39    0.36    0.32    0.43
0.41    0.35    0.36    0.39
0.36    0.35    0.42    0.38
0.38    0.37    0.40    0.41
1. Assume the data are normally distributed.
2. H0: τj = 0 for all j (there is no treatment effect) vs HA: τj ≠ 0 for any j (a treatment effect exists).
3. The test statistic is F.
4. Set α = 0.05 for the ANOVA.
5. The critical value is F > F(J−1),J(I−1),1−α or F > F3,12,0.95.
6. The ANOVA table (Excel, Data Analysis Add-In, ANOVA: Single Factor):

ANOVA: Single Factor

SUMMARY
Groups      Count    Sum     Average    Variance
Column 1    4        1.54    0.385      0.000433
Column 2    4        1.43    0.3575     9.17E-05
Column 3    4        1.5     0.375      0.001967
Column 4    4        1.61    0.4025     0.000492

ANOVA
Source of variation    SS         df    MS          F           P-value     F crit
Between groups         0.00425    3     0.001417    1.899441    0.183559    3.490295
Within groups          0.00895    12    0.000746
Total                  0.0132     15
7. As F = 1.8994 < F3,12,0.95 = 3.49, we accept H0: τj = 0 for all j and conclude: The carbon content of the batches is probably not significantly different.

To find the 99% CI on the mean of each batch, we note that the standard error of a batch mean is √(MSEE/I) = √(0.000746/4) = 0.013655. Each interval is X̄j ± t12,0.995 (0.013655) ≅ X̄j ± 0.0417. The intervals are

(0.3433 < μ1 < 0.4267)
(0.3158 < μ2 < 0.3992)
(0.3333 < μ3 < 0.4167)
(0.3608 < μ4 < 0.4442)
All four confidence intervals overlap to some degree, providing further evidence (but not a valid statistical condition) that the carbon content of the batches could be the same.

Example 12.3: The life of four types of casing designs was estimated for gas wells in a particular field. Ten wells with the same design were chosen for each of the four types. To eliminate any bias due to variations in depth, only wells of the same relative depth were considered. Operators in this field are interested in determining whether or not these four designs have the same life. The results in projected service life (years) as a result of accelerated life testing follow. Do the designs differ?

Design 1    Design 2    Design 3    Design 4
13          45          21          30
15          36          14          20
20          10          30          32
21          15          28          36
18          12          19          18
16          28          25          40
14          20          16          22
30          25          13          19
15          30          12          28
12          12          29          35
ANOVA: Single Factor

SUMMARY
Groups      Count    Sum    Average    Variance
Column 1    10       174    17.4       28.04444
Column 2    10       233    23.3       134.9
Column 3    10       207    20.7       48.01111
Column 4    10       280    28         62

ANOVA
Source of variation    SS         df    MS          F           P-value    F crit
Between groups         600.5      3     200.1667    2.933322    0.04645    2.866266
Within groups          2,456.6    36    68.23889
Total                  3,057.1    39
As FDESIGN = 2.93 > F3,36,0.95 = 2.87, we reject H0 and conclude: Casing design probably does affect the service life. Our conclusion is strengthened by the p-value of 0.04645: There is only a 4.645% chance that designs with no effect would generate an F-value this large. Note: The test does not indicate whether one was worse or better than the other three, or that two were different from two. It just indicates that not all treatments were equivalent.
Example 12.4: Four vertical elutriators were used to obtain samples of the concentration of cotton dust in the open-end spinning room of a large textile mill. The results, in micrograms per cubic meter, are shown below. Do these samplers give equivalent results?

VE1      VE2      VE3      VE4
182.6    174.3    182.0    181.7
173.4    178.5    182.1    183.4
190.1    180.0    184.6    180.6
178.6             180.9
188.2
The hypothesis is that there is no effect of the sampling device on the analysis.

ANOVA: Single Factor

SUMMARY
Groups      Count    Sum      Average    Variance
Column 1    5        912.9    182.58     47.062
Column 2    3        532.8    177.6      8.73
Column 3    4        729.6    182.4      2.446667
Column 4    3        545.7    181.9      1.99

ANOVA
Source of variation    SS         df    MS          F          P-value     F crit
Between groups         55.032     3     18.344      0.92976    0.458762    3.587434
Within groups          217.028    11    19.72982
Total                  272.06     14
As the calculated value FTr = 0.93 < F3,11,0.95 = 3.59, we accept the null hypothesis and conclude: The vertical elutriators gave statistically equivalent results. Note: The p-value of 0.4587… is not near to a small value (such as 0.05) that would represent confidence in rejecting the hypothesis. Note: Statistically equivalent does not mean that they are the same. They may be different, but the variation masks detecting if they are different.
12.2.2 Alternate Analysis Approaches
There are four treatments in Example 12.1. The ANOVA does not indicate whether all, or just one, of the treatments reveals a difference. As an alternate analysis one could do a t-test on each pairing; there are six pairings in this case. The six p-values are indicated in Table 12.2 (a sketch of the computation follows the table). Now, however, there are 6 tests. For each there is a chance that a comparison of equivalent treatments will generate data that makes them appear different, a T-I error. If the desire is to have an overall T-I error probability of 0.05, then each comparison needs to have a lower threshold.
TABLE 12.2
p-Value Results of a t-Test on the Data of Example 12.1

      T2        T3        T4
T1    0.0565    0.6231    0.1121
T2              0.0486    0.0016
T3                        0.3681
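A minimal sketch of the pairwise comparisons behind Table 12.2 (assuming SciPy; equal variances are assumed in each two-sample t-test):

```python
# Pairwise t-tests for the Example 12.1 data, with the per-test threshold
# adjusted so the overall Type-I error probability stays near 0.05.
from itertools import combinations
from scipy.stats import ttest_ind

groups = {
    "T1": [30.51, 29.63, 31.46, 30.42, 30.87],
    "T2": [31.66, 32.55, 30.71, 30.87, 31.31, 31.66],
    "T3": [28.95, 30.57, 30.8, 30.91],
    "T4": [29.82, 29.96, 31.29, 30.21, 29.88, 28.32, 29.03, 30.46, 29.38],
}
alpha_individual = 1 - (1 - 0.05) ** (1 / 6)   # about 0.0085, as in the text

for (na, a), (nb, b) in combinations(groups.items(), 2):
    t, p = ttest_ind(a, b)
    print(f"{na}-{nb}: p = {p:.4f}", "reject" if p < alpha_individual else "")
```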
If there is no treatment difference, then the T1–T2 comparison should not reveal a difference, AND the T1–T3 should not, …, AND the T3–T4 should not. The AND conjunction indicates that the probabilities should be multiplied, and for a desired overall level of significance of 0.05, the individual test threshold needs to be αindividual = 1 − (1 − αoverall)^(1/6) = 1 − (1 − 0.05)^(1/6) = 0.0085124…. Table 12.2 indicates that the p-value of 0.0016 for the T2–T4 comparison falls below this 0.0085 threshold. That could trigger the null hypothesis rejection, and as well indicate where the improbable difference exists. As it happens, T2 is also near to having a rejectable difference from both T1 and T3. Although the ANOVA of Example 12.1 just revealed the probability of a difference, this t-test approach is a more detailed inspection of the individual differences and provides insight as to which treatment causes a difference.

We have compared the one-way ANOVA to the multiple t-tests on many simulated data cases and find that both are effective, but that the one-way ANOVA is a bit more powerful in detecting small differences than the multiple t-test approach. The common rule is to do an ANOVA first, and if a difference is detected, then look at the detail to see what the cause is.

There are also alternate ways to process the ANOVA ratio of variances. These have more-or-less the same results as the method in Section 12.2.1, which is the accepted standard.

12.2.3 Model for One-Way Analysis of Variance
The classic model for one-way ANOVA is
Yij = μ + τj + εij    (12.10)
where μ is the grand population mean, τj is the effect of the jth treatment (the jth column), and εij is the random error associated with the ijth observation on Y. As the grand population mean is a measure of location, the τj are measures of the displacement of a group mean μj from the overall mean. If we can show that the τj are probably zero, then presumably the differences in the column means are all zero. This situation would indicate that the treatment effects are probably nil, i.e., that there is no column effect. The corresponding null hypothesis is H0: τj = 0 vs HA: τj ≠ 0 for all j. We use the F-statistic of Equation (12.9) to evaluate the null hypothesis. The rejection region is F > F(J−1),J(I−1),1−α for constant I. Otherwise, the critical value of F is F(J−1),Σj(Ij−1),1−α for variable Ij. Here, we have a one-tailed null hypothesis involving equality of treatment effects. Why do we put the entire rejection region on one side of the distribution? Although the null hypothesis is stated as H0: τj = 0 for all j, the basis of the hypothesis is a different, but equivalent, null hypothesis that the difference between column means is ≤ 0. This original null hypothesis requires a one-tailed alternative hypothesis. Since the treatment effect τj
TABLE 12.3
Analysis of Variance for a Completely Randomized Design with Equal Numbers of Observations per Treatment

Source                  df          Normalized SS                               EMS
Mean                    1           (ΣᵢΣⱼ Yij)²/(IJ) = SSM                      —
Between columns         J − 1       Σ_{j=1}^{J}(Σ_{i=1}^{I} Yij)²/I − SSM       Model I: σ² + I Σⱼ τj²/(J − 1)
(treatments)                          = SSTr = SSB                              Model II: σ² + Iστ²
Within columns          J(I − 1)    ΣᵢΣⱼ Yij² − SSM − SSTr = SSEE = SSW         σ² = EMSEE
(experimental error)
Total                   IJ          ΣᵢΣⱼ Yij² = SST
measures the displacement of the column means μj from the overall mean μ, we are only interested in whether the τj are probably >0. If they are, then there probably is a measurable effect due to the treatments. As a result, the null hypothesis about the τj has as its true alternative HA: τj > 0 for all j, which requires a one-sided critical region. If the calculated value of F falls in that region, the null hypothesis is rejected, and we conclude that the treatment effects were probably nonzero.

Table 12.3 is the standard format for testing H0: τj = 0 for all j. The total sum of squares, SST, is important in subsequent analyses, as the sum of squares for experimental error, SSEE, is obtained by difference. The term EMS in Table 12.3 stands for "expected mean square," i.e., the contribution of the corresponding source of variation to the total population variance. We will not derive any of the EMSs for you; they are derived in many statistical theory texts. We include the EMS column in Table 12.3 and subsequent ANOVA tables only so that you can quickly identify the appropriate F-test for the variance components.

In a one-way ANOVA, the treatments can be either different classes of a discrete variable or different levels of the same continuous variable. This dichotomy in variable type requires different mathematical expressions for each type. Model I requires

Σ_{j=1}^{J} τj² = 0

and is only concerned with the J treatments (discrete variables) present in the experiment. Model I is thus a "fixed-effects" model, and the null hypothesis is that no treatment effects are present, i.e., τ1 = τ2 = … = τJ or H0: τj = 0 for all j. Examples of fixed treatments often involve equipment or processes: Types of spinning frames, soldering irons, heat exchangers, pumps, etc. Model II is concerned with random variables, which are assumed to be normally and independently distributed with mean 0 and variance στ². This concept is usually abbreviated as NID(0, στ²). By their very nature, such treatments are part of a continuous distribution: temperature, flow rate, concentration, etc. The fact that we have selected certain levels of such a variable as a treatment is immaterial: The variable is still part of a continuous, infinite distribution of possible values.
12.2.4 Subsampling in One-Way Analysis of Variance
This is often termed sampling with replicates.
Subsampling occurs when the samples (material collected for observation) are divided into subsamples before testing or analysis. For example, in making morning coffee for the family, the treatment might be the number of scoops of ground coffee beans. The sample might be the pot of coffee. If the coffee in the pot is divided into cups, one expects each cup to have identical properties. The cup of coffee would be a subsample. From pot to pot, even with the same recipe, the beans have variation, the measurement of bean volume has variation, the metering of water has variation, and the ambient temperature and humidity have variation. So, even with the same treatment, one expects pot-to-pot variation. If the coffee in each pot is perfectly mixed, the subsample in each cup should be identical. Then, differences in evaluations of the coffee subsamples would be due to measurement device error and sample handling procedures.

As another example, a scoop of fertilizer might be sampled from a manufacturing line, and the scoop divided into 3 replicate subsamples for lab analysis. The scoop represents the material produced by the process operating conditions, the treatment. All of the material within the scoop is expected to be homogeneous, so any variation in subsample-to-subsample analysis should represent lab chemical analysis variation. For the same treatment, scoop-to-scoop variation would represent the process variation of influence on the material.

A sample is a part or all of an experimental unit to which a treatment (a set of predetermined values of the independent variables) has been applied. A subsample, or replicate measurement, is the analysis of separate portions of the sample. Certain types of observations (temperature, pressure, etc.) cannot be divided, as those variables represent properties of the system or material. Other types can be readily subdivided. Samples of a pharmaceutical can be obtained, divided, and then analyzed. Many test specimens for the determination of tensile strength can be prepared from a single formulation of a polymer. On the other hand, resistors, integrated circuits, etc. cannot be subdivided for testing after manufacture. For this analysis each sample will be divided into the same number of subsamples.

Note that in a subsampling situation, n samples are taken, and each is divided into m subsamples. The subsamples are not independent observations but only provide an estimate of the sampling (sample preparation and analysis) error, ση². Differences in the results of subsamples within a sample provide an estimate of experimental measurement error. Differences in the results of samples from the same treatment reveal the combined effect of the natural variability of the process and the measurement. And differences between the treatments yield an error component associated with all three: The treatments, natural process variation, and measurement error. The model for one-way ANOVA with subsampling is

Yijk = μ + τj + εij + ηijk    (12.11)
where j = 1, 2, …, J are the treatments (levels), i = 1, 2, …, I are observations per treatment, and k = 1, 2, …, m are samples per observation (replicates or subsamples). In this model, τj represents the effect of the treatment. We assume that the experimental errors, εij, are normally and independently distributed with mean 0 and variance σ². In notational form, we write εij are NID(0, σ²). We also assume that the sampling errors ηijk are NID(0, ση²). In our notation, εij is the effect of the ith sample subjected to the jth treatment, and ηijk is the effect of the kth subsample from the ith sample subjected to the jth treatment.
TABLE 12.4
Analysis of Variance for Subsampling in a Completely Randomized Design (Equal Subclass Numbers)

Source                  df           Normalized SS                              EMS
Mean                    1            (ΣᵢΣⱼΣₖ Yijk)²/(JIm) = SSM                 —
Treatments (columns)    J − 1        Σⱼ(ΣᵢΣₖ Yijk)²/(mI) − SSM = SSTr           EMSI or EMSII
Experimental error      J(I − 1)     ΣᵢΣⱼ(Σₖ Yijk)²/m − SSM − SSTr = SSEE       ση² + mσ²
Sampling error          JI(m − 1)    ΣᵢΣⱼΣₖ Yijk² − SSM − SSTr − SSEE = SSSE    ση²
Total                   JIm          ΣᵢΣⱼΣₖ Yijk² = SST
The procedure for one-way ANOVA with subsampling is shown in Table 12.4. "Equal subclass numbers" indicates that each of the n samples is composed of exactly m subsamples. Two expected mean squares, EMSI and EMSII, are included in Table 12.4, corresponding to Model I (discrete variable) and Model II (continuous variable) treatment effects. The shorthand notation in the table uses a bold dot to indicate a summation. The location of the dot subscript tells you whether the summation is over the rows or the columns. The sum of observations in the ith row is thus Yi·, indicating that all values in the ith row have been added, or

Yi· = Σ_{j=1}^{J} Yij    (12.12)

Similarly, to sum all the rows (entries) in the jth column, we write

Y·j = Σ_{i=1}^{I} Yij    (12.13)

The individual row and column means are represented as

Ȳi· = Yi·/J and Ȳ·j = Y·j/I    (12.14)
Note: Each sum has been divided by the number of entries in the row or column, respectively, that contributed to the sum. The overall sum of the matrix or data array is found by summing in one direction and then adding those sums to obtain the overall or total sum, which is

Y·· = Σ_{i=1}^{I} (Σ_{j=1}^{J} Yij) = Σ_{i=1}^{I} Yi· = Σ_{j=1}^{J} Y·j    (12.15)
The overall, or grand, mean is the sum of the array divided by the number of members, or

Ȳ·· = Y··/(IJ)    (12.16)
We will further simplify the notation by dropping the explicit statement of the range for all summations, so that

Σᵢ = Σ_{i=1}^{I} and Σⱼ = Σ_{j=1}^{J}    (12.17)
We will continue to use uppercase letters for random variables and lowercase letters for individual values of the corresponding variable. The expected mean squares for these models are

EMSI = ση² + mσ² + Im Σ_{j=1}^{J} τj²/(J − 1)    (12.18)

and

EMSII = ση² + mσ² + Imστ²    (12.19)
The ANOVA procedure to follow when subsampling is chosen is to compare FEE = MSEE/MSSE to the critical region for experimental error. If the ratio MSEE/MSSE is significantly greater than one, the null hypothesis of no experimental error is rejected. In that case, treatments are tested against experimental error by FTr = MSTr/MSEE. If experimental error is not significant (H0: σ² = 0 is accepted), sampling error ση² is the primary error source. In such a case, the best approach is to pool the experimental error and sampling error variances to obtain a more accurate variance estimate. If you adopt this approach, you'll also have to pool the degrees of freedom. The result is also an increase in the precision with which you can estimate the effect of the treatments by FTr = MSTr/MSPE, where
MSPE = (SSEE + SSSE)/(dfEE + dfSE)    (12.20)
Another use of the expected mean squares is in providing methods for estimating individual variance components. If you need an estimate (S²) of σ², the true experimental error variance, it can be obtained from the error mean squares as S² = (MSEE − MSSE)/m.

Example 12.5: A packed tower is used to absorb ammonia from a gas into a countercurrent flowing liquid. In the evaluation of a new tower packing, three samples of the "cleaned" gas were taken for ammonia analysis at each of five feed-gas concentrations. Four subsamples were taken from each sample bag. The results are shown below in ppm ammonia. Did the inlet ammonia concentration affect the performance of this pilot-scale absorber?
Ammonia Concentration (ppm) in Absorber Outlet Gas

% ammonia in entering gas phase:    4 (j=1)    8 (j=2)    12 (j=3)    16 (j=4)    20 (j=5)
I = 1    k=1                        12         21         37          48          60
         k=2                        13         24         37          42          62
         k=3                        18         26         31          50          57
         k=4                        17         20         33          46          71
I = 2    k=1                        14         25         36          43          66
         k=2                        15         27         34          49          64
         k=3                        13         22         31          51          59
         k=4                        10         24         39          41          65
I = 3    k=1                        11         19         36          49          63
         k=2                        14         23         37          52          61
         k=3                        12         21         34          48          58
         k=4                        19         28         32          47          70

Y·j·                                168        280        417         566         756
Y·j·²                               28,224     78,400     173,889     320,356     571,536

Σj Y·j·² = 1,172,405
Row sums: Y1·· = 725, Y2·· = 728, Y3·· = 734; Yi··² = 525,625; 529,984; 538,756; Σ Yi··² = 1,594,365
Y··· = 2187;  ΣᵢΣⱼ Yij·² = 390,959;  SSM = 79,716.15;  SST = 98,307
This example involves sampling (three bags at each inlet gas concentration treatment) and subsampling (four aliquots taken from each bag). We show all the pertinent arithmetic below and to the right of the original data so that you can more easily follow the calculations. The following additional calculations are needed:
SSTr = Σj Y·j·²/12 − SSM = 1,172,405/12 − 79,716.15 = 17,984.27

SSEE = ΣᵢΣⱼ Yij·²/4 − SSM − SSTr = 390,959/4 − 79,716.15 − 17,984.27 = 39.33

SSSE = SST − SSM − SSTr − SSEE = 567.25
The analysis of variance table for this random effects (Model II) experiment is shown below. The treatment, inlet ammonia concentration, is a continuous variable.

Source                df    SS              MS             EMS
Mean                  1     79,716.15       —              —
Treatment             4     17,984.26667    4,496.06667    ση² + mσ² + mIστ²
Experimental error    10    39.33333        3.93333        ση² + mσ²
Sampling error        45    567.25          12.60555       ση²
Total                 60    98,307
The first null hypothesis is H0¹: σ² = 0, or that there is no experimental error contribution to the overall variance. As

F1 = MSEE/MSSE = 3.93333/12.60555 = 0.312

is less than F10,45,0.95 ≃ 2.055, we accept H0¹ as probably true. Thus, we conclude: The major error source was due to subsampling (creating the subsamples and laboratory analysis). As we already have 45 DoF for the sampling term, there is no need to pool the error sources to improve the quality of our test on treatment effects. We already have enough DoF for a highly precise test.

Our second null hypothesis is H0²: στ² = 0 vs. HA²: στ² ≠ 0. To test H0², we calculate

F2 = MSTr/MSSE = 356.66

which is greater than F4,45,0.95 ≃ 2.585. We reject H0² (it is probably false) and conclude: Absorber performance as measured by outlet ammonia concentration is probably affected by inlet gas concentration.
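A minimal sketch (not from the book) of the Example 12.5 sums-of-squares arithmetic, holding the data as a treatments × samples × subsamples array:

```python
# Sums of squares for subsampling ANOVA (Table 12.4), Example 12.5 data.
import numpy as np

Y = np.array([  # [treatment j][sample i][subsample k], ppm ammonia
    [[12, 13, 18, 17], [14, 15, 13, 10], [11, 14, 12, 19]],
    [[21, 24, 26, 20], [25, 27, 22, 24], [19, 23, 21, 28]],
    [[37, 37, 31, 33], [36, 34, 31, 39], [36, 37, 34, 32]],
    [[48, 42, 50, 46], [43, 49, 51, 41], [49, 52, 48, 47]],
    [[60, 62, 57, 71], [66, 64, 59, 65], [63, 61, 58, 70]],
], dtype=float)
J, I, m = Y.shape

ss_m = Y.sum() ** 2 / (J * I * m)                         # 79,716.15
ss_tr = (Y.sum(axis=(1, 2)) ** 2).sum() / (I * m) - ss_m  # 17,984.27
ss_ee = (Y.sum(axis=2) ** 2).sum() / m - ss_m - ss_tr     # 39.33
ss_se = (Y ** 2).sum() - ss_m - ss_tr - ss_ee             # 567.25

f1 = (ss_ee / (J * (I - 1))) / (ss_se / (J * I * (m - 1)))
print(f"F1 = MS_EE/MS_SE = {f1:.3f}")                     # about 0.312
```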
12.3 Two-Way Analysis of Variance
If two independent variables affect the dependent response, we will have to use two-way analysis of variance. In this situation, the experimental data are tabulated as in Table 12.1 so that the rows represent values (levels) of one of the independent variables and the columns represent values (levels) of the other independent variable. Each Yij entry in the table is the value of the dependent variable resulting from the corresponding treatment combination, i.e., the particular combination of the two independent variables that caused the response. In our discussion, we'll use αi for the row variable and βj for the column variable. This analysis presumes no missing data: the number of entries in each column is the same, I1 = I2 = I3 = ⋯. Also, this first analysis is for the case in which there is only one observation per treatment combination – without replicates.

12.3.1 Model for Two-Way Analysis of Variance
The first of three assumptions for the two-way model is that the Yij are normally and independently distributed with mean μ and variance σ² (NID(μ, σ²)). We then write the two-way model as
Yij = μ + αi + βj + εij,  i = 1, …, a; j = 1, …, b    (12.21)
where μ is the contribution of the grand mean μ·· to Yij, αi = μi· − μ·· is the contribution of the ith level of the row variable, βj = μ·j − μ·· is the contribution of the jth level of the column variable, and εij is the random experimental error. In this simple two-way ANOVA model, the second assumption is that the row and column variables are simply additive, i.e., that no interaction terms exist.
There are two versions of the third assumption. One version is that Σᵢ αi² = 0 and Σⱼ βj² = 0. The two resulting hypotheses for this fixed-effects case (Model I) are

H0: αi = 0 for all i vs. HA: αi ≠ 0 for any i

and

H0: βj = 0 for all j vs. HA: βj ≠ 0 for any j

The other way the third assumption for the two-way model can be stated is σTr² = 0. The resulting hypotheses for the Model II case are

H0: σα² = 0 vs. HA: σα² ≠ 0

and

H0: σβ² = 0 vs. HA: σβ² ≠ 0
12.3.2 Two-Way Analysis of Variance Without Replicates
Table 12.5 shows the ANOVA calculations for the simple two-way case with only one observation per (i, j) combination. You should observe that in this simple two-way ANOVA with only one observation per treatment combination, the F-test compares the treatments to experimental error by F = MSTr/MSEE, just as in the one-way case. To determine the proper F-ratio for each test, look at the expected mean squares and remember the null hypothesis is that there is no treatment effect. The value of the ratio EMSA/EMSEE will be 1 if the null hypothesis is true, because the ratio will be σ²/σ². Therefore, FA = MSA/MSEE is the proper statistic to test H0: Σ αi² = 0. If the null hypothesis is false, the ratio of expected mean squares will not be 1, and the corresponding value of FA will be significantly different from 1, leading us to accept HA: σA² ≠ 0.
A
df
Normalized SS
EMS
1
Y××2 = SS M ab
—
a−1
åY i
2 i×
b
- SS M = SS A
s2 +
b
åa i
2 i
a -1
Model II
s 2 + bs a2 B
b−1
åY j
a
2 ×j
- SS M = SS B
s2 +
a
åb j
b -1
s 2 + as a2 Error Total
(a – 1) (b – 1) ab
SS T - SS A - SS B - SS M = SS E 2 ij
å å Y = SS T
σ
2
Model I
2 j
Model I Model II
Example 12.6: The torque outputs for several pneumatic actuators at several different supply pressures are given in the table below.
Torque (in.·lbf)
Actuator type    60 psi    80 psi    100 psi
A                205       270       340
B                515       700       880
C                1,775     2,450     3,100
D                7,200     9,600     12,100
(a) Do the different supply pressures significantly affect the output? (b) Does the output vary for the different actuator types? Using our standard solution format, we obtain the results below.
1. Assume that both populations (actuator type and pressure) are normally distributed.
2. As this example involves a mixed model (actuator type = rows are fixed, pressure = columns are random), the hypotheses are

H0¹: αi = 0 for all i vs. HA¹: αi ≠ 0 for any i

H0²: σβ² = 0 vs. HA²: σβ² ≠ 0

3. The test statistic for both null hypotheses is F.
4. Set α = 0.05.
5. The critical region for H0¹ is FR > F3,6,0.95 = 4.76, and the critical region for H0² is FC > F2,6,0.95 = 5.14.
6. The ANOVA table gives the calculated values for the row variable (actuator type) and the column variable (pressure) – Excel, Data Analysis Add-In, ANOVA: Two-Factor Without Replication:
ANOVA: Two-Factor Without Replication

SUMMARY
           Count    Sum       Average      Variance
A          3        815       271.6667     4,558.333
B          3        2,095     698.3333     33,308.33
C          3        7,325     2,441.667    438,958.3
D          3        28,900    9,633.333    6,003,333
60 psi     4        9,695     2,423.75     10,599,873
80 psi     4        13,020    3,255        18,781,767
100 psi    4        16,420    4,105        29,835,300
ANOVA
Source of variation    SS           df    MS            F           P-value     F crit
Rows                   1.7E+08      3     56,781,313    46.62563    0.00015     4.757063
Columns                5,653,438    2     2,826,719     2.321143    0.179205    5.143253
Error                  7,306,879    6     1,217,813
Total                  1.83E+08     11
7. As FR = 46.63 > F3,6,0.95 = 4.76, we reject H0¹ and conclude:
Pneumatic actuator type probably affects the torque output. Notably, the p-value of 0.00015 permits a very strong claim.

As FC = 2.32 < F2,6,0.95 = 5.14, we do not reject H0². The result is somewhat surprising. If we look at the original data, the effect of pressure on output appears obvious and consistent for each of the actuators. However, just as love and hate are opposites, if you do not hate someone, it does not mean that you love that person. You could feel ambivalent, or you could love them but not be sure. The reason that we cannot reject H0² from the data is that the variability in the torque due to the pressure change is small compared to the total variability in the torque due to all effects. Had pressure spanned a greater range, perhaps 10 to 100 psi, or had actuator type D not been included, there could have been enough evidence to reject H0². Although it is conventional to equate "not reject" to "accept", it can lead to the same erroneous conclusions as equating "not hate" and "love". Thus, our answer to part (b) is: There is insufficient evidence to reject H0²: Pressure has a relatively inconsequential effect on torque.

We introduce a new concept to aid in reconciling the results of this ANOVA. The variance of a treatment mean in general is the mean square for experimental error divided by the number of observations per treatment. Here,

VAR(ȳi·) = s²ȳ = MSEE/4 = 1,217,813.195/4 = 304,453

which is small compared to MSActuator but large compared to MSPressure. The results thus tell us that the variability in response (torque output) induced by pressure is not sufficiently larger than experimental error to cause a situation in which σPressure² is so large (≫0) that (σ² + σPressure²)/σ² is clearly greater than 1 as determined by the F-test.
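A minimal sketch (not from the book) of the Table 12.5 sums of squares for the Example 12.6 data:

```python
# Two-way ANOVA without replication (Table 12.5), Example 12.6 torque data.
import numpy as np
from scipy.stats import f as f_dist

Y = np.array([[205, 270, 340],          # rows: actuators A-D
              [515, 700, 880],          # columns: 60, 80, 100 psi
              [1775, 2450, 3100],
              [7200, 9600, 12100]], dtype=float)
a, b = Y.shape

ss_m = Y.sum() ** 2 / (a * b)
ss_a = (Y.sum(axis=1) ** 2).sum() / b - ss_m          # rows (actuator)
ss_b = (Y.sum(axis=0) ** 2).sum() / a - ss_m          # columns (pressure)
ss_e = (Y ** 2).sum() - ss_m - ss_a - ss_b

ms_e = ss_e / ((a - 1) * (b - 1))
f_rows = (ss_a / (a - 1)) / ms_e                      # about 46.63
f_cols = (ss_b / (b - 1)) / ms_e                      # about 2.32
print(f"F_rows = {f_rows:.2f} vs crit {f_dist.ppf(0.95, a - 1, (a-1)*(b-1)):.2f}")
print(f"F_cols = {f_cols:.2f} vs crit {f_dist.ppf(0.95, b - 1, (a-1)*(b-1)):.2f}")
```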
12.3.3 Interaction in Two-Way ANOVA
The elementary situation described by Equation (12.21) needs to be extended when you are interested in evaluating interactions between treatments rather than simply looking at the effects of the treatments alone. Two-way (and higher) analyses of variance can identify interactions provided that the treatment combinations are replicated (repeated independently as exactly as experimental conditions will allow). Interactions are of two types: Antagonistic and synergistic. You are probably familiar with the combined and almost immediate
effects of particulate air pollution and the presence of SO2 (or smog) on morbidity. When this combination is present, the incidence of respiratory problems increases dramatically. This is an example of a synergistic interaction: Together, the two types of air pollutants are worse than the sum of their individual effects. An antagonistic interaction occurs when the effect of one treatment tends to diminish the effect of the other. For the situation in which T treatments are composed of independently repeated combinations of a levels of variable A and b levels of variable B, the two-way analysis of variance model, Equation (12.21), must be rewritten to include the presence of an interaction term. The revised model is

Yijk = μ + αi + βj + (αβ)ij + εijk    (12.22)
where μ is the true mean effect, αi is the true effect of the ith level of factor A, βj is the true effect of the jth level of factor B, (αβ)ij is the true effect of the interaction of the ith level of factor A with the jth level of factor B, and εijk is the true effect of the kth experimental unit subjected to the (ij)th treatment combination, i = 1, 2, …, a, j = 1, 2, …, b, k = 1, 2, …, n. The usual assumption is made that the εijk are NID(0, σ2). Note: The true interaction might not be the simple product αβ as this model presumes. Nature expresses many types of nonlinear mechanisms. So, if this method detects interaction, it does not necessarily mean the interaction is the simple product type. And, if this model does not detect interaction, the interaction might just be in another form. Four possible sets of assumptions can be made regarding the treatments. The first set is that we are only concerned with fixed effects αi and βj of factors A and B. This Model I assumption says that we are interested only in the a levels of factor A and the b levels of factor B actually present in the experiment. These assumptions are stated as
$$ \sum_i \alpha_i = \sum_j \beta_j = \sum_i (\alpha\beta)_{ij} = \sum_j (\alpha\beta)_{ij} = 0 \quad (12.23) $$
Table 12.6 summarizes calculations for two-way analysis of variance with n replications. (Note that the treatments and the interaction are tested against error.) The second set of assumptions is that αi and βj are random effects, or that a Model II situation exists. In this case, we are concerned with both populations of all possible values of factors A and B of which only a random sample is present. We summarize the pertinent assumptions about the populations as follows:
$$ \alpha_i \ \text{are NID}\left(0, \sigma_\alpha^2\right) \quad (12.24) $$

$$ \beta_j \ \text{are NID}\left(0, \sigma_\beta^2\right) \quad (12.25) $$

$$ (\alpha\beta)_{ij} \ \text{are NID}\left(0, \sigma_{\alpha\beta}^2\right) \quad (12.26) $$
The corresponding analysis of variance for this model is given in Table 12.7. From the entries in the EMS column of Table 12.7, you notice that the first F-test is interaction vs error to evaluate H0: σ2αβ = 0. If that hypothesis is rejected as probably false, the interaction term is the most significant (greatest) source of error, and the treatments are tested against the interaction.
TABLE 12.6 Two-Way Analysis of Variance with Interaction (Model I)

Source | df | Normalized SS | EMS
Mean | 1 | $\left(\sum_i \sum_j \sum_k Y_{ijk}\right)^2 / abn = SS_M$ | —
A | $a-1$ | $\sum_i \left(\sum_j \sum_k Y_{ijk}\right)^2 / bn - SS_M = SS_A$ | $\sigma^2 + nb \sum_i \alpha_i^2 / (a-1)$
B | $b-1$ | $\sum_j \left(\sum_i \sum_k Y_{ijk}\right)^2 / an - SS_M = SS_B$ | $\sigma^2 + na \sum_j \beta_j^2 / (b-1)$
AB | $(a-1)(b-1)$ | $\sum_i \sum_j \left(\sum_k Y_{ijk}\right)^2 / n - SS_M - SS_A - SS_B = SS_{AB}$ | $\sigma^2 + n \sum_i \sum_j (\alpha\beta)_{ij}^2 / \left[(a-1)(b-1)\right]$
Error | $ab(n-1)$ | $\sum_i \sum_j \sum_k Y_{ijk}^2 - SS_M - SS_A - SS_B - SS_{AB} = SS_E$ | $\sigma^2$
Total | $abn$ | $\sum_i \sum_j \sum_k Y_{ijk}^2 = SS_T$ |
TABLE 12.7 Two-Way Analysis of Variance with Interaction (Model II)

Source | df | Normalized SS | EMS
Mean | 1 | $\left(\sum_i \sum_j \sum_k Y_{ijk}\right)^2 / abn = SS_M$ | —
A | $a-1$ | $\sum_i \left(\sum_j \sum_k Y_{ijk}\right)^2 / bn - SS_M = SS_A$ | $\sigma^2 + n\sigma_{\alpha\beta}^2 + nb\sigma_\alpha^2$
B | $b-1$ | $\sum_j \left(\sum_i \sum_k Y_{ijk}\right)^2 / an - SS_M = SS_B$ | $\sigma^2 + n\sigma_{\alpha\beta}^2 + na\sigma_\beta^2$
AB | $(a-1)(b-1)$ | $\sum_i \sum_j \left(\sum_k Y_{ijk}\right)^2 / n - SS_M - SS_A - SS_B = SS_{AB}$ | $\sigma^2 + n\sigma_{\alpha\beta}^2$
Error | $ab(n-1)$ | $\sum_i \sum_j \sum_k Y_{ijk}^2 - SS_M - SS_A - SS_B - SS_{AB} = SS_E$ | $\sigma^2$
Total | $abn$ | $\sum_i \sum_j \sum_k Y_{ijk}^2 = SS_T$ |
If the first F-test shows that the interaction is probably insignificant when compared to error, the treatments are tested against the error. The other two sets of assumptions yield mixed models: one variable is of the fixed type, the other is randomly distributed. By convention, in the third model (III), α is considered fixed and β is random. In the fourth model (IV), α is considered random and β is fixed. The assumptions for these two models are as follows:
TABLE 12.8 F-Ratios for Hypothesis Testing in Completely Randomized Design with Factorial Treatment Combinations

Source     Model I         Model II        Model III           Model IV
Effects    α, β: fixed     α, β: random    α fixed, β random   α random, β fixed
Mean       —               —               —                   —
A          MS_A/MS_E       MS_A/MS_AB      MS_A/MS_AB          MS_A/MS_E
B          MS_B/MS_E       MS_B/MS_AB      MS_B/MS_E           MS_B/MS_AB
AB         MS_AB/MS_E      MS_AB/MS_E      MS_AB/MS_E          MS_AB/MS_E
E          —               —               —                   —
$$ \text{Model III:}\quad \sum_i \alpha_i = \sum_i (\alpha\beta)_{ij} = 0,\quad \beta_j \ \text{are NID}\left(0, \sigma_\beta^2\right) \quad (12.27) $$

$$ \text{Model IV:}\quad \sum_j \beta_j = \sum_j (\alpha\beta)_{ij} = 0,\quad \alpha_i \ \text{are NID}\left(0, \sigma_\alpha^2\right) \quad (12.28) $$
The assumptions of the four ANOVA models are summarized in Table 12.8. The F-tests of treatment effects for mixed models are accomplished in a simple fashion: Always test the interaction vs error. Test the fixed term vs the interaction term. Test the random term vs error. If the interaction is not significantly different from error, all treatments are tested vs error. If, however, the hypothesis H0: (αβ)ij = 0, for each i, j pair, is rejected, then the acceptance of H0: αi = 0 (or H0: βj = 0) over its range should be interpreted to mean that there is probably no significant difference in the levels of A (or B) when averaged over the levels of B (or A). The F-ratios used for hypothesis testing in two-way ANOVA are summarized in Table 12.8.

Example 12.7: The effect of temperature and steam/hydrocarbon ratio (S/HC) on ethylene production in a cracking furnace gave the yield data below. One sample was obtained from each replicate pilot-plant run. What conclusions should be drawn from these data?
           T1       T2       T3       T4
(S/HC)1    38, 40   42, 41   43, 45   42, 40
(S/HC)2    36, 37   40, 44   46, 45   44, 42
(S/HC)3    39, 37   43, 42   44, 44   43, 42
The results follow.
1. Assume that the steam/HC and the temperature populations are normally distributed for this Model II situation.
2. The hypotheses are H01: σ2RATIO = 0 vs. HA1: σ2RATIO ≠ 0; H02: σ2TEMP = 0 vs. HA2: σ2TEMP ≠ 0; and H03: σ2TEMP*RATIO = 0 vs. HA3: σ2TEMP*RATIO ≠ 0.
3. The test statistic in all cases is F.
4. Set α = 0.05.
5 & 6.

$$ SS_T = \sum_i \sum_j \sum_k Y_{ijk}^2 = 41{,}757 $$

$$ SS_M = \frac{\left(\sum\sum\sum Y_{ijk}\right)^2}{abn} = 41{,}583.375 $$

$$ SS_{RATIO} = \sum_i \frac{Y_{i\cdot\cdot}^2}{bn} - SS_M = 0.75 $$

$$ SS_{TEMP} = \sum_j \frac{Y_{\cdot j \cdot}^2}{an} - SS_M = 138.4583 $$

$$ SS_{TEMP*RATIO} = \sum_i \sum_j \frac{Y_{ij\cdot}^2}{n} - SS_M - SS_{TEMP} - SS_{RATIO} = 13.9167 $$

$$ SS_{ERROR} = SS_T - SS_{RATIO} - SS_{TEMP} - SS_{TEMP*RATIO} - SS_M = 20.5 $$

$$ F_{TEMP*RATIO} = \frac{MS_{TEMP*RATIO}}{MS_{ERROR}} = 1.358 < F_{6,12,0.95} = 3.00 $$
7. From the test of H03, we conclude that the interaction is probably not significant. Testing the treatment terms vs the interaction term is futile, given the result of the H03 test. We elect to pool the sums of squares of the interaction and error terms (don't forget to pool the degrees of freedom also) to increase the precision of the treatment tests:

$$ MS_{PE} = \frac{SS_{TEMP*RATIO} + SS_{ERROR}}{6 + 12} = 1.912 $$

Proceeding to test the treatments against the pooled error, we have

$$ F_{RATIO} = \frac{MS_{RATIO}}{MS_{PE}} = 0.1961 < F_{2,18,0.95} = 3.555 $$

$$ F_{TEMP} = \frac{MS_{TEMP}}{MS_{PE}} = 24.138 > F_{3,18,0.95} = 3.16 $$
We conclude:

1. The interaction is not significant.
2. The range of S/HC ratios studied does not reveal a significant effect.
3. Temperature probably affects yield.
The Excel Add-in, Data Analysis, procedure "ANOVA: Two-Factor with Replication" returns nearly identical results. (It does not pool items after rejecting the interaction term.)
ANOVA: Two-Factor with Replication

Summary      T1        T2        T3        T4        Total
(S/HC)1
  Count       2         2         2         2         8
  Sum        78        83        88        82       331
  Average    39        41.5      44        41        41.375
  Variance    2         0.5       2         2         4.553571
(S/HC)2
  Count       2         2         2         2         8
  Sum        73        84        91        86       334
  Average    36.5      42        45.5      43        41.75
  Variance    0.5       8         0.5       2        13.92857
(S/HC)3
  Count       2         2         2         2         8
  Sum        76        85        88        85       334
  Average    38        42.5      44        42.5      41.75
  Variance    2         0.5       0         0.5       6.214286
Total
  Count       6         6         6         6
  Sum       227       252       267       253
  Average    37.83333  42        44.5      42.16667
  Variance    2.166667  2         1.1       1.766667
ANOVA

Source of variation    SS          df    MS          F          P-value     F crit
Sample                 0.75         2    0.375       0.219512   0.806064    3.885294
Columns                138.4583     3    46.15278    27.01626   1.27E-05    3.490295
Interaction            13.91667     6    2.319444    1.357724   0.306351    2.99612
Within                 20.5        12    1.708333
Total                  173.625     23
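The Excel output above stops short of the pooling step. The following minimal VBA sketch carries out the pooled-error tests of step 7; the procedure and variable names are illustrative, the sums of squares are those computed in steps 5 & 6, and the WorksheetFunction calls require the Excel host.

    Sub PooledErrorTests()
        Dim SSRatio As Double, SSTemp As Double, SSInter As Double, SSError As Double
        SSRatio = 0.75: SSTemp = 138.4583
        SSInter = 13.9167: SSError = 20.5

        ' Pool the interaction and error terms: SS and df both add
        Dim MSPE As Double
        MSPE = (SSInter + SSError) / (6 + 12)    ' = 1.912

        Dim FRatio As Double, FTemp As Double
        FRatio = (SSRatio / 2) / MSPE    ' = 0.196
        FTemp = (SSTemp / 3) / MSPE      ' = 24.14

        Debug.Print "S/HC ratio:  F ="; FRatio; " p ="; WorksheetFunction.F_Dist_RT(FRatio, 2, 18)
        Debug.Print "Temperature: F ="; FTemp; " p ="; WorksheetFunction.F_Dist_RT(FTemp, 3, 18)
    End Sub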
Example 12.8: The masses of airborne cotton dust on sampling filters at a yarn mill are listed below for all four shifts in the warehouse and the bale-opening areas. The same five OSHA-approved samplers were used throughout this annual compliance test. Is there a difference (α = 0.05) in the results between shifts? Is one of these mill locations dustier than the other?

Cotton dust sample weight (μg)

Site            Shift A    Shift B    Shift C    Shift D
Warehouse       610        350        515        635
                830        130        635        355
                630        380        485        855
                660        460        465        655
                490        400        405        685
Bale opening    430        170        845        870
                380        250        765        815
                690        270        605        495
                500        360        610        530
                330        370        670        460
The Excel Add-in, Data Analysis, procedure "ANOVA: Two-Factor with Replication" returns the following results.

ANOVA: Two-Factor with Replication

Summary        Shift A      Shift B      Shift C      Shift D      Total
Warehouse
  Count            5            5            5            5           20
  Sum          3,220        1,720        2,505        3,185       10,630
  Average        644          344          501          637          531.5
  Variance    14,980       15,930        7,230       32,420       30,610.79
Bale opening
  Count            5            5            5            5           20
  Sum          2,330        1,420        3,495        3,170       10,415
  Average        466          284          699          634          520.75
  Variance    19,630        6,880       10,817.5     37,217.5     42,969.14
Total
  Count           10           10           10           10
  Sum          5,550        3,140        6,000        6,355
  Average        555          314          600          635.5
  Variance    24,183.33    11,137.78    18,911.11    30,952.5

ANOVA

Source of variation    SS            df    MS           F          P-value     F crit
Sample                 1,155.625      1    1,155.625    0.063712   0.802336    4.149097
Columns                632,511.9      3    210,837.3    11.62398   2.59E-05    2.90112
Interaction            185,086.9      3    61,695.63    3.401433   0.029416    2.90112
Within                 580,420       32    18,138.13
Total                  1,399,174     39
From these results, we reject H01: Σβj = 0 and H03: ΣΣ(αβ)ij = 0, and do not reject H02: Σαi = 0. Our conclusions are:
1. An interaction between shift and work area probably exists.
2. There probably is a difference between shifts.
3. There is probably no significant difference due to work areas.
Statistics are just numbers unless you use them to help you improve safety, quality, production rate, etc. Use the statistics to point to justified actions. How might we interpret the significance of these results? Should the people on Shift B be praised and those on Shift D be reprimanded? Find the cause for the difference before taking action. In this case, most routine and preventive maintenance is done on the day (A) shift. Only breakdowns are repaired during the night (B) and graveyard (C) shifts. As the swing shift (D) rotates, almost any situation may occur. What does maintenance have to do with dust level? More maintenance creates both more nuisance dust and more of the respirable dust sampled by the samplers. Was the crew in one of those areas trying to “go over the top” when the compliance test was run and so put the squeeze on the people in the warehouse or the opening/cleaning area for a higher throughput? Use the statistical results to help analyze cause-and-effect mechanisms.
12.3.4 Two-Way Analysis of Variance with Replicates

The analysis is similar to that above. The Excel Data Analysis procedure "ANOVA: Two-Factor with Replication" can provide the analysis.
12.4 Takeaway

ANOVA just seeks statistical evidence of linear correlation. It does not assign causation. It is very useful as a screening tool to see if there is a relation between treatments and data, which can guide model/relation/cause-and-effect development efforts. Excel Data Analysis has limited power, but it is very convenient for basic cases. If the application is important, use professional statistical software. If the treatments are values of continuum or discrete variables, then the correlation analysis of Chapter 13 may be as effective as, or more effective than, two-way ANOVA.
12.5 Exercises
1. An oil-cracking unit processed three kinds of heavy oil (30% hydro-treated, 50% hydro-treated, and 70% hydro-treated) to reduce their viscosities. Samples of the output were taken each day for 4 days to analyze the boiling point of the product. The boiling point test was carried out four times for each sample.

Boiling Point (°F) for Processed Oils

Sample day    Feed 30%              50%                   70%
1             425, 431, 436, 433    457, 462, 460, 455    510, 507, 500, 505
2             431, 423, 427, 429    460, 456, 463, 465    500, 510, 495, 498
3             428, 437, 436, 431    482, 476, 480, 475    505, 511, 506, 513
4             433, 435, 425, 430    470, 476, 467, 465    513, 505, 507, 510
Prepare the analysis of variance for these data and interpret the results.

2. The following data give the yields of a product that resulted from trying catalysts from four different suppliers. (a) Are yields influenced by catalysts? (b) What are your recommendations in the selection of a catalyst to obtain the greatest yield? Assume that economics dictate 95% probability of being right on a decision.

Catalyst
I     II    III    IV
36    35    35     34
33    37    39     31
35    36    37     35
34    35    38     32
32    37    39     34
34    36    38     33
3. A solution of hot potassium carbonate (HPC) is used to scrub CO2 from the gas feed stock in the production of an intermediate in the manufacture of nylon. It has been proposed that the addition of an amine solution to the HPC will increase the purity of the scrubbed gas. The pilot-plant data follow. Does the additive have the desired effect?

                       HPC flow rate (gal/hr)
Amine conc. (Wt. %)    0.5     0.7     0.9     1.1
0.5                    1.64    1.32    0.76    0.62
1.0                    1.31    1.09    0.47    0.38
3.0                    0.99    0.81    0.15    0.08
6.0                    0.61    0.45    0.04    0.01
4. The removal efficiency of a pilot-scale gas absorber is presumed to be a function of liquid rate A and the presence or absence of a buffer B. For the following data collected as a 2² factorial with r = three replications, evaluate the probable significance of the variance components. Replicates are used as blocks, and the form of data presentation is conventionally rotated 90° counterclockwise for convenience of display.

               Without buffer        With buffer
Replication    6 GPH    12 GPH       6 GPH    12 GPH
1              21       37           48       60
2              24       37           42       62
3              26       31           50       57
5. In Example 12.2 the samples from four batches were all analyzed by the same technician. Suppose, alternately, that each manufacturing source analyzed their own samples in their own labs. Could the data be used to claim that the manufacturers' products are equivalent?
13 Correlation
13.1 Introduction

Correlation means that values of two variables rise or fall with each other. Your personal experience might agree with these examples: 1) The number of apples a tree yields rises with the size of the tree. 2) The noise on a school playground rises with the number of students on the playground. 3) Road noise in a car rises with the car speed. 4) Road noise in a car falls with how far the windows are wound up.

Correlations can be either positive or negative. In a positive correlation, when one variable increases the other increases, and a decrease in one means a decrease in the other. A negative correlation means that they move in opposite directions. If there is no relation between the variables, the correlation is zero.

An example of a positive correlation is daylight intensity (perhaps lumens) and sun "height" over the local Earth location (as measured by the sine of the angle from the eastern horizon). At dawn, the sun is just rising, the angle is zero degrees, and the sine is zero. The dawn light is low. As the sun rises, the angle increases, the height (sine of the angle) increases, and daylight intensity increases. At midday the sun is at 90 degrees, the sine is 1, and light intensity is at maximum. Then as the sun path continues its arc, the angle from the eastern horizon increases toward 180 degrees, the sun "height" decreases (the sine of the angle decreases), and the light intensity falls. Even though over time the light rises then falls, a plot of light intensity w.r.t. sun height (sine of the angle) only shows a positive trend.

Note: The measurement of light intensity reaching the ground will be affected by the vagaries of cloud movements, changes in local humidity and atmospheric pressure, and sunspot and sun flare activity. So, although ideally there might be a classroom geometry model of a deterministic trend between light intensity and sun "height", natural vagaries would confound the trend (add noise) as revealed by actual measurements.

Correlation does not mean there is a cause-and-effect relation. The two correlated variables could both be effects of a common cause. Here are some examples: 1) Leaves falling from the trees do not make the cold weather happen. 2) Gray hair does not cause face wrinkles, nor do face wrinkles cause gray hair. 3) Because of the driver's aggressive behavior (or due to driving on a dirt/gravel path), cars that have unexpectedly low gas mileage may also have tires that wear out rapidly. This does not mean that low gas mileage causes tires to wear rapidly.

Correlation does not affirm a hypothesized mechanism. Another example: The flowers turn during the day to point to the sun. This correlation does not affirm the hypothesis that flowers have tractor beams and pull and push against the sun to make the world go around. The mantra is "Correlation is not Causation".
If there were no uncertainty on variables (no noise, no random perturbations, no natural variation, no uncontrolled factors) then either correlation or a lack of correlation would be easy to detect. However, because of variation on the data values, we need statistical tests to reveal the extent of, and confidence in, a correlation.

Correlation does not have to represent a linear trend. The trend could be quadratic, exponential, or any other. In this analysis we will only consider that the trend is consistently positive or negative, that it does not reverse, that it is monotonic. Returning to the sun position and light intensity: if light intensity is plotted w.r.t. time, initially, in the morning, the correlation is positive, then changes to negative. However, if light intensity is plotted w.r.t. sun "height" it is always a positive trend.

Correlation can also exist within a single variable, for instance when it is considered over time. Many variables are measured over time: In manufacturing products, quality metrics are observed over time in quality charts (perhaps a month-long window of daily quality values). In process control, the process might be sampled on a 100 ms interval (10 Hz frequency) and the trend over the past minute displayed on the operator's panel. If there is no trend in value over time, if the variable is constant over time, then the charts would ideally show a flatline horizontal trend. However, sample-to-sample vagaries will make the ideal trend noisy. If the influences on each of the samples are independent, then there will not be a relation between samples. However, if an influence on the prior sample has some persistence and continues its influence on the next sample, then there will be a positive autocorrelation. This can also be termed serial correlation.

As an example, reconsider sunlight intensity. If a cloud is blocking the sun, the light intensity is low. The clouds do not blink on and off, randomly, each millisecond. If a cloud is there, it remains there for a while before being blown away; the shadow persists prior to permitting the high-intensity light. Here the influence on sun intensity is the cloud. If sunlight intensity is sampled and is relatively low, the presence of the influence will persist for a while and the observation in the next second will also tend to be low. When the sky clears, sequential light intensity values will be high. If the noisy trend is actually changing in time (increasing or decreasing), this background change in the variable will also cause positive autocorrelation between samples.

In Chapter 19, on model validation, we will observe the residuals (the difference between model and data) w.r.t. a modeled variable. If the model is true to the data-generating mechanism, then the residuals should have no autocorrelation. By contrast, if the model does not properly capture the mechanism, then because of process-model mismatch, one residual will be positive (or negative), and an adjacent residual will tend to have the same sign because the model is locally in error.

In positive autocorrelation, if one sample has a high value, the next will tend to be high; and if one has a low value, the next will tend to be low. The first sample value is not the cause for the second value. The persistence of the influence on the first is the cause for the second.
13.2 Correlation Between Variables

13.2.1 Method

Equation (4.12), the Pearson product-moment correlation, provides a classic measure of correlation between two variables, x and y.
$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{\frac{1}{(n-1)} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y} \quad (13.1) $$
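For readers computing r outside of a worksheet, here is a minimal VBA sketch of Equation (13.1). The function name and array conventions are illustrative; Excel's built-in PEARSON worksheet function returns the same value.

    Function PearsonR(x() As Double, y() As Double) As Double
        Dim i As Long, n As Long
        Dim xBar As Double, yBar As Double
        n = UBound(x) - LBound(x) + 1
        For i = LBound(x) To UBound(x)    ' averages of x and y
            xBar = xBar + x(i) / n
            yBar = yBar + y(i) / n
        Next i

        Dim sxy As Double, sxx As Double, syy As Double
        For i = LBound(x) To UBound(x)    ' sums of products of deviations
            sxy = sxy + (x(i) - xBar) * (y(i) - yBar)
            sxx = sxx + (x(i) - xBar) ^ 2
            syy = syy + (y(i) - yBar) ^ 2
        Next i
        PearsonR = sxy / Sqr(sxx * syy)
    End Function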
The values are paired, and the subscript i is a number index for the pairing. The data can be represented as in Table 13.1. It does not matter whether variable x is to the left or right of y in the table. It does not matter whether the index is a sequence of numbers, or letters, or category labels. It does not matter whether the rows are reorganized (such as listing in reverse order), or if two sets of pairs are switched. What matters is that 1) the variable values are rationally paired, 2) there is no missing data in only one column when using Equation (13.1), and 3) the variance of x and y are nearly the same at all of their values. (If there are missing data values in one column, eliminate the extra value on the other variable and decrement n.)

TABLE 13.1 Variables for Correlation Testing

Index    Variable y    Variable x
1        y1            x1
2        y2            x2
3        y3            x3
…        …             …
i        yi            xi
…        …             …
n        yn            xn

If there is zero correlation, then when x is above its average, y will be below its average as frequently as it is above. Then the numerator sum in Equation (13.1) will have an approximately equal number of "+" and "−" values and will tend to remain about a value of zero. If there is positive correlation, then when x is above its average, y will tend to be above its average also, and terms in the numerator will tend to be positive. Alternately, if there is negative correlation, then when x is above its average, y will tend to be below its average, and terms in the numerator will tend to be negative. When scaled by the standard deviations, the range of the correlation statistic is −1 ≤ r ≤ +1. If r is about zero there is very little correlation. The closer the value of r gets to ±1, the stronger is the evidence for correlation.

13.2.2 An Illustration

As a data-based illustration, Table 13.2 shows part of the data from a study of undergraduate student grades in chemical engineering. The column labeled "Major GPA" is a particular student's grade point average in their upper-level major classes, and the column labeled "STEM GPA" is the grade point average for the same student in all of their first- and second-year STEM classes (STEM is the acronym for Science, Technology, Engineering, and Mathematics). The hope from the study was to provide indicators to the students transitioning from the second to third year about what they might expect, and to encourage those who might not have adequate ability or preparation to either self-select an alternate major or improve preparation prior to starting the upper-level major classes. Whether the students are listed in alphabetical order, age, number of letters in their name, or distance of their home from the university is irrelevant. The number representing student ID was actually randomly assigned to prevent any traceability to a particular student.

TABLE 13.2 A Portion of Student Performance Assessment Data

Student ID    STEM GPA    Major GPA
1             3.804348    4
2             3.058824    1.971429
3             2.926829    2.285714
4             2.686275    2.914286
5             3.780488    3.714286
6             2.5         2.5
7             3.764706    3.8
8             2.705882    2.771429
9             3.313725    3.142857
10            3.28125     3.085714
11            3.431373    3.714286
12            3.680851    3.6
13            2.942308    3.029412
14            3.512195    2.6
15            3.511628    3.852941
16            3.06383     3.142857
17            3.361702    3.314286
18            3.392157    2.542857
19            3.392157    2.142857
20            2.365385    1.714286

Using Equation (13.1) the correlation coefficient value is r = 0.694. With 20 data pairs, this is a strong indicator of a relation. But also, it is not a perfect correlation.

In our study we actually explored many metrics of academic performance in the first two college years to see what metrics would best forecast performance in the upper-level major courses. Table 13.3 shows more of the variables for the same 20 students. For convenience in this presentation, the data are rounded. The second-to-last column is the same Major GPA of Table 13.2, and the last column was another key factor, the integer number of D or F grades (unsatisfactory grades) the student earned in the major classes. The other columns are other metrics. "Avg ENSC" is the student's average grade in three lower-level engineering science courses. "CHE 2033" is the lower-level introduction to chemical engineering (material and energy balances). "Avg PhysII CalcIII" is the average of the student's grades in Physics II (light and magnetism) and Calculus III (multivariable calculus). "Adv Chem Lab" is the student's grade in either organic chemistry or biochemistry lab.
TABLE 13.3 Student Performance Assessment Data

ID   STEM GPA   Avg ENSC   CHE 2033   Avg PhysII CalcIII   Adv Chem Lab   Adv Chem   # repeats   Programming   Major GPA   Major #D&F
1    3.80       4.00       4          4.00                 4              2          1           4             4.00        0
2    3.06       2.33       2          3.67                 3              1          2           2             1.97        7
3    2.93       3.00       2          3.00                 2              3          0           4             2.29        2
4    2.69       3.33       2          2.33                 3              2          5           3             2.91        2
5    3.78       3.67       4          4.00                 4              3          0           4             3.71        0
6    2.50       2.67       3          2.00                 2              2          1           3             2.50        0
7    3.76       4.00       4          3.67                 4              3          0           3             3.80        0
8    2.71       2.67       2          2.33                 3              2          2           3             2.77        0
9    3.31       3.00       4          3.67                 4              3          0           3             3.14        0
10   3.28       3.33       4          4.00                 3              2          1           4             3.09        0
11   3.43       4.00       4          3.33                 3              2          0           4             3.71        0
12   3.68       3.33       3          4.00                 4              3          0           4             3.60        0
13   2.94       2.33       3          3.00                 3              2          3           4             3.03        0
14   3.51       3.33       3          4.00                 3              2          0           4             2.60        1
15   3.51       4.00       4          3.50                 4              3          2           4             3.85        0
16   3.06       3.00       3          3.33                 3              2          0           4             3.14        0
17   3.36       3.00       4          3.33                 3              3          0           4             3.31        0
18   3.39       3.67       3          3.67                 3              2          0           3             2.54        0
19   3.39       3.33       3          3.67                 3              3          1           3             2.14        2
20   2.37       2.33       3          2.33                 2              2          9           2             1.71        5
"Adv Chem" is the student's grade in their advanced chemistry elective. "# repeats" is the number of times a student repeated lower-level STEM courses to get a passing grade. And "Programming" is the computer programming course. In the initial study we actually considered many other possible metrics, such as grades in English, or social science courses, or general chemistry, or Calculus II, but these had very little correlation. Further, several courses had similar correlation impact on the outcome and were also strongly inter-correlated, so they are presented here as averages.

There are 10 data columns in Table 13.3, permitting 45 comparison combinations, or 55 if a column is compared to itself. The Excel Data Analysis Add-In "Correlation" analyzes all 55. Table 13.4 presents the results, with values rounded to 2 decimal digits. The values of 1 along the main diagonal indicate that the variable is compared to itself. Expectedly, the correlation is the perfect +1. The upper right of the table would have exactly the same values as the lower left. Equation (13.1) reveals that reversing the x and y variables does not change the r-value. So, only half of the full table is presented. Some r-values are negative, indicating negative correlation.

Like ANOVA, the correlation study does not necessarily indicate cause and effect, but it does indicate strong correlations that should become a clue to closer mechanistic analysis. The classifications of STEM GPA, Avg ENSC, CHE 2033, Adv Chem Lab, and Programming each have strong correlation to Major GPA (r-values are 0.69, 0.71, 0.69, 0.78, and 0.66), and are the ones that should be viewed as strong indicators of student success. Further, the students' performance in Programming has a strong correlation to Major #D&F as indicated by r = −0.71. This is a negative correlation, meaning that a high programming grade correlates to few D and F grades in the Major. These observations could provide clues as to the innate student attributes that are key to the desired success.
TABLE 13.4 Correlation Results of Data in Table 13.3, Rounded Values

                     STEM    Avg     CHE     Avg PhysII   Adv Chem   Adv     #         Program-   Major   Major
                     GPA     ENSC    2033    CalcIII      Lab        Chem    repeats   ming       GPA     #D&F
STEM GPA             1
Avg ENSC             0.77    1
CHE 2033             0.64    0.58    1
Avg PhysII CalcIII   0.90    0.54    0.51    1
Adv Chem Lab         0.80    0.60    0.55    0.67         1
Adv Chem             0.44    0.39    0.42    0.25         0.39       1
# repeats            −0.66   −0.45   −0.32   −0.59        −0.39      −0.34   1
Programming          0.51    0.46    0.42    0.41         0.30       0.37    −0.56     1
Major GPA            0.69    0.71    0.69    0.43         0.78       0.39    −0.45     0.66       1
Major #D&F           −0.42   −0.51   −0.53   −0.18        −0.41      −0.45   0.56      −0.71      −0.71   1
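A sketch of what the Excel "Correlation" tool does: apply Equation (13.1) to every pair of columns and report the lower triangle. It assumes the PearsonR function sketched earlier and data stored column-wise in a 2-D array; both the names and the storage convention are illustrative, not from the text.

    Sub CorrelationMatrix(data() As Double)    ' data(1 To n, 1 To m)
        Dim i As Long, j As Long, k As Long, n As Long, m As Long
        n = UBound(data, 1): m = UBound(data, 2)
        Dim colJ() As Double, colK() As Double
        ReDim colJ(1 To n): ReDim colK(1 To n)
        For j = 1 To m
            For k = 1 To j    ' lower triangle only; the matrix is symmetric
                For i = 1 To n
                    colJ(i) = data(i, j)
                    colK(i) = data(i, k)
                Next i
                Debug.Print j; k; Format(PearsonR(colJ, colK), "0.00")
            Next k
        Next j
    End Sub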
13.2.3 Determining Confidence in a Correlation

There seem to be several approaches to approximating a confidence interval, critical values, or a p-value on the Pearson correlation statistic of Equation (13.1). In one method, calculate a T-statistic from the data results:

$$ T = r \sqrt{\frac{n - 2}{1 - r^2}} \quad (13.2) $$
The hypothesis is that there is no correlation, so that the ideal r-value and calculated T is zero. Then correlation will be accepted if r is either large positive or large negative. Then use the two-sided t-distribution with ν = (n − 2) degrees of freedom. Alternately, in Fisher's method, calculate a modified z-statistic:

$$ z' = \frac{1}{2} \ln\left(\frac{1 + r}{1 - r}\right) \quad (13.3) $$
which is approximately normally distributed with a variance of σ² = 1/(n − 3). Neither method is perfect, but both seem to approach the respective distributions rapidly with increasing n and are in reasonable agreement with each other.

Example 13.1: Compare Equations (13.2) and (13.3) in determining a p-value for the correlation of the Adv Chem Lab grade to the Major GPA from Table 13.3.

The correlation statistic is r = 0.78, and n = 20 sample pairs.

Using Equation (13.2) the data T-value is T = 0.78√((20 − 2)/(1 − 0.78²)) = 5.288…, and with ν = (20 − 2) = 18 degrees of freedom, the p-value is 0.00005.

Using Equation (13.3) the modified z-statistic is z′ = ½ ln((1 + r)/(1 − r)) = 1.045…, and with σ = 1/√(20 − 3) = 0.2425…, the p-value is 0.000016.

The two approximations do not return exactly the same value, but both agree that if there were no correlation it is very improbable (about one out of 20,000 or one out of 61,000 chance) that the data could have generated an r-value so large. We reject the null hypothesis of no relation. We claim that the Adv Chem Lab grade is a strong indicator of the Major GPA, with a confidence greater than 99.99%.

Example 13.2: Compare Equations (13.2) and (13.3) in determining a p-value for the correlation of the Avg PhysII CalcIII grade to the #D&F in the Major from Table 13.3.

The correlation statistic is r = −0.18, and n = 20 sample pairs.

Using Equation (13.2) the data T-value is T = (−0.18)√((20 − 2)/(1 − (−0.18)²)) = −0.776…, and with ν = (20 − 2) = 18 degrees of freedom, the two-sided p-value is 0.4476….

Using Equation (13.3) the modified z-statistic is z′ = ½ ln((1 + r)/(1 − r)) = −0.1819…, and with σ = 1/√(20 − 3) = 0.2425…, the p-value is 0.4530….

Again, the two approximations do not return exactly the same value, but both agree that if there were no correlation it is quite possible that the data could have generated such an r-value. We conclude that there is little evidence to reject the null hypothesis. We accept that there is inadequate evidence to be able to claim a significant correlation between the Avg PhysII CalcIII grade and the #D&F in the Major.
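A minimal VBA sketch of both approximations, as used in Examples 13.1 and 13.2, follows. The procedure name is illustrative; T_Dist_2T and Norm_S_Dist require the Excel host, and VBA's Log is the natural logarithm.

    Sub CorrelationPValues(r As Double, n As Long)
        Dim t As Double, pT As Double
        t = r * Sqr((n - 2) / (1 - r ^ 2))    ' Equation (13.2)
        pT = WorksheetFunction.T_Dist_2T(Abs(t), n - 2)

        Dim z As Double, sigma As Double, pZ As Double
        z = 0.5 * Log((1 + r) / (1 - r))    ' Equation (13.3)
        sigma = 1 / Sqr(n - 3)
        pZ = 2 * (1 - WorksheetFunction.Norm_S_Dist(Abs(z) / sigma, True))

        Debug.Print "T ="; t; " p ="; pT
        Debug.Print "z' ="; z; " p ="; pZ
    End Sub

    ' Example 13.1: CorrelationPValues 0.78, 20   -> p ~ 0.00005 and 0.000016
    ' Example 13.2: CorrelationPValues -0.18, 20  -> p ~ 0.448 and 0.453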
13.3 Autocorrelation

13.3.1 Method

Equation (4.13), reproduced here as (13.4), indicates how to calculate the autocorrelation statistic of lag-1, between sequential values. And Equation (13.5) indicates the general case of a lag-k, between values that are k intervals apart.

$$ r_1 = \frac{\sum_{i=2}^{n} r_i r_{i-1}}{\sum_{i=1}^{n} r_i^2} = \frac{\frac{1}{(n-1)} \sum_{i=2}^{n} (x_i - \bar{x})(x_{i-1} - \bar{x})}{s_x^2} \quad (13.4) $$

$$ r_k = \frac{\sum_{i=k+1}^{n} r_i r_{i-k}}{\sum_{i=1}^{n} r_i^2} = \frac{\frac{1}{(n-1)} \sum_{i=k+1}^{n} (x_i - \bar{x})(x_{i-k} - \bar{x})}{s_x^2} \quad (13.5) $$
The variable ri in the sums is termed a residual. It could be the difference between model and data, but here it is the difference between data and average. Note that there are (n − k) terms in the numerator sum, and n (a greater number of) terms in the denominator sum. The (n − 1) coefficient is for the denominator translation from the sum of squared deviations to the variance. The method assumes that the data variance is uniform throughout the n values. The range of the autocorrelation statistic is approximately −1 < rk < +1. The statistic is normally distributed, but its mean is not zero. The mean is ≈ −1/(n − k). If rk is near to −1/(n − k) there is no evidence of autocorrelation.

Note: Unfortunately, in statistics there are too many variables with the same symbol, r. Take care.

The data can be represented as in Table 13.5. Contrasting Table 13.1 to Table 13.5: In Table 13.5 1) there is only one variable, 2) the index reveals sequential order in time, or spatial position, or the value of another variable, and 3) you must preserve the order (you cannot interchange some of the rows, but you can list the data in reverse order from n to 1). Similar to Table 13.1, in Table 13.5 the variance on the x-values needs to be nearly the same at all of their values. To visualize autocorrelation, you can plot xi w.r.t. xi−1 (see Figure 13.1).

13.3.2 An Autocorrelation Illustration

Table 13.6 presents a window of time series data representing orifice-measured flow rate sampled at 10 Hz (ten times per second).
TABLE 13.5 A Variable Sequence for Autocorrelation Testing

Index    Variable x
1        x1
2        x2
3        x3
…        …
i        xi
…        …
n        xn
FIGURE 13.1 A visual reveal of autocorrelation with data from Table 13.6.
There is a filter on the transmitter to temper noise (fluctuations due to flow turbulence). Even though turbulence in the flowing fluid should provide random fluctuations at this time interval, the "averaging" by a first-order filter retains some of the prior values, creating autocorrelation. There is also a rounding of the values to increments of 0.5 for digital presentation, which can contribute to autocorrelation if the variation range is not much larger than the discretization interval.

Figure 13.1 is a plot of the immediate prior data value w.r.t. the current value. Although there are 20 data values, there are only 19 comparisons on the graph. The general diagonal trend reveals autocorrelation. If one value is high, the next tends to be high. If one is low, the next tends to be low. From Equation (13.4) the r-lag-1 value is r1 = 0.6262…. From Equation (13.5) the r-lag-2 and r-lag-3 values are r2 = 0.2886… and r3 = −0.05319….
TABLE 13.6 A Sample of Time-Series Data

Sample number    Flow rate (cu ft/min)
1                 9.85
2                 9.80
3                 9.65
4                10.05
5                10.15
6                10.05
7                10.05
8                10.00
9                 9.90
10               10.05
11               10.05
12               10.05
13                9.80
14                9.85
15                9.75
16                9.90
17               10.30
18               10.55
19               10.40
20               10.55
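Here is a minimal VBA sketch of the lag-k autocorrelation statistic of Equation (13.5); the 1/(n − 1) factors cancel in the ratio, so only the two sums are needed. The function name and array convention are illustrative. Applied to the Table 13.6 values, it should approximately reproduce the r-lag values quoted above (small differences can arise from rounding of the displayed data).

    Function AutoCorr(x() As Double, k As Long) As Double
        Dim i As Long, n As Long, xBar As Double
        n = UBound(x) - LBound(x) + 1
        For i = LBound(x) To UBound(x)
            xBar = xBar + x(i) / n
        Next i
        Dim num As Double, den As Double
        For i = LBound(x) To UBound(x)
            den = den + (x(i) - xBar) ^ 2    ' n terms in the denominator
            If i - k >= LBound(x) Then       ' n - k terms in the numerator
                num = num + (x(i) - xBar) * (x(i - k) - xBar)
            End If
        Next i
        AutoCorr = num / den
    End Function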
13.3.3 Determining Confidence in Autocorrelation

There seem to be commonly accepted approximations to the distribution of the r-lag-k statistic. Under the null hypothesis (no autocorrelation) and with a large enough number of samples (n − k > ~10), the statistic z' is approximately the standard normal statistic (mean of zero and variance of unity).

$$ z' = \frac{1 + r_k (n - k)}{\sqrt{n - k - 1}} \quad (13.6) $$
Example 13.3: Using the r-lag-k values from Table 13.6, calculate the p-values for lags 1, 2, and 3.

From Equation (13.6) the z' values are 3.040…, 1.502…, and 0.0239…. And from the standard normal distribution, the corresponding p-values are 0.0023…, 0.1329…, and 0.9809…. There is strong evidence that autocorrelation exists for lag-1 (99.7%), and modest support to reject the zero-autocorrelation hypothesis for lag-2 (86.7%). But there is not enough evidence to confidently claim that autocorrelation persists through the third following sample.
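A sketch of the Example 13.3 computation follows: Equation (13.6) plus a two-sided p-value from the standard normal distribution. The function name is illustrative, and Norm_S_Dist requires the Excel host.

    Function AutoCorrPValue(rk As Double, n As Long, k As Long) As Double
        Dim z As Double
        z = (1 + rk * (n - k)) / Sqr(n - k - 1)    ' Equation (13.6)
        AutoCorrPValue = 2 * (1 - WorksheetFunction.Norm_S_Dist(Abs(z), True))
    End Function

    ' e.g., AutoCorrPValue(0.6262, 20, 1) returns ~0.0024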
As an alternate, for large (n − k), Equation (13.6) reduces to

$$ z' \cong r_k \sqrt{n} \quad (13.7) $$
13.4 Takeaway

Correlation is not causation. If there is a strong correlation between two variables, it might mean that they are both effects of a common cause. Correlation analysis is a simple way to identify strong relations, as a clue about postulating, affirming, or rejecting mechanistic cause-and-effect conjectures.

Correlation does not necessarily mean that there is a linear relation between variables. However, the square of the correlation statistic r is the same value as the r-squared correlation coefficient of a linear regression between the variables.

The methods of this chapter require uniform variance throughout the range of any one variable. If the variance is not uniform, then the data in the large-variance region will dominate and diminish the effects of the other data.
13.5 Exercises
1. Test a few of the columns in Table 13.3 to see if you get the same values of the correlation r in Table 13.4.
2. Test a few rearrangements of the data order to see if the correlation coefficient value changes.
3. Plot the data in Table 13.6 for Lag-2 and Lag-3 to visually reveal the autocorrelation and the lack of autocorrelation with one and with two samples between data.
14 Steady State and Transient State Identification in Noisy Processes
14.1 Introduction

Identification of both steady-state (SS) and transient state (TS) in noisy time- or sequence-based signals is important. SS models are widely used in process control, online process analysis, and process optimization; and, since manufacturing and chemical processes are inherently nonstationary, selected model coefficient values need to be adjusted frequently to keep the models true to the process and functionally useful. Additionally, detection of SS triggers the collection of data for process fault detection, data reconciliation, neural network training, the end of an experimental trial (when you collect data and implement the next set of conditions), etc. But either the use of SS models for process analysis or their data-based adjustment should only be triggered when the process is at SS.

In contrast, transient, time-dependent, or dynamic models are also used in control, forecasting, and scheduling applications. Dynamic models have coefficients representing time-constants and delays, which should only be adjusted to fit data from transient conditions. Detection of TS triggers the collection of data for dynamic modeling. Additionally, detection of TS provides recognition of points of change, wake-up data recording, the beginning of a process response to an event, interruptions to the norm, etc.

Characteristic of these online real-time applications is that we only have past and current data. We do not have the future points. In many filtering applications such as image and recording enhancement and detection, filtering is done offline after all the data have been collected. We have access to both the before and after data and can use kernel-type filtering approaches to use both sides to better estimate the in-between value. By contrast, in real-time applications we need to make a decision now, without seeing the future.

Our experience has been applying SSID and TSID to chemical processes, which are characterized by time-constants on the order of 1 second to 1 hour; multivariable (coupled and nonlinear) behavior; noise of several types, with mild and short-lived autocorrelation; variance that changes with operating conditions; controllers and final-element dead-band that can cause oscillation; and flatlining measurements that are not uncommon (for any number of reasons such as maintenance, sensor failure, or data discretization). Additionally, control computers are inexpensive, and the operators typically have education at the associate degree level; both aspects require simplicity in algorithms. The approaches presented here might not be right if mission criticality can afford either powerful computers or highly educated operators, or if rotating machinery creates a cyclic response as a background oscillation to the steady signal.
If a process signal were noiseless, then SS or TS identification would be trivial. At SS there is no change in data value. Alternately, if there is a change in data value, the process is in a TS. However, since process variables are usually noisy, the identification needs to "see" through the noise and should announce probable SS or probable TS situations, as opposed to definitive SS or definitive TS situations. The method also needs to consider more than the most recent pair of samples to confidently make any statement. Since the noise could be a consequence of autocorrelated trends (of infinite types), varying noise amplitude (including zero), individual spikes, non-Gaussian noise distributions, or spurious events, a useful technique also needs to be robust to such aspects.

A process might not need to be exactly "flat-lined" to be considered at SS. For practical purposes, a very small trend or oscillation might have a negligible impact on SS data uses. Finally, in observing data, a process might appear to be at SS due to measurement discrimination intervals, when in fact it is changing but the change has not exceeded the data discretization interval. Time, as an example, continually progresses; but a digital watch only updates the screen on one-minute intervals. Time is not at a SS between the numerical display change events.

14.1.1 Approaches and Issues to SSID and TSID

A conceptually simple approach to identify SS would be to look at data values in a recent time-window and, if the range between high and low is acceptably small, declare SS. This approach, however, requires the human to decide the acceptable range for each variable and the time-window duration; and, if the process noise level changes, the threshold should change. Although the approach is simple to understand, easy to implement, and often works acceptably, it is not a universal approach.

Another straightforward implementation of a fully automated method for SSID would be a statistical test of the slope of a linear trend in the time series of a moving window of data. This technique is a natural outcome of traditional statistical regression training. Here, at each sampling, use linear regression to determine the best linear trend line for the past N data points. If the t-statistic for the slope exceeds the critical value, then there is sufficient evidence to confidently reject the SS hypothesis and claim the process is probably in a TS. A nice feature of this approach is that the determination is independent of the noise amplitude. However, at the crest or trough of an oscillation centered in the N-data window, this slope would be nearly zero, and SS would be claimed during the TS. The approach is also somewhat of a computational burden.

Another straightforward approach is to evaluate the average value in successive data windows. Compute the average and standard deviation of the data in successive datasets of N samples, then compare the two averages with a t-test. If the process is at SS, ideally, the averages are equal, but noise will cause the sequential averages to fluctuate. If the fluctuation is excessive relative to the inherent data variability, then the t-statistic (difference in averages divided by the standard error of the average) will exceed the critical t-value, and the null hypothesis (process is at SS) can be confidently rejected to claim the process is probably in a TS. Again, however, when windows are centered on either side of the crest or trough in an oscillation, the averages will be similar, and SS will be falsely accepted. A solution could be to use three or four data windows, each with a unique N, to prevent possible matching to a periodic oscillation.

Many other approaches have been used, including Statistical Process Control techniques, dual filters, runs tests, wavelets, polynomial interpolation, thresholds on variance or slope, closure of SS material and energy balances, and variance ratios in the data.
Note that these methods reject the null hypothesis, which is a useful indicator that the process is confidently in a TS. But not rejecting SS does not permit a confident statement that the process is at SS. A legal judgment of "Not Guilty" is not the same as a declaration of "Innocent". "Not Guilty" means that there was not sufficient evidence to confidently claim "Guilty" without a doubt. Accordingly, there needs to be a dual approach that can confidently reject TS to claim probable SS, as well as reject SS to claim probable TS. This chapter presents several dual approaches. Further, such conventional tests have a computational burden that does not make them practicable online, in real time, within most process control computers. A practicable method needs to be computationally simple, robust to the vagaries of process events, easily implemented, and easily interpreted.
14.2 Ratio of Variances Methods

Von Neumann (von Neumann, J., "Distribution of the ratio of the mean square successive difference to the variance", The Annals of Mathematical Statistics, 1941, 12, 367–395) and Crowe et al. (Crowe, E. L., F. A. Davis, and M. W. Maxfield, Statistics Manual, Dover Publications, New York, NY, 1955) proposed an approach that calculates the variance on a dataset by two approaches – the mean-square deviation from the average and the mean-square deviation between successive data. The ratio of variances is an F-like statistic, and assuming no autocorrelation in the data, it has an expected value of unity when the process is at SS. The filter approach of Cao, S., and R. R. Rhinehart ("An Efficient Method for On-Line Identification of Steady-State", Journal of Process Control, Vol. 5, No. 6, 1995, pp. 363–374) is similar, but computationally simpler.

Begin with this conceptual model of the phenomena: The true process variable (PV) is at a constant value (at SS) and fluctuations on the measurement and signal transmission process create uncorrelated "noise", independently distributed fluctuations on the measurement. Such random measurement perturbations could be attributed to mechanical vibration, stray electromagnetic interference in signal transmission, thermal electronic noise, flow turbulence, etc. Alternately, the "noise" could represent process fluctuations resulting from nonideal fluid mixing, multiphase mixtures in a boiling situation, or crystal size or molecular weight variations that create temporal changes to the local measurement. If the noise distribution (mean and variance) were uniform in time, then statistics would classify this time series as stationary. However, for a process, the true value, nominal value, or average may be constant in time, but the noise distribution may change. So, SS does not necessarily mean stationary in a statistical sense of the term.

The first hypothesis of this analysis is the conventional null hypothesis that the process is at SS, H0: SS. The statistic, a ratio of variances, will ideally have a value of unity, but due to the vagaries of noise, it will have a distribution of values at SS. As long as the Ratio-statistic value is within the normal range of the SS distribution of values, the null hypothesis cannot be rejected. When the R-statistic has an extreme value, then the null hypothesis can be rejected with a certain level of confidence, and probable TS claimed.

By contrast, there is no single conceptual model of a TS. A transient condition could be due to a ramp change in the true value, or an oscillation, or a first-order transient to a new value, or a step change, etc. Each is a unique type of transient. Further, each single transient event type has unique characteristics such as ramp rate, cycle amplitude and frequency, and time-constant and magnitude of change. Further, a transient could be comprised of any combination or sequence of the not-at-SS events. Since there is no unique model for TS, there can be no null hypothesis, or corresponding unique statistic, that can be used to reject the TS hypothesis and claim probable SS. Accordingly, an alternate approach needs to be used to claim probable SS. The alternate approach used here is to take a transient condition which is barely detectable or decidedly inconsequential (per human judgment) and set the probable SS threshold for the R-statistic as an improbably low value, but not so low as to be improbably encountered when the process is truly at SS.

So, there are two one-sided tests, one to reject SS and one to reject TS, and two critical values as illustrated in Figure 14.1, where the vertical axis is the CDF, and the horizontal axis is the SS Ratio-statistic. The solid curve is the distribution of R-values when at SS and is centered on a value of 1. The dashed curve is the distribution when not at SS, but nearly so, and is centered on about 1.7. The vertical dashed lines are the critical values. The right-most line is the trigger to reject SS. It intersects the at-SS CDF at a value of about 0.95. There is a 5% chance that an at-SS process will generate an R-value greater than 2. However, there is about a 35% chance (1 − 0.65) that the nearly-but-not-at-SS process will generate an R-value greater than 2. The left-most line is the trigger to reject TS. It intersects the at-SS CDF at a value of about 0.25. There is a 25% chance that an at-SS process will generate an R-value less than 0.8. However, there is only about a 1% chance that the nearly-but-not-at-SS process will generate an R-value less than 0.8. In this illustration, the odds of correctly rejecting SS are 0.35/0.05 = 7:1, and the odds of correctly accepting SS are 0.25/0.01 = 25:1. However, in most transient conditions the dashed not-at-SS curve is further to the right, and both odds are better than illustrated.

FIGURE 14.1 Illustration of dual rejection regions. The vertical axis is the CDF, and the horizontal axis is the SS Ratio-statistic. The solid curve is the F-like distribution of R-values when at SS. The dashed curve is the F-like distribution when not at SS.

14.2.1 Filter Method

FIGURE 14.2 Filter method concepts.

Figure 14.2 illustrates the filter method concept to create an R-statistic. The markers represent process measurements over the 100 sequential samples indicated on the horizontal axis. The process value starts at about 10, ramps to a value of about 15, and then holds steady. The true trend is unknowable, only the measurements can be known, and they are
FIGURE 14.2 Filter method concepts.
infected with noise-like fluctuations. The solid line is a first-order filtered value of the data. It starts at about 10 then lags behind the process, and at the end of the chart finally settles to a value representing the process level of 15. The filtered value is not smooth but reveals wiggles due to the high and low vagaries of the data. The method first calculates a filtered value of the process measurements, then the variance in the data is measured by two methods. One is based on the difference between the measurement and the filtered trend. The other is based on deviations between sequential data measurements. If the process is at SS, as illustrated in the 0–10 to and 90–100 time periods, the filtered value, Xf, remains almost in the middle of the data. Then a process variance, s21, estimated by differences between data and filtered value will ideally be equal to the true value of σ2. The variance can also be estimated by the data-to-data (not average-to-data) differs2 ences, s2 2 . Then the ratio of the variances, r = 2 1 will be approximately equal to unity. s2 Alternately, if the process is in a TS, such as in the 20–60 time period, then Xf is not the middle of the data, the filtered value lags behind the process, and the variance as measured by the data-to-filter difference will be much larger than the variance as estimated by sequential data differences, s21 s2 2 , and ratio will be much greater than unity. To minimize computational burden, in this method a filtered value (not an average) provides an estimate of the data mean:
X f , i = l1 Xi + ( 1 - l1 ) X f , i -1 (14.1)
X = the process variable Xf = Filtered value of X λ1 = Filter factor i = Time sampling index Note: Alternate terms for the filtered value are first-order filter, first-order lag, and exponentially weighted moving average.
282
Applied Engineering Statistics
The first method to obtain a measure of the variance uses an exponentially weighted moving “variance” (another first-order filter) based on the difference between the data and the filtered value, representing the average:
u 2 f , i = l2 ( Xi - X f , i -1 ) + ( 1 - l2 )u 2 f , i -1 (14.2) 2
υ2f, i = Filtered value of a measure of variance based on differences between data and filtered values υ2f, i - 1 = Previous filtered value In Equation (14.2), the symbol ν2 is a measure of the variance to be used in the numerator of the ratio statistic. Because it is calculated form the filtered value, not the average, ν is actually a bit larger than σ. The previous value of the filtered measurement is used instead of the most recently updated value to prevent autocorrelation from biasing the variance estimate, v2f,i, keeping the equation for the ratio relatively simple. Equation (14.2) does not provide the variance, even at SS, because using the filtered Xf, rather than the true average, adds a bit of variability to the difference ( X i - X f , i -1 ) . The second method to obtain a measure of variance is an exponentially weighted moving “variance” (another filter) based on sequential data differences:
$$ \delta_{f,i}^2 = \lambda_3 \left( X_i - X_{i-1} \right)^2 + (1 - \lambda_3)\, \delta_{f,i-1}^2 \quad (14.3) $$

δ²f,i = filtered value of a measure of variance
δ²f,i−1 = previous filtered value
R=
( 2 - l1 ) v2 f ,i (14.4) d 2 f ,i
Since Equations (14.2) and (14.3) compute a measure of the variance, not the true variance, the ( 2 - l1 ) coefficient in Equation (14.4) is required to scale the ratio, to represent the classic variance ratio. At SS it will fluctuate about a value of unity, during a TS, it will fluctuate about larger values. The calculated R-value is to be compared to its two critical values to determine SS or TS. Complete executable code, including initializations, is presented in VBA in the software on the site www.r3eda.com. The essential assignment statements for Equations (14.1) to (14.4) are: nu2f = l2 * (x - xf) ^ 2 + cl2 * nu2f xf = l1 * x + cl1 * xf delta2f = l3 * (x - x_old) ^ 2 + cl3 * delta2f x_old = x R_Filter = (2 - l1) * nu2f/delta2f
Equation (14.2) Equation (14.1) Equation (14.3) Update prior value Equation (14.4)
Identification in Noisy Processes
283
The coefficients l1, l2, and l3 represent the filter lambda values, and the coefficients cl1, cl2, and cl3 represent the complementary values. cl1 = 1 - l1. Equations (14.2) and (14.1) are calculated in reverse order so that the prior xf value does not need to be stored. The five computational lines of code of this method require direct, no-logic, low storage, and low computational operation calculations. In total there are four variables and seven coefficients to be stored, ten multiplication or divisions, five additions, and two logical comparisons per observed variable. Without prior knowledge of the value for σ, initialize filtered values with 0. This is more convenient than initializing them with a more representative value from recent past data. This leads to an initial not-at-SS value of the initial R-statistic, but after about 35 samples the initial wrong values are incrementally updated with representative values. Being a ratio of variances, the statistic is scaled by the inherent noise level in the data. It is also independent of the dimensions chosen for the variable. Critical values for the R-statistic, based on the process being at SS with independent and identically distributed variation (white noise), were also developed by Cao and Rhinehart (Cao, S., and R. R. Rhinehart, “Critical Values for a Steady-State Identifier,” Journal of Process Control, Vol. 7, No. 2, 1997, pp. 149–152), who suggest that filter values of λ1 = 0.2 and λ2 = λ3 = 0.1 produce the best balance of Type-I and Type-II errors. The null hypothesis is that the process is at SS. If the computed R-statistic is greater than R-critical (a value of about 2.5) then we are confident that the process is not at SS. However, if the R-value is a bit less than the upper critical value, the process may be in a mild TS. Consequently, a value of R less than or equal to a lower R-critical value, ~0.9, means the process may be at SS (Shrowti, N., K. Vilankar, and R. R. Rhinehart, “Type-II Critical Values for a Steady-State Identifier”, Journal of Process Control, Vol. 20, No. 7, pp. 885–890, 2010). Often we assign values of either “0” or “1” to a variable, SS, which represents the state of the process. If R-calculated > R-critical1 ~2.5, “reject” SS, assign SS = 0. Alternately, if R-calculated < R-critical2 ~.9, “accept” that the process may be at SS and assign SS = 1. If in-between values happen for R-calculated, hold the prior 0 or 1 (reject or accept) state, because there is no confidence in changing the most recent declaration. The method presumes no autocorrelation in the time series of the process measurement data at SS. This is ensured by selection of the sampling time interval, which may be longer than the control interval. Running in real time, the identifier does not have to sample at the same rate as the controller. 14.2.2 Choice of Filter Factor Values The filter factors in Equations (14.1), (14.2), and (14.3) can be related to the number of data (the length of the time window) effectively influencing the average or variance calculation. Simplistically, the effective number of data in the window N = 1/λ. If λ = 0.1 then effectively the method is observing N = 10 most recent data points. However, based on a first-order decay, long past data retain some influence, and roughly, the number of data effectively influencing the window of observation is about 3.5/λ. If λ = 0.1 then effectively the method is remembering the impact of N ≈ 35 data points. However, not all data points have equal weighting. 
14.2.2 Choice of Filter Factor Values

The filter factors in Equations (14.1), (14.2), and (14.3) can be related to the number of data (the length of the time window) effectively influencing the average or variance calculation. Simplistically, the effective number of data in the window is N = 1/λ: if λ = 0.1, the method is effectively observing the N = 10 most recent data points. However, based on a first-order decay, long-past data retain some influence, and roughly the number of data effectively influencing the window of observation is about 3.5/λ: if λ = 0.1, the method is effectively remembering the impact of N ≈ 35 data points. In either view, not all data points have equal weighting.

The filter is a first-order decay, an exponentially declining weighting of past data. The long-past data retain some influence, but the collective fractional influence of data older than N samples is (1 − λ)^N. With λ = 0.1 and N = 35, the old data only have about a 0.025 fractional influence. Larger λ values mean that fewer data are involved in the analysis, which has the benefit of reducing the time for the identifier to catch up to a process change, reducing the average run length (ARL) to a decision.
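A quick numeric check of these window-length claims (a sketch; rounding N to the nearest integer is our choice):

```python
# Collective fractional influence of data older than N samples in a first-order filter
for lam in (0.05, 0.1, 0.2):
    N = round(3.5 / lam)                 # effective window length cited in the text
    print(f"lambda={lam}: N={N}, residual influence={(1 - lam) ** N:.3f}")
# lambda=0.1 gives N=35 and a residual of about 0.025, matching the text
```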
But larger λ values have the undesired impact of increasing the variability of the statistic, confounding interpretation. The reverse is also true: smaller λ values undesirably increase the ARL to detection but increase precision (minimizing statistical errors). Originally, Cao and Rhinehart recommended the values λ1 = 0.2 and λ2 = λ3 = 0.1, effectively meaning that the most recent 35 data points are used to calculate the R-statistic value. Since then, we usually recommend λ1 = λ2 = λ3 = 0.1 for convenience and effectiveness. However, λ1 = λ2 = λ3 = 0.05 will have less uncertainty, but a somewhat longer ARL. These choices are not critical.

14.2.3 Critical Values

If the supposition is that a process is at SS, then a Type-I (T-I) error is a claim of not-at-SS when the process is actually at SS. The concept is best understood by considering the distribution of the R-statistic for a SS process. The left curve in Figure 14.3 represents the statistical distribution of R-statistic values at SS. The average value is R = 1. Note that the distribution is not the symmetric, bell-shaped normal distribution. It is an F-type distribution, skewed, with no values below zero. The rightmost curve represents the R-distribution when a process is in a TS; here the average ratio is 3. In either case, the R-statistic will have some variability because of the random fluctuations in the sequential measured data.

If the value of R is larger than the upper 99% confidence value of about 2.5, illustrated by the vertical dashed line, there is about a 1% chance that the process could be at SS, but about a 70% chance that the value came from a process in a TS. Bet on it being a TS. If the R-value is a bit less than 2.5, there is still a substantial probability (about 30%) that a process in a TS could have generated it. So, do not use a single critical value to both reject and accept SS or TS.

There is also a lower critical value, about 0.9, indicated on the figure. If the process is in a TS, Figure 14.3 shows that there is less than about a 1% chance that it will generate an R-value below the lower critical value. However, if the process is at SS, then as illustrated in Figure 14.3 there is about a 40% likelihood that it will generate such an R-value or lower. So, if R < R-lower-critical, the odds are that the process is at SS. Claim SS. The alternate Type-I error is accepting SS when the process is actually in a TS.

However, if the R-value is in between the two critical values, there is a high likelihood of the process being either at SS or in a TS. There is no adequate justification for either claim. So, retain the last claim.
FIGURE 14.3 R-statistic distributions – at SS (left curve), at TS (right curve).
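The SS curve in Figure 14.3 can be approximated by simulation. The sketch below generates white noise (the SS null hypothesis) and records the resulting R-values; the filter equations are the assumed Cao–Rhinehart forms used earlier, and because successive R-values are autocorrelated the empirical percentile is indicative rather than exact.

```python
import random

def r_quantile_at_ss(n=200_000, l1=0.2, l2=0.1, l3=0.1, q=0.99, seed=1):
    """Empirical quantile of the R-statistic for a process at SS (white noise)."""
    rng = random.Random(seed)
    xf = nu2f = delta2f = x_old = 0.0
    rs = []
    for i in range(n):
        x = rng.gauss(0.0, 1.0)                        # iid noise: process at SS
        nu2f = l2 * (x - xf) ** 2 + (1 - l2) * nu2f
        xf = l1 * x + (1 - l1) * xf
        delta2f = l3 * (x - x_old) ** 2 + (1 - l3) * delta2f
        x_old = x
        if i > 100:                                    # discard the initialization kick
            rs.append((2 - l1) * nu2f / delta2f)
    rs.sort()
    return rs[int(q * len(rs))]

print(r_quantile_at_ss())   # compare with the quoted upper critical value of about 2.5
```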
Both T-I and alternate T-I errors are important. Critical values can be obtained from Cao and Rhinehart (1997) and from Shrowti et al. (2010). However, it is more convenient, and less dependent on idealizations, to visually select data from periods that represent a transient or a steady period, and to find the R-critical values that make the algorithm agree with the user's interpretation. Experience recommends R-upper ≈ 3 to 4 and R-lower ≈ 0.85 to 1.0, chosen by visual inspection of definitive demarcations of a transient and a steady process.

14.2.4 Illustration

Figure 14.4 illustrates the method. The process variable, PV, is connected to the left-hand vertical axis (log10 scale) and is graphed with respect to sample interval. Initially it is at a SS with a value of about 5. At sample number 200, the PV begins a first-order rise to a value of about 36. At sample number 700, the PV makes a step rise to a value of about 40. The R-statistic is attached to the same left-hand axis and shows an initial kick to a high value as variables are initialized, then relaxes to a value that wanders about the unity SS value. When the PV changes at sample 200, the R-statistic jumps to values ranging between 4 and 11, then relaxes back to unity as the trend reaches a steady value at about sample 500. When the small PV step occurs at sample 700, the R-value jumps to about 4, then decays back to its nominal unity range. The SS value is connected to the right-hand vertical axis and takes values of either 0 or 1 that change when the R-value crosses the two limits, Rβ,TS and R1−α,SS.
FIGURE 14.4 Illustration of the filter method to identify SS and TS.
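A rough way to reproduce this experiment, reusing the update() sketch and its state from earlier; the noise level, first-order time constant, and exact signal levels are guesses inferred from the figure description, not values from the book:

```python
import math
import random

rng = random.Random(7)

def pv(k):
    """Process variable mimicking the described trajectory (approximate values)."""
    if k < 200:
        base = 5.0                                          # initial SS
    elif k < 700:
        base = 36.0 - 31.0 * math.exp(-(k - 200) / 75.0)    # first-order rise (assumed tau)
    else:
        base = 40.0                                         # step rise at sample 700
    return base + rng.gauss(0.0, 0.5)                       # assumed noise level

history = [update(pv(k)) for k in range(1000)]              # list of (R, SS) per sample
```

Plotting history against sample number should show the qualitative behavior described: a large R kick during the ramp, a smaller kick at the step, and the SS flag switching as R crosses the two critical values.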
14.2.5 Discussion of Important Attributes

14.2.5.1 Distribution Separation

A good statistic will provide a large separation of the SS and TS distributions relative to the range of the distributions. The distribution shift is the signal that results from the process shift from SS to TS, and the width of the distribution is the uncertainty, or noise, associated with the statistic. A good method to detect SS and TS will have a large signal-to-noise aspect: a large shift in the not-at-SS distribution relative to the width of the distribution.

Figure 14.4 reveals the issue. In several SS periods (samples 100–200, 600–700, and 800–1000) the R-statistic has values in the 0.8 to 2 range. In the TS instances (samples 200–500, and after 700) the R-statistic has values in the 5–10 range, which are definitely different from the values in the SS periods. There is good separation between the two ranges of values.

Lower filter lambda values, for instance λ = 0.05, make the variation in the R-statistic smaller, meaning that the separation between SS and TS R-values is more definitive. However, larger lambda values, for instance λ = 0.2, make the time-to-detection (the ARL) shorter. Again, refer to Figure 14.4. Note that after the step change in the PV at sample 700, the PV holds at its new value, but the method does not return to a SS claim until about sample 740.

14.2.5.2 Average Run Length

Another aspect to be considered as a quality performance metric of a statistic is the ARL, the number of samples after an event occurs before a change can be confidently declared. In moving window methods with N data being considered, the last not-at-SS data event must be out of the window before analysis of the window can declare “at SS”. This would appear to require N data, an average run length of ARL = N. However, when the process is at SS, the statistic will not always be below the extreme value; there is only a probability, β, that it is beyond the lower critical value. When at SS, if there is no autocorrelation in the R-statistic, the expected number of data required to randomly obtain a value less than the β-probable extreme is 1/β. The ARL is then the number of samples to clear the window plus the expected number needed to generate such an extreme value. So, simplistically, ARL = N + 1/β.

The filter factors in Equations (14.1–14.3) can be related to the number of data (the length of the time window) in the average or variance calculation. Roughly, the number of data with a residual influence on the statistic is about 3/λ to 5/λ, depending on your choice of what constitutes an inconsequential residual influence of past data. To determine an ARL, first expand the filter mechanism to reveal the exponentially weighted moving average form:

$$
\begin{aligned}
X_{f,i} &= \lambda X_i + (1-\lambda)X_{f,i-1} \\
        &= \lambda X_i + (1-\lambda)\big[\lambda X_{i-1} + (1-\lambda)X_{f,i-2}\big] \\
        &= \lambda X_i + (1-\lambda)\big\{\lambda X_{i-1} + (1-\lambda)\big[\lambda X_{i-2} + (1-\lambda)X_{f,i-3}\big]\big\} \\
        &= \lambda X_i + (1-\lambda)\lambda X_{i-1} + (1-\lambda)^2 \lambda X_{i-2} + \cdots + (1-\lambda)^N \lambda X_{i-N} + (1-\lambda)^{N+1} X_{f,i-(N+1)}
\end{aligned}
\tag{14.5}
$$
Now it is possible to determine the value of N that makes the persisting influence of the old $X_{f,i-(N+1)}$ trivial, i.e., for the event to clear from the statistic. If at SS for N samplings, then $X_i \cong X_{i-1} \cong X_{i-2} \cong \cdots$ and $X_{f,i} \approx X_i$. Consider the value of N that makes

$$
(1-\lambda)^{N+1} X_{f,\mathrm{old}} \ll X_{SS}
\tag{14.6}
$$
As an estimate assume “
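As an illustration under an assumed interpretation (the 1% threshold below is our assumption, not a value from the text): if the "much less than" condition in Equation (14.6) is taken to mean a residual fraction below 0.01, then

$$
(1-\lambda)^{N+1} \le 0.01 \quad\Rightarrow\quad N \ge \frac{\ln 0.01}{\ln(1-\lambda)} - 1
$$

For λ = 0.1 this gives N ≥ 42.7, about 43 samples, consistent with the 3/λ to 5/λ range quoted above.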