Springer Texts in Statistics
Jay L. Devore Kenneth N. Berk Matthew A. Carlton
Modern Mathematical Statistics with Applications Third Edition
Springer Texts in Statistics

Series Editors
G. Allen, Department of Statistics, Houston, TX, USA
R. De Veaux, Department of Mathematics and Statistics, Williams College, Williamstown, MA, USA
R. Nugent, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, USA
Springer Texts in Statistics (STS) includes advanced textbooks from 3rd- to 4th-year undergraduate courses to 1st- to 2nd-year graduate courses. Exercise sets should be included. The series editors are currently Genevera I. Allen, Richard D. De Veaux, and Rebecca Nugent. Stephen Fienberg, George Casella, and Ingram Olkin were editors of the series for many years.
More information about this series at http://www.springer.com/series/417
Jay L. Devore Department of Statistics (Emeritus) California Polytechnic State University San Luis Obispo, CA, USA
Kenneth N. Berk Department of Mathematics (Emeritus) Illinois State University Normal, IL, USA
Matthew A. Carlton Department of Statistics California Polytechnic State University San Luis Obispo, CA, USA
ISSN 1431-875X  ISSN 2197-4136 (electronic)
Springer Texts in Statistics
ISBN 978-3-030-55155-1  ISBN 978-3-030-55156-8 (eBook)
https://doi.org/10.1007/978-3-030-55156-8

1st edition: © Thomson Brooks Cole, 2007
2nd edition: © Springer Science+Business Media, LLC 2012, corrected publication 2018
3rd edition: © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Preface
Purpose

Our objective is to provide a postcalculus introduction to the discipline of statistics that

• Has mathematical integrity and contains some underlying theory.
• Shows students a broad range of applications involving real data.
• Is up to date in its selection of topics.
• Illustrates the importance of statistical software.
• Is accessible to a wide audience, including mathematics and statistics majors (yes, there are quite a few of the latter these days, thanks to the proliferation of “big data”), prospective engineers and scientists, and those business and social science majors interested in the quantitative aspects of their disciplines.
A number of currently available mathematical statistics texts are heavily oriented toward a rigorous mathematical development of probability and statistics, with much emphasis on theorems, proofs, and derivations. The focus is more on mathematics than on statistical practice. Even when applied material is included, the scenarios are often contrived (many examples and exercises involving dice, coins, cards, widgets, or a comparison of treatment A to treatment B). Our exposition is an attempt to provide a reasonable balance between mathematical rigor and statistical practice. We believe that showing students the applicability of statistics to real-world problems is extremely effective in inspiring them to pursue further coursework and even career opportunities in statistics. Opportunities for exposure to mathematical foundations will follow in due course. In our view, it is more important for students coming out of this course to be able to carry out and interpret the results of a two-sample t test or simple regression analysis, and appreciate how these are based on underlying theory, than to manipulate joint moment generating functions or discourse on various modes of convergence.
Content and Mathematical Level

The book certainly does include core material in probability (Chap. 2), random variables and their distributions (Chaps. 3–5), and sampling theory (Chap. 6). But our desire to balance theory with application/data analysis is reflected in the way the book starts out, with a chapter on descriptive and exploratory statistical techniques rather than an immediate foray into the axioms of probability and their consequences. After the distributional infrastructure is in place, the remaining statistical chapters cover the basics of inference. In addition to introducing core ideas from estimation and hypothesis testing (Chaps. 7–10), there is emphasis on checking assumptions and examining the data prior to formal analysis. Modern topics such as bootstrapping, permutation tests, residual analysis, and logistic regression are included. Our treatment of regression, analysis of variance, and categorical data analysis (Chaps. 11–13) is definitely more oriented to dealing with real data than with theoretical properties of models. We also show many examples of output from commonly used statistical software packages, something noticeably absent in most other books pitched at this audience and level.

The challenge for students at this level should lie with mastery of statistical concepts as well as with mathematical wizardry. Consequently, the mathematical prerequisites and demands are reasonably modest. Mathematical sophistication and quantitative reasoning ability are certainly important, especially as they facilitate dealing with symbolic notation and manipulation. Students with a solid grounding in univariate calculus and some exposure to multivariate calculus should feel comfortable with what we are asking of them. The few sections where matrix algebra appears (transformations in Chap. 5 and the matrix approach to regression in the last section of Chap. 12) can easily be deemphasized or skipped entirely. Proofs and derivations are included where appropriate, but we think it likely that obtaining a conceptual understanding of the statistical enterprise will be the major challenge for readers.
Recommended Coverage

There should be more than enough material in our book for a year-long course. Those wanting to emphasize some of the more theoretical aspects of the subject (e.g., moment generating functions, conditional expectation, transformations, order statistics, sufficiency) should plan to spend correspondingly less time on inferential methodology in the latter part of the book. We have opted not to mark certain sections as optional, preferring instead to rely on the experience and tastes of individual instructors in deciding what should be presented. We would also like to think that students could be asked to read an occasional subsection or even section on their own and then work exercises to demonstrate understanding, so that not everything would need to be presented in class. Remember that there is never enough time in a course of any duration to teach students all that we’d like them to know!
Revisions for This Edition

• Many of the examples have been updated and/or replaced, especially those containing real data or references to applications published in various journals. The same is true of the roughly 1300 exercises in the book.
• The exposition has been refined and polished throughout to improve accessibility and eliminate unnecessary material and verbiage. For example, the categorical data chapter (Chap. 13) has been streamlined by discarding some of the methodology involving tests when parameters must be estimated.
• A section on simulation has been added to each of the chapters on probability, discrete distributions, and continuous distributions.
• The material in the chapter on joint distributions (Chap. 5) has been reorganized. There is now a separate section on linear combinations and their properties, and also one on the bivariate normal distribution.
• The material in the chapter on statistics and their sampling distributions (Chap. 6) has also been reorganized. In particular, there is now a separate section on the chi-squared, t, and F distributions prior to the one containing derivations of sampling distributions of statistics based on a normal random sample.
• The chapters on one-sample confidence intervals (Chap. 8) and hypothesis tests (Chap. 9) place more emphasis on t procedures and less on large-sample z procedures. This is also true of inferences based on two samples in Chap. 10.
• Chap. 9 now contains a subsection on using the bootstrap to test hypotheses.
• The material on multiple regression models containing quadratic, interaction, and indicator variables has been separated into its own section. And there is now a separate expanded section on logistic regression.
• The nonparametric and Bayesian material that previously comprised a single chapter has been separated into two chapters, and material has been added to each. For example, there is now a section on nonparametric inferences about population quantiles.
Acknowledgements

We gratefully acknowledge the plentiful feedback provided by reviewers and colleagues. A special salute goes to Bruce Trumbo for going way beyond his mandate in providing us an incredibly thoughtful review of 40+ pages containing many wonderful ideas and pertinent criticisms. Our emphasis on real data would not have come to fruition without help from the many individuals who provided us with data in published sources or in personal communications. We appreciate the production services provided by the folks at Springer.
A Final Thought

It is our hope that students completing a course taught from this book will feel as passionate about the subject of statistics as we still do after so many years in the profession. Only teachers can really appreciate how gratifying it is to hear from a student after he or she has completed a course that the experience had a positive impact and maybe even affected a career choice.

Jay L. Devore, Los Osos, CA, USA
Kenneth N. Berk, Normal, IL, USA
Matthew A. Carlton, San Luis Obispo, CA, USA
Contents

1 Overview and Descriptive Statistics
   1.1 The Language of Statistics
   1.2 Graphical Methods in Descriptive Statistics
   1.3 Measures of Center
   1.4 Measures of Variability
   Supplementary Exercises

2 Probability
   2.1 Sample Spaces and Events
   2.2 Axioms, Interpretations, and Properties of Probability
   2.3 Counting Methods
   2.4 Conditional Probability
   2.5 Independence
   2.6 Simulation of Random Events
   Supplementary Exercises

3 Discrete Random Variables and Probability Distributions
   3.1 Random Variables
   3.2 Probability Distributions for Discrete Random Variables
   3.3 Expected Values of Discrete Random Variables
   3.4 Moments and Moment Generating Functions
   3.5 The Binomial Probability Distribution
   3.6 The Poisson Probability Distribution
   3.7 Other Discrete Distributions
   3.8 Simulation of Discrete Random Variables
   Supplementary Exercises

4 Continuous Random Variables and Probability Distributions
   4.1 Probability Density Functions and Cumulative Distribution Functions
   4.2 Expected Values and Moment Generating Functions
   4.3 The Normal Distribution
   4.4 The Gamma Distribution and Its Relatives
   4.5 Other Continuous Distributions
   4.6 Probability Plots
   4.7 Transformations of a Random Variable
   4.8 Simulation of Continuous Random Variables
   Supplementary Exercises

5 Joint Probability Distributions and Their Applications
   5.1 Jointly Distributed Random Variables
   5.2 Expected Values, Covariance, and Correlation
   5.3 Linear Combinations
   5.4 Conditional Distributions and Conditional Expectation
   5.5 The Bivariate Normal Distribution
   5.6 Transformations of Multiple Random Variables
   5.7 Order Statistics
   Supplementary Exercises

6 Statistics and Sampling Distributions
   6.1 Statistics and Their Distributions
   6.2 The Distribution of Sample Totals, Means, and Proportions
   6.3 The Chi-Squared, t, and F Distributions
   6.4 Distributions Based on Normal Random Samples
   Supplementary Exercises
   Appendix: Proof of the Central Limit Theorem

7 Point Estimation
   7.1 Concepts and Criteria for Point Estimation
   7.2 The Methods of Moments and Maximum Likelihood
   7.3 Sufficiency
   7.4 Information and Efficiency
   Supplementary Exercises

8 Statistical Intervals Based on a Single Sample
   8.1 Basic Properties of Confidence Intervals
   8.2 The One-Sample t Interval and Its Relatives
   8.3 Intervals for a Population Proportion
   8.4 Confidence Intervals for the Population Variance and Standard Deviation
   8.5 Bootstrap Confidence Intervals
   Supplementary Exercises

9 Tests of Hypotheses Based on a Single Sample
   9.1 Hypotheses and Test Procedures
   9.2 Tests About a Population Mean
   9.3 Tests About a Population Proportion
   9.4 P-Values
   9.5 The Neyman–Pearson Lemma and Likelihood Ratio Tests
   9.6 Further Aspects of Hypothesis Testing
   Supplementary Exercises

10 Inferences Based on Two Samples
   10.1 The Two-Sample z Confidence Interval and Test
   10.2 The Two-Sample t Confidence Interval and Test
   10.3 Analysis of Paired Data
   10.4 Inferences About Two Population Proportions
   10.5 Inferences About Two Population Variances
   10.6 Inferences Using the Bootstrap and Permutation Methods
   Supplementary Exercises

11 The Analysis of Variance
   11.1 Single-Factor ANOVA
   11.2 Multiple Comparisons in ANOVA
   11.3 More on Single-Factor ANOVA
   11.4 Two-Factor ANOVA without Replication
   11.5 Two-Factor ANOVA with Replication
   Supplementary Exercises

12 Regression and Correlation
   12.1 The Simple Linear Regression Model
   12.2 Estimating Model Parameters
   12.3 Inferences About the Regression Coefficient β1
   12.4 Inferences for the (Mean) Response
   12.5 Correlation
   12.6 Investigating Model Adequacy: Residual Analysis
   12.7 Multiple Regression Analysis
   12.8 Quadratic, Interaction, and Indicator Terms
   12.9 Regression with Matrices
   12.10 Logistic Regression
   Supplementary Exercises

13 Chi-Squared Tests
   13.1 Goodness-of-Fit Tests
   13.2 Two-Way Contingency Tables
   Supplementary Exercises

14 Nonparametric Methods
   14.1 Exact Inference for Population Quantiles
   14.2 One-Sample Rank-Based Inference
   14.3 Two-Sample Rank-Based Inference
   14.4 Nonparametric ANOVA
   Supplementary Exercises

15 Introduction to Bayesian Estimation
   15.1 Prior and Posterior Distributions
   15.2 Bayesian Point and Interval Estimation

Appendix
Answers to Odd-Numbered Exercises
References
Index
1 Overview and Descriptive Statistics
Introduction

Statistical concepts and methods are not only useful but indeed often indispensable in understanding the world around us. They provide ways of gaining new insights into the behavior of many phenomena that you will encounter in your chosen field of specialization. The discipline of statistics teaches us how to make intelligent judgments and informed decisions in the presence of uncertainty and variation. Without uncertainty or variation, there would be little need for statistical methods or statisticians. If the yield of a crop was the same in every field, if all individuals reacted the same way to a drug, if everyone gave the same response to an opinion survey, and so on, then a single observation would reveal all desired information.

Section 1.1 establishes some key statistics vocabulary and gives a broad overview of how statistical studies are conducted. The rest of this chapter is dedicated to graphical and numerical methods for summarizing data.
1.1 The Language of Statistics
We are constantly exposed to collections of facts, or data, both in our professional capacities and in everyday activities. The discipline of statistics provides methods for organizing and summarizing data and for drawing conclusions based on information contained in the data.

An investigation will typically focus on a well-defined collection of objects constituting a population of interest. In one study, the population might consist of all multivitamin capsules produced by a certain manufacturer in a particular week. Another investigation might involve the population of all individuals who received a B.S. in statistics or mathematics during the most recent academic year. When desired information is available for all objects in the population, we have what is called a census. Constraints on time, money, and other scarce resources usually make a census impractical or infeasible. Instead, a subset of the population—a sample—is selected in some prescribed manner. Thus we might obtain a sample of pills from a particular production run as a basis for investigating whether pills are conforming to manufacturing specifications, or we might select a sample of last year’s graduates to obtain feedback about the quality of the curriculum.

We are usually interested only in certain characteristics of the objects in a population: the amount of vitamin C in the pill, the sex of a student, the age of a vehicle, and so on. A characteristic may be categorical, such as sex or college major, or it may be quantitative in nature. In the former case, the value of the characteristic is a category (e.g., female or economics), whereas in the latter case, the value is a number (e.g., age = 5.1 years or vitamin C content = 65 mg). A variable is any characteristic whose value may change from one object to another in the population. We shall initially denote variables by lowercase letters from the end of our alphabet. Examples include

x = brand of computer owned by a student
y = number of items purchased by a customer at a grocery store
z = braking distance of an automobile under specified conditions

Data comes from making observations either on a single variable or simultaneously on two or more variables. A univariate data set consists of observations on a single variable. For example, we might consider the type of computer, laptop (L) or desktop (D), for ten recent purchases, resulting in the categorical data set

D  L  L  L  D  L  L  D  L  L

The following sample of lifetimes (hours) of cell phone batteries under continuous use is a quantitative univariate data set:

10.6  11.2  10.8  8.8  10.1  9.0  9.5  11.5
We have bivariate data when observations are made on each of two variables. Our data set might consist of a (height, weight) pair for each basketball player on a team, with the first observation as (72, 168), the second as (75, 212), and so on. If a kinesiologist determines the values of x = recuperation time from an injury and y = type of injury, the resulting data set is bivariate with one variable quantitative and the other categorical. Multivariate data arises when observations are made on more than two variables. For example, a research physician might determine the systolic blood pressure, diastolic blood pressure, and serum cholesterol level for each patient participating in a study. Each observation would be a triple of numbers, such as (120, 80, 146). In many multivariate data sets, some variables are quantitative and others are categorical. Thus the annual automobile issue of Consumer Reports gives values of such variables as type of vehicle (small, sporty, compact, midsize, large), city fuel efficiency (mpg), highway fuel efficiency (mpg), drive train type (rear wheel, front wheel, four wheel), and so on.

Branches of Statistics

An investigator who has collected data may wish simply to summarize and describe important features of the data. This entails using methods from descriptive statistics. Some of these methods are graphical in nature; the constructions of histograms, boxplots, and scatterplots are primary examples. Other descriptive methods involve calculation of numerical summary measures, such as means, standard deviations, and correlation coefficients. The wide availability of statistical computer software packages has made these tasks much easier to carry out than they used to be. Computers are much more efficient than human beings at calculation and the creation of pictures (once they have received appropriate instructions from the user!). This means that the investigator doesn’t have to expend much effort on “grunt work” and will have more time to study the data and extract important messages. Throughout this book, we will present output from various packages such as R, SAS, and Minitab.
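As a small illustration of the kind of "grunt work" software handles, the sketch below computes two standard numerical summaries for the eight battery lifetimes listed earlier. Python is used here purely for illustration; it is not one of the packages whose output appears in this book, and the book's results were not produced this way.

```python
import statistics

# Cell phone battery lifetimes (hours) from the univariate data set above
lifetimes = [10.6, 11.2, 10.8, 8.8, 10.1, 9.0, 9.5, 11.5]

mean = statistics.mean(lifetimes)   # a measure of center (see Section 1.3)
sd = statistics.stdev(lifetimes)    # sample standard deviation (see Section 1.4)

print(f"n = {len(lifetimes)}, mean = {mean:.2f} hours, sd = {sd:.2f} hours")
```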
Example 1.1 Charity is a big business in the USA. The website charitynavigator.com gives information on roughly 5500 charitable organizations, and there are many smaller charities that fly below the navigator’s radar. Some charities operate very efficiently, with fundraising and administrative expenses that are only a small percentage of total expenses, whereas others spend a high percentage of what they take in on such activities. Here is data on fundraising expenses, as a percentage of total expenditures, for a random sample of 60 charities:

 6.1  12.6  34.7   1.6  18.8   2.2   3.0   2.2   5.6   3.8
 2.2   3.1   1.3   1.1  14.1   4.0  21.0   6.1   1.3  20.4
 7.5   3.9  10.1   8.1  19.5   5.2  12.0  15.8  10.4   5.2
 6.4  10.8  83.1   3.6   6.2   6.3  16.3  12.7   1.3   0.8
 8.8   5.1   3.7  26.3   6.0  48.0   8.2  11.7   7.2   3.9
15.3  16.6   8.8  12.0   4.7  14.7   6.4  17.0   2.5  16.2
Without any organization, it is difficult to get a sense of the data’s most prominent features: what a typical (i.e., representative) value might be, whether values are highly concentrated about a typical value or quite dispersed, whether there are any gaps in the data, what fraction of the values are less than 20%, and so on. Figure 1.1 shows a histogram. In Section 1.2 we will discuss construction and interpretation of this graph. For the moment, we hope you see how it describes the way the percentages are distributed over the range of possible values from 0 to 100. Of the 60 charities, 36 use less than 10% on fundraising, and 18 use between 10% and 20%. Thus 54 out of the 60 charities in the sample, or 90%, spend less than 20% of money collected on fundraising. How much is too much? There is a delicate balance: most charities must spend money to raise money, but then money spent on fundraising is not available to help beneficiaries of the charity. Perhaps each individual giver should draw his or her own line in the sand.
[Figure 1.1 A histogram for the charity fundraising data of Example 1.1. Horizontal axis: fundraising expenses (percent of total expenditures), from 0 to 90; vertical axis: number of charities, from 0 to 40.]
■
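A display along the lines of Figure 1.1 takes only a few lines of code. The sketch below uses Python with matplotlib purely as an illustration (the figure in the text was not necessarily produced this way), applying classes of width 10 from 0 to 90 to the 60 fundraising percentages.

```python
import matplotlib.pyplot as plt

# The 60 fundraising expense percentages from Example 1.1
expenses = [ 6.1, 12.6, 34.7,  1.6, 18.8,  2.2,  3.0,  2.2,  5.6,  3.8,
             2.2,  3.1,  1.3,  1.1, 14.1,  4.0, 21.0,  6.1,  1.3, 20.4,
             7.5,  3.9, 10.1,  8.1, 19.5,  5.2, 12.0, 15.8, 10.4,  5.2,
             6.4, 10.8, 83.1,  3.6,  6.2,  6.3, 16.3, 12.7,  1.3,  0.8,
             8.8,  5.1,  3.7, 26.3,  6.0, 48.0,  8.2, 11.7,  7.2,  3.9,
            15.3, 16.6,  8.8, 12.0,  4.7, 14.7,  6.4, 17.0,  2.5, 16.2]

# Class boundaries 0, 10, 20, ..., 90 give nine classes of width 10
plt.hist(expenses, bins=range(0, 100, 10), edgecolor="black")
plt.xlabel("Fundraising expenses (percent of total expenditures)")
plt.ylabel("Number of charities")
plt.show()
```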
Having obtained a sample from a population, an investigator would frequently like to use sample information to draw some type of conclusion (make an inference of some sort) about the population. That is, the sample is typically a means to an end rather than an end in itself. Techniques for generalizing from a sample to a population in a precise and objective way are gathered within the branch of our discipline called inferential statistics.

Example 1.2 The authors of the article “Fire Safety of Glued-Laminated Timber Beams in Bending” (J. of Structural Engr. 2017) conducted an experiment to test the fire resistance properties of wood pieces connected at corners by sawtooth-shaped “fingers” along with various types of commercial adhesive. The beams were all exposed to the same fire and load conditions. The accompanying data on fire resistance time (min) for a sample of timber beams bonded with polyurethane adhesive appeared in the article:

47.0  53.0  52.5  52.0  47.5  56.5  45.0  43.5  48.0  48.0
41.0  34.0  36.5  49.0  47.5  34.0  34.0  36.0  42.0
Suppose we want an estimate of the true average fire resistance time under these conditions. (Conceptualizing a population of all such beams with polyurethane bonding under these experimental conditions, we are trying to estimate the population mean.) It can be shown that, with a high degree of confidence, the population mean fire resistance time is between 41.2 and 48.0 min; this is called a confidence interval or an interval estimate. On the other hand, this data can also be used to predict the fire resistance time of a single timber beam under these conditions. With a high degree of certainty, the fire resistance time of a single such beam will exceed 29.4 min; the number 29.4 is called a lower prediction bound. ■

Probability Versus Statistics

The main focus of this book is on presenting and illustrating methods of inferential statistics that are useful in research. The most important types of inferential procedures—point estimation, hypothesis testing, and estimation by confidence intervals—are introduced in Chapters 7–9 and then used in more complicated settings in Chapters 10–15. The remainder of this chapter presents methods from descriptive statistics that are most used in the development of inference.

Chapters 2–6 present material from the discipline of probability. This material ultimately forms a bridge between the descriptive and inferential techniques. Mastery of probability leads to a better understanding of how inferential procedures are developed and used, how statistical conclusions can be translated into everyday language and interpreted, and when and where pitfalls can occur in applying the methods.

Probability and statistics both deal with questions involving populations and samples, but do so in an “inverse manner” to each other. In probability, properties of the population under study are assumed known (e.g., in a numerical population, some specified distribution of the population values may be assumed), and questions regarding a sample taken from the population are posed and answered. In statistics, characteristics of a sample are available to the experimenter, and this information enables the experimenter to draw conclusions about the population. The relationship between the two disciplines can be summarized by saying that probability reasons from the population to the sample (deductive reasoning), whereas inferential statistics reasons from the sample to the population (inductive reasoning). This is illustrated in Figure 1.2.
[Figure 1.2 The relationship between probability and inferential statistics: an arrow labeled “Probability” points from the population to the sample, and an arrow labeled “Inferential statistics” points from the sample back to the population.]
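To make the population-to-sample direction concrete, here is a minimal sketch of the kind of probability calculation posed in the seatbelt discussion that follows: the population proportion is assumed known (p = 0.85), and we ask how a sample of n = 100 drivers is likely to behave. Python is used only for illustration.

```python
from math import comb

# Probability viewpoint: population proportion of seatbelt users assumed known
n, p = 100, 0.85

# P(at most 70 regular seatbelt users among 100 sampled drivers), binomial model
prob_at_most_70 = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(71))

# Expected number of regular users in the sample
expected_users = n * p

print(f"P(X <= 70) = {prob_at_most_70:.6f}")
print(f"E(X) = {expected_users:.1f}")
```

The inferential direction reverses this: sample information (e.g., 80 regular users observed among 100 drivers) is used to draw conclusions about the unknown population proportion, as described in the paragraphs below.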
Before we can understand what a particular sample can tell us about the population, we should first understand the uncertainty associated with taking a sample from a given population. This is why we study probability before statistics.

As an example of the contrasting focus of probability and inferential statistics, consider drivers’ use of seatbelts in automobiles. According to the article “Somehow, Way Too Many Americans Still Aren’t Wearing Seatbelts” (www.wired.com, Sept. 2016), data collected by observers from the National Highway Traffic Safety Administration indicates that 88.5% of drivers and front seat passengers buckle up. But this percentage varies considerably by location. In the 34 states in which a driver can be pulled over and cited for nonusage, 91.2% wore their seatbelts in 2015. By contrast, in the 15 states where a citation can be given only if a driver is pulled over for another infraction and the one state where usage is not mandatory (New Hampshire), usage drops to 78.6%.

In a probability context, we might assume that 85% of all drivers in a particular metropolitan area regularly use seatbelts (an assumption about the population) and then ask, “How likely is it that a sample of 100 drivers will include at most 70 who regularly use their seatbelt?” or “How many drivers in a sample of size 100 can we expect to regularly use their seatbelt?” On the other hand, in inferential statistics, sample information is available, e.g., a sample of 100 drivers from this area reveals that 80 regularly use their seatbelts. We might then ask, “Does this provide strong evidence for concluding that less than 90% of all drivers in this area are regular seatbelt users?” In this latter scenario, sample information will be employed to answer a question about the structure of the entire population from which the sample was selected.

Next, consider a study involving a sample of 25 patients to investigate the efficacy of a new minimally invasive method for rotator cuff surgery. The amount of time that each individual subsequently spends in physical therapy is then determined. The resulting sample of 25 PT times is from a population that does not actually exist. Instead it is convenient to think of the population as consisting of all possible times that might be observed under similar experimental conditions. Such a population is referred to as a conceptual or hypothetical population. There are a number of situations in which we fit questions into the framework of inferential statistics by conceptualizing a population.

Collecting Data

Statistics deals not only with the organization and analysis of data once it has been collected but also with the development of techniques for collecting that data. If data is not properly collected, an investigator might not be able to answer the questions under consideration with a reasonable degree of confidence. One common problem is that the target population—the one about which conclusions are to be drawn—may be different from the population actually sampled. In that case, an investigator must be very cautious about generalizing from the circumstances under which data has been gathered. For example, advertisers would like various kinds of information about the television-viewing habits of potential customers. The most systematic information of this sort comes from placing monitoring devices in a small number of homes across the USA.
It has been conjectured that placement of such devices in and of itself alters viewing behavior, so that characteristics of the sample may be different from those of the target population. As another example, a sample of five engines with a new design may be experimentally manufactured and tested to investigate efficiency. These five could be viewed as a sample from the conceptual population of all prototypes that could be manufactured under similar conditions, but not necessarily as representative of all units manufactured once regular production gets under way. Methods for using sample information to draw conclusions about future production units may be problematic. Similarly, a new drug may be tried on patients who arrive at a clinic (i.e., a voluntary sample), but there may be some question about how typical these patients are. They may not be representative of patients elsewhere or patients at the same clinic next year.

When data collection entails selecting individuals or objects from a list, the simplest method for ensuring a representative selection is to take a simple random sample. This is one for which any particular subset of the specified size (e.g., a sample of size 100) has the same chance of being selected. For example, if the list consists of 1,000,000 serial numbers, the numbers 1, 2, …, up to 1,000,000 could be placed on identical slips of paper. After placing these slips in a box and thoroughly mixing, slips could be drawn one by one until the requisite sample size has been obtained. Alternatively (and much to be preferred), a computer’s random number generator could be employed to generate 100 distinct numbers between 1 and 1,000,000.

Sometimes alternative sampling methods can be used to make the selection process easier, to obtain extra information, or to increase the degree of precision in conclusions. One such method, stratified sampling, entails separating the population units into nonoverlapping groups and taking a sample from each one. For example, a cell phone manufacturer might want information about customer satisfaction for units produced during the previous year. If three different models were manufactured and sold, a separate sample could be selected from each of the three corresponding strata. This would result in information on all three models and ensure that no one model was over- or underrepresented in the entire sample.

Frequently a “convenience” sample is obtained by selecting individuals or objects without systematic randomization. As an example, a collection of bricks may be stacked in such a way that it is extremely difficult for those in the center to be selected. If the bricks on the top and sides of the stack were somehow different from the others, the resulting sample data would not be representative of the population. Often an investigator will assume that such a convenience sample approximates a random sample, in which case a statistician’s repertoire of inferential methods can be used; however, this is a judgment call.

Researchers may also collect data by carrying out some sort of designed experiment. This may involve deciding how to allocate several different treatments (such as fertilizers or drugs) to various experimental units (plots of land or patients). Alternatively, an investigator may systematically vary the levels or categories of certain factors (e.g., amount of fertilizer or dose of a drug) and observe the effect on some response (such as corn yield or blood pressure).

Example 1.3 Neonicotinoid insecticides (NNIs) are popular in agricultural use, especially for growing corn, but scientists are increasingly concerned about their effects on bee populations.
An article in Science (June 30, 2017) described the results of a two-year study in which scientists randomly assigned some bee colonies to be exposed to “field-realistic” levels and durations of NNIs, while other colonies did not have NNI exposure. The researchers found that bees in the colonies exposed to NNIs had a 23% reduced life span, on average, compared to those in nonexposed colonies. One possible explanation for this result is chance variation—i.e., that NNIs really don’t affect bee colony health and the observed difference is just “random noise,” in the same way that tossing two identical coins 10 times each will usually produce different numbers of heads. However, in this case, inferential methods discussed in this textbook (and in the original article) suggest that chance variation by itself cannot adequately explain the magnitude of the observed difference, indicating that NNIs may very well be responsible for the reduced average life span. ■

Exercises: Section 1.1 (1–13)

1. Give one possible sample of size 4 from each of the following populations:
   a. All daily newspapers published in the USA
   b. All companies listed on the New York Stock Exchange
   c. All students at your college or university
   d. All grade point averages of students at your college or university

2. For each of the following hypothetical populations, give a plausible sample of size 4:
   a. All distances that might result when you throw a football
   b. Page lengths of books published 5 years from now
   c. All possible earthquake strength measurements (Richter scale) that might be recorded in California during the next year
   d. All possible yields (in grams) from a certain chemical reaction carried out in a laboratory

3. Consider the population consisting of all cell phones of a certain brand and model, and focus on whether a cell phone needs service while under warranty.
   a. Pose several probability questions based on selecting a sample of 100 such cell phones.
   b. What inferential statistics question might be answered by determining the number of such cell phones in a sample of size 100 that need warranty service?

4. Give three different examples of concrete populations and three different examples of hypothetical populations. For one each of your concrete and hypothetical populations, give an example of a probability question and an example of an inferential statistics question.

5. The authors of the article “From Dark to Light: Skin Color and Wages among African Americans” (J. of Human Resources 2007: 701–738) investigated the association between darkness of skin and hourly wages. For a sample of 948 African Americans, skin color was classified as dark black, medium black, light black, or white.
   a. What variables were recorded for each member of the sample?
   b. Classify each of these variables as quantitative or categorical.

6. Consumer Reports compared the actual polyunsaturated fat percentages for different brands of “low-fat” margarine. Twenty-six containers of margarine were purchased; for each one, the brand was noted and the percent of polyunsaturated fat was determined.
   a. What variables were recorded for each margarine container in the sample?
   b. Classify each of these variables as quantitative or categorical.
   c. Give some examples of inferential statistics questions that Consumer Reports might try to answer with the data from these 26 margarine containers.
   d. “The average polyunsaturated fat content for the five Parkay margarine containers in the sample was 12.8%.” Is the preceding sentence an example of descriptive statistics or inferential statistics?

7. The article “Is There a Market for Functional Wines? Consumer Preferences and Willingness to Pay for Resveratrol-Enriched Red Wine” (Food Quality and Preference 2008: 360–371) included the following information for a variety of Spanish wines:
   a. Region of origin
   b. Price of the wine, in euros
   c. Style of wine (young or crianza)
   d. Production method (conventional or organic)
   e. Type of grapes used (regular or resveratrol-enhanced)
   Classify each of these variables as quantitative or categorical.

8. The authors of the article cited in the previous exercise surveyed 300 wine consumers, each of whom tasted two different wines. For each individual in the study, the following information was recorded:
   a. Gender
   b. Age, in years
   c. Monthly income, in euros
   d. Educational level (primary, secondary, or university)
   e. Willingness to pay (WTP) for the first wine tasted, in euros
   f. WTP for the second wine tasted, in euros. (WTP is a very common measure for consumer products. Researchers ask, “How much would you be willing to pay for this item?”)
   Classify each of the variables (a)–(f) as quantitative or categorical.

9. Many universities and colleges have instituted supplemental instruction (SI) programs, in which a student facilitator meets regularly with a small group of students enrolled in the course to promote discussion of course material and enhance subject mastery. Suppose that students in a large statistics course (what else?) are randomly divided into a control group that will not participate in SI and a treatment group that will participate. At the end of the term, each student’s total score in the course is determined.
   a. Are the scores from the SI group a sample from an existing population? If so, what is it? If not, what is the relevant conceptual population?
   b. What do you think is the advantage of randomly dividing the students into the two groups rather than letting each student choose which group to join?
   c. Why didn’t the investigators put all students in the treatment group?

10. The California State University (CSU) system consists of 23 campuses, from San Diego State in the south to Humboldt State near the Oregon border. A CSU administrator wishes to make an inference about the average distance between the hometowns of students and their campuses. Describe and discuss several different sampling methods that might be employed.

11. A certain city divides naturally into ten district neighborhoods. A real estate appraiser would like to develop an equation to predict appraised value from characteristics such as age, size, number of bathrooms, distance to the nearest school, and so on. How might she select a sample of single-family homes that could be used as a basis for this analysis?

12. The amount of flow through a solenoid valve in an automobile’s pollution control system is an important characteristic. An experiment was carried out to study how flow rate depended on three factors: armature length, spring load, and bobbin depth. Two different levels (low and high) of each factor were chosen, and a single observation on flow was made for each combination of levels.
   a. The resulting data set consisted of how many observations?
   b. Does this study involve sampling an existing population or a conceptual population?

13. In a famous experiment carried out in 1882, Michelson and Newcomb obtained 66 observations on the time it took for light to travel between two locations in Washington, D.C. A few of the measurements (coded in a certain manner) were 31, 23, 32, 36, 22, 26, 27, and 31.
   a. Why are these measurements not identical?
   b. Does this study involve sampling an existing population or a conceptual population?
1.2 Graphical Methods in Descriptive Statistics
There are two general types of methods within descriptive statistics: graphical and numerical summaries. In this section we will discuss the first of these types—representing a data set using visual techniques. In Sections 1.3 and 1.4, we will develop some numerical summary measures for data sets. Many visual techniques may already be familiar to you: frequency tables, histograms, pie charts, bar graphs, scatterplots, and the like. Here we focus on a selected few of these techniques that are most useful and relevant to probability and inferential statistics.

Notation

Some general notation will make subsequent discussions easier. The number of observations in a single sample, that is, the sample size, will often be denoted by n. So n = 4 for the sample of universities {Stanford, Iowa State, Wyoming, Rochester} and also for the sample of pH measurements {6.3, 6.2, 5.9, 6.5}. If two samples are simultaneously under consideration, either m and n or n1 and n2 can be used to denote the numbers of observations. Thus if {3.75, 2.60, 3.20, 3.79} and {2.75, 1.20, 2.45} are GPAs for two sets of friends, respectively, then m = 4 and n = 3.

Given a data set consisting of n observations on some variable x, the individual observations will be denoted by x1, x2, x3, …, xn. The subscript bears no relation to the magnitude of a particular observation. Thus x1 will not in general be the smallest observation in the set, nor will xn typically be the largest. In many applications, x1 will be the first observation gathered by the experimenter, x2 the second, and so on. The ith observation in the data set will be denoted by xi.

Stem-and-Leaf Displays

Consider a numerical data set x1, x2, …, xn for which each xi consists of at least two digits. A quick way to obtain an informative visual representation of the data set is to construct a stem-and-leaf display, or stem plot.

Steps for constructing a stem-and-leaf display
1. Select one or more leading digits for the stem values. The trailing digits become the leaves.
2. List possible stem values in a vertical column.
3. Record the leaf for every observation beside the corresponding stem value.
4. Order the leaves from smallest to largest on each line.
5. Indicate the units for stems and leaves someplace in the display.

If the data set consists of exam scores, each between 0 and 100, the score of 83 would have a stem of 8 and a leaf of 3. For a data set of automobile fuel efficiencies (mpg), all between 8.1 and 47.8, we could use the tens digit as the stem, so 32.6 would then have a leaf of 2.6. Usually, a display based on between 5 and 20 stems is appropriate. For a simple example, assume a sample of seven test scores: 93, 84, 86, 78, 95, 81, 72. Then the first-pass stem plot would be

7|82
8|461
9|35
With the leaves ordered this becomes

7|28     Stem: tens digit
8|146    Leaf: ones digit
9|35
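The steps above are simple enough to automate. The following sketch is a bare-bones Python implementation, offered only as an illustration (it is not from the book), that produces the ordered display for the seven test scores using the tens digit as the stem.

```python
from collections import defaultdict

scores = [93, 84, 86, 78, 95, 81, 72]   # the seven test scores from the text

# Group the ones digits (leaves) under each tens digit (stem)
leaves = defaultdict(list)
for score in sorted(scores):
    leaves[score // 10].append(score % 10)

for stem in sorted(leaves):
    print(f"{stem}|{''.join(str(leaf) for leaf in leaves[stem])}")
# Prints:
# 7|28
# 8|146
# 9|35
```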
Occasionally stems will be repeated to spread out the stem-and-leaf display. For instance, if the preceding test scores included dozens of values in the 70s, we could repeat the stem 7 twice, using 7L for scores in the low 70s (leaves 0, 1, 2, 3, 4) and 7H for scores in the high 70s (leaves 5, 6, 7, 8, 9).

Example 1.4 Job prospects for students majoring in an engineering discipline continue to be very robust. How much can a new engineering graduate expect to earn? Here are the starting salaries for a sample of 38 civil engineers from one author’s home institution (Spring 2016), courtesy of the university’s Graduate Status Report:

58,000  62,000  56,160  67,000  66,560  58,240  60,000  61,000  70,000  61,000
65,000  60,000  61,000  80,000  62,500  75,000  60,000  68,000  57,600  65,000
55,000  63,000  60,000  70,000  68,640  72,000  83,000  50,128  56,000  63,000
55,000  52,000  70,000  80,000  60,320  65,000  70,000  65,000
Figure 1.3 shows a stem-and-leaf display of these 38 starting salaries. Hundreds places and lower have been truncated; for instance, the lowest salary in the sample was $50,128, which is represented by 5|0 in the first row.
5L | 02
5H | 5566788
6L | 000001112233
6H | 55556788
7L | 00002
7H | 5
8L | 003

Stem: $10,000   Leaf: $1000

Figure 1.3 Stem-and-leaf display for starting salaries of civil engineering graduates
Typical starting salaries were in the $60,000–$65,000 range, with most graduates starting between $55,000 and $70,000. A lucky (and/or exceptionally talented!) handful of students earned $80,000 or more upon graduation. ■

Most graphical displays of quantitative data, including the stem-and-leaf display, convey information about the following aspects of the data:

• Identification of a typical or representative value
• Extent of spread about the typical value
• Presence of any gaps in the data
• Extent of symmetry in the distribution of values
• Number and location of peaks
• Presence of any outlying values (i.e., unusually small or large)
Example 1.5 Figure 1.4 presents stem-and-leaf displays for a random sample of lengths of golf courses (yards) that have been designated by Golf Magazine as among the most challenging in the USA. Among the sample of 40 courses, the shortest is 6433 yards long, and the longest is 7280 yards. The lengths appear to be distributed in a roughly uniform fashion over the range of values in the sample. Notice that a stem choice here of either a single digit (6 or 7) or three digits (643, …, 728) would yield an uninformative display, the first because of too few stems and the latter because of too many.
(a)
64 | 33 35 64 70              Stem: Thousands and hundreds digits
65 | 06 26 27 83              Leaf: Tens and ones digits
66 | 05 14 94
67 | 00 13 45 70 70 90 98
68 | 50 70 73 90
69 | 00 04 27 36
70 | 05 11 22 40 50 51
71 | 05 13 31 65 68 69
72 | 09 80

(b)
Stem-and-leaf of yardage  N = 40
Leaf Unit = 10
64  3367
65  0228
66  019
67  0147799
68  5779
69  0023
70  012455
71  013666
72  08

Figure 1.4 Stem-and-leaf displays of golf course yardages: (a) two-digit leaves; (b) display from Minitab with truncated one-digit leaves
■
Dotplots

A dotplot is an attractive summary of numerical data when the data set is reasonably small or there are relatively few distinct data values. Each observation is represented by a dot above the corresponding location on a horizontal measurement scale. When a value occurs more than once, there is a dot for each occurrence, and these dots are stacked vertically. As with a stem-and-leaf display, a dotplot gives information about location, spread, extremes, and gaps.

Example 1.6 For decades, The Economist has used its “Big Mac index,” defined for any country as the average cost of a McDonald’s Big Mac, as a humorous way to compare product costs across nations and also examine the valuation of the US dollar worldwide. Here are values of the Big Mac index, converted to US dollars, for 56 countries reported by The Economist on January 1, 2019 (listed in alphabetical order by country name):

2.00  4.35  2.33  3.18  4.55  4.07  5.08  3.89  3.05  3.73
3.77  3.24  3.81  4.60  2.23  4.64  3.23  3.49  2.55  3.03
2.55  2.34  4.58  3.60  2.75  3.46  4.31  2.20  2.54  2.32
4.19  3.18  5.86  2.73  3.31  3.14  2.67  2.80  3.30  2.29
1.65  3.20  4.28  2.24  4.02  3.18  5.84  6.62  2.24  3.72
2.00  1.94  3.81  5.58  4.31  2.80
Figure 1.5 shows a dotplot of these values. We can see that the average cost of a Big Mac in the USA, $5.58, is higher than in all but three countries (Sweden, Norway, and Switzerland). A typical Big Mac index value is around $3.20, but those values vary substantially across the globe. The distribution extends farther to the right of that typical value than to the left, due to a handful of comparatively large Big Mac prices in the USA and a few other countries.
[Figure 1.5 A dotplot of the data from Example 1.6. Horizontal axis: Big Mac index (US$), roughly 2.0 to 6.5; the U.S. value ($5.58) is marked on the plot.]
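A dotplot along these lines can be sketched by stacking one point per occurrence of each value. The code below uses Python with matplotlib purely as an illustration of the idea; Figure 1.5 itself was not necessarily produced this way.

```python
from collections import Counter
import matplotlib.pyplot as plt

# The 56 Big Mac index values (US$) from Example 1.6
big_mac = [2.00, 4.35, 2.33, 3.18, 4.55, 4.07, 5.08, 3.89, 3.05, 3.73,
           3.77, 3.24, 3.81, 4.60, 2.23, 4.64, 3.23, 3.49, 2.55, 3.03,
           2.55, 2.34, 4.58, 3.60, 2.75, 3.46, 4.31, 2.20, 2.54, 2.32,
           4.19, 3.18, 5.86, 2.73, 3.31, 3.14, 2.67, 2.80, 3.30, 2.29,
           1.65, 3.20, 4.28, 2.24, 4.02, 3.18, 5.84, 6.62, 2.24, 3.72,
           2.00, 1.94, 3.81, 5.58, 4.31, 2.80]

# Stack one dot per occurrence of each distinct value
for value, count in Counter(big_mac).items():
    plt.plot([value] * count, range(1, count + 1), "ko", markersize=4)

plt.xlabel("Big Mac index (US$)")
plt.yticks([])   # vertical position only indicates stacking, not a measured scale
plt.show()
```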
How can the Big Mac index be used to assess the strength of the US dollar? As an example, a Big Mac cost £3.19 in Britain, which converts to $4.07 using the exchange rate at the time the data was collected. Since that same amount of British currency ought to buy $5.58 worth of American goods according to the Big Mac index, it appears that the British pound was substantially undervalued at the time. ■

Histograms

While stem-and-leaf displays and dotplots are useful for smaller data sets, histograms are well-suited to larger samples or the results of a census. Consider first data resulting from observations on a “counting variable” x, such as the number of traffic citations a person received during the last year, or the number of people arriving for service during a particular period. The frequency of any particular x value is simply the number of times that value occurs in the data set. The relative frequency of a value is the fraction or proportion of times the value occurs:

relative frequency of a value = (number of times the value occurs) / (number of observations in the data set)
Suppose, for example, that our data set consists of 200 observations on x = the number of major defects in a new car of a certain type. If 70 of these x values are 1, then the frequency of the value 1 is (obviously) 70, while the relative frequency of the value 1 is 70/200 = .35. Multiplying a relative frequency by 100 gives a percentage; in the defect example, 35% of the cars in the sample had just one major defect. The relative frequencies, or percentages, are usually of more interest than the frequencies themselves. In theory, the relative frequencies should sum to 1, but in practice the sum may differ slightly from 1 because of rounding. A frequency distribution is a tabulation of the frequencies and/or relative frequencies. Example 1.7 How unusual is a no-hitter or a one-hitter in a major league baseball game, and how frequently does a team get more than 10, 15, or 20 hits? Table 1.1 is a frequency distribution for the number of hits per team per game for all games in the 2016 regular season, courtesy of the website www.retrosheet.org.
Table 1.1 Frequency distribution for hits per team in 2016 MLB games

Hits/Team/Game   Number of games   Relative frequency
0                      1           .0002
1                     19           .0039
2                     56           .0115
3                    151           .0311
4                    279           .0575
5                    380           .0783
6                    471           .0970
7                    554           .1141
8                    564           .1161
9                    556           .1145
10                   469           .0966
11                   386           .0795
12                   297           .0612
13                   202           .0416
14                   176           .0362
15                   102           .0210
16                    71           .0146
17                    56           .0115
18                    32           .0066
19                    22           .0045
20                     3           .0006
21                     4           .0008
22                     5           .0010
Total              4,856           .9999
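A frequency distribution like Table 1.1 is easy to tabulate with software. The following R sketch is ours; it assumes the per-team-per-game hit counts are stored in a vector named hits (a hypothetical name), and table and prop.table then give the frequencies and relative frequencies.

hits <- c(8, 11, 5, 9, 13, 7)        # illustrative values only, not the full 2016 data
freq <- table(hits)                  # frequency of each observed value
relfreq <- prop.table(freq)          # relative frequencies (sum to 1 up to rounding)
round(relfreq, 4)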
The corresponding histogram in Figure 1.6 rises rather smoothly to a single peak and then declines. The histogram extends a bit more to the right (toward large values) than it does on the left— a slight “positive skew.”
Figure 1.6 Relative frequency histogram of x = number of hits per team per game for the 2016 MLB season (vertical axis: relative frequency; horizontal axis: number of hits)
Either from the tabulated information or from the histogram, we can determine the following:

proportion of instances of at most two hits = (relative frequency for x = 0) + (relative frequency for x = 1) + (relative frequency for x = 2) = .0002 + .0039 + .0115 = .0156
Similarly,

proportion of instances of between 5 and 10 hits (inclusive) = .0783 + .0970 + ··· + .0966 = .6166

That is, roughly 62% of the time that season, a team had between 5 and 10 hits (inclusive) in a game. Incidentally, the only no-hitter of the season (notice the frequency of 1 for x = 0) came on April 21, 2016, with Jake Arrieta pitching the complete game for the Chicago Cubs against the Cincinnati Reds. ■
Constructing a histogram for measurement data (e.g., weights of individuals, reaction times to a particular stimulus) requires subdividing the measurement axis into a suitable number of class intervals or classes, such that each observation is contained in exactly one class. Suppose, for example, that we have 50 observations on x = fuel efficiency of an automobile (mpg), the smallest of which is 27.8 and the largest of which is 31.4. Then we could use the class boundaries 27.5, 28.0, 28.5, …, and 31.5 as shown here:
27.5 28.0 28.5 29.0 29.5 30.0 30.5 31.0 31.5
When all class widths are equal, a histogram is constructed as follows: first, mark the class boundaries on a horizontal axis like the one above. Then, above each interval, draw a rectangle whose height is the corresponding relative frequency (or frequency). One potential difficulty is that occasionally an observation falls on a class boundary and therefore does not lie in exactly one interval, for example, 29.0. We will use the convention that any observation falling on a class boundary will be included in the class to the right of the observation. Thus 29.0 would go in the 29.0–29.5 class rather than the 28.5–29.0 class. This is how Minitab constructs a histogram; in contrast, the default histogram in R does it the other way, with 29.0 going into the 28.5– 29.0 class. Example 1.8 Power companies need information about customer usage to obtain accurate forecasts of demands. Investigators from Wisconsin Power and Light determined energy consumption (BTUs) during a particular period for a sample of 90 gas-heated homes. For each home, an adjusted consumption value was calculated to account for weather and house size. This resulted in the accompanying data (part of the stored data set furnace.mtw available in Minitab), which we have ordered from smallest to largest. 2.97 6.80 7.73 8.61 9.60 10.28 11.12 12.31 13.47
4.00 6.85 7.87 8.67 9.76 10.30 11.21 12.62 13.60
5.20 6.94 7.93 8.69 9.82 10.35 11.29 12.69 13.96
5.56 7.15 8.00 8.81 9.83 10.36 11.43 12.71 14.24
5.94 7.16 8.26 9.07 9.83 10.40 11.62 12.91 14.35
5.98 7.23 8.29 9.27 9.84 10.49 11.70 12.92 15.12
6.35 7.29 8.37 9.37 9.96 10.50 11.70 13.11 15.24
6.62 7.62 8.47 9.43 10.04 10.64 12.16 13.38 16.06
6.72 7.62 8.54 9.52 10.21 10.95 12.19 13.42 16.90
6.78 7.69 8.58 9.58 10.28 11.09 12.28 13.43 18.26
We let Minitab select the class intervals. The most striking feature of the histogram in Figure 1.7 is its resemblance to a bell-shaped (and therefore symmetric) curve, with the point of symmetry roughly at 10.

Figure 1.7 Minitab histogram of the energy consumption data from Example 1.8 (vertical axis: percent; horizontal axis: BTUN)
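A comparable histogram can be drawn in R with hist. The sketch below is ours; it assumes the 90 adjusted consumption values are in a vector named btu (a hypothetical name). Setting right = FALSE places a boundary value such as 9.0 in the 9.0–9.5 class, matching the Minitab convention described above, whereas R's default (right = TRUE) uses the other convention.

btu <- c(2.97, 4.00, 5.20, 5.56, 5.94)        # first few values; the remaining 85 would follow
hist(btu, breaks = seq(1, 19, by = 2),        # class boundaries 1, 3, 5, ..., 19
     right = FALSE, main = "", xlab = "BTUN")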
m − 1. (This is why some authors suggest using min(m − 1, n − 1) as df in place of Welch's formula.) What impact does this have on the CI and test procedure?

98. The accompanying summary data on compression strength (lb) for 12 × 10 × 8 in. boxes appeared in the article "Compression of Single-Wall Corrugated Shipping
Containers Using Fixed and Floating Test Platens" (J. Testing Eval. 1992: 318–320). The authors stated that "the difference between the compression strength using fixed and floating platen method was found to be small compared to normal variation in compression strength between identical boxes." Do you agree?

Method     Sample size   Sample mean   Sample SD
Fixed      10            807           27
Floating   10            757           41
99. The authors of the article "Dynamics of Canopy Structure and Light Interception in Pinus elliotti, North Florida" (Ecol. Monogr. 1991: 33–51) planned an experiment to determine the effect of fertilizer on a measure of leaf area. A number of plots were available for the study, and half were selected at random to be fertilized. To ensure that the plots to receive the fertilizer and the control plots were similar, before beginning the experiment tree density (the number of trees per hectare) was recorded for eight plots to be fertilized and eight control plots, resulting in the given data. Minitab output follows.

Fertilizer plots: 1024 1216 1312 1280 1216 1312 992 1120
Control plots:    1104 1072 1088 1328 1376 1280 1120 1200
Two sample T for fertilize vs. control

            N   Mean   Std. Dev.   SE Mean
Fertilize   8   1184   126         44
Control     8   1196   118         42

95% CI for mu fertilize − mu control: (−144, 120)

a. Construct a comparative boxplot and comment on any interesting features.
b. Would you conclude that there is a significant difference in the mean tree density for fertilizer and control plots? Use α = .05.
c. Interpret the given confidence interval.

100. Is the response rate for questionnaires affected by including some sort of incentive to respond along with the questionnaire? In one experiment, 110 questionnaires with no incentive resulted in 75 being returned, whereas 98 questionnaires that included a chance to win a lottery yielded 66 responses ("Charities, No; Lotteries, No; Cash, Yes," Public Opinion Q. 1996: 542–562). Does this data suggest that including an incentive increases the likelihood of a response? State and test the relevant hypotheses at significance level .10 by using the P-value method.

101. The article "Quantitative MRI and Electrophysiology of Preoperative Carpal Tunnel Syndrome in a Female Population" (Ergonomics 1997: 642–649) reported that (−473.3, 1691.9) was a large-sample 95% confidence interval for the difference between true average thenar muscle volume (mm³) for sufferers of carpal tunnel syndrome and true average volume for nonsufferers. Calculate and interpret a 90% confidence interval for this difference.

102. The following summary data on bending strength (lb-in/in) of joints is taken from the article "Bending Strength of Corner Joints Constructed with Injection Molded Splines" (Forest Prod. J. April 1997: 89–92). Assume normal distributions.

Type                   Sample size   Sample mean   Sample SD
Without side coating   10            80.95         9.59
With side coating      10            63.23         5.96

a. Calculate a 95% lower confidence bound for true average strength of joints with a side coating.
b. Calculate a 95% lower prediction bound for the strength of a single joint with a side coating.
c. Calculate a 95% confidence interval for the difference between true average strengths for the two types of joints. 103. An experiment was carried out to compare various properties of cotton/polyester spun yarn finished with softener only and yarn finished with softener plus 5% DP-resin (“Properties of a Fabric Made with Tandem Spun Yarns,” Textile Res. J. 1996: 607– 611). One particularly important characteristic of fabric is its durability, that is, its ability to resist wear. For a sample of 40 softener-only specimens, the sample mean stoll-flex abrasion resistance (cycles) in the filling direction of the yarn was 3975.0, with a sample standard deviation of 245.1. Another sample of 40 softener-plus specimens gave a sample mean and sample standard deviation of 2795.0 and 293.7, respectively. Calculate a confidence interval with confidence level 99% for the difference between true average abrasion resistances for the two types of fabrics. Does your interval provide convincing evidence that true average resistances differ for the two types of fabrics? Why or why not? 104. The derailment of a freight train due to the catastrophic failure of a traction motor armature bearing provided the impetus for a study reported in the article “Locomotive Traction Motor Armature Bearing Life Study” (Lubricat. Engr. August 1997: 12– 19). A sample of 17 high-mileage traction motors was selected, and the amount of cone penetration (mm/10) was determined both for the pinion bearing and for the commutator armature bearing, resulting in the following data: Motor Commutator Pinion
1 211 226
2 273 278
3 305 259
4 258 244
5 270 273
6 209 236
7 223 290
8 288 287
9 296 287
10 233 242
11 262 288
12 291 242
13 278 278
14 275 208
15 210 281
16 272 274
17 264 274
Calculate an estimate of the population mean difference between penetration for the commutator armature bearing and penetration for the pinion bearing, and do so in a way that conveys information about the reliability and precision of the estimate. [Note: A normal probability plot validates the necessary normality assumption.] Would you say that the population mean difference has been precisely estimated? Does it look as though population mean penetration differs for the two types of bearings? Explain.

105. The article "Two Parameters Limiting the Sensitivity of Laboratory Tests of Condoms as Viral Barriers" (J. Test. Eval. 1996: 279–286) reported that, in brand A condoms, among 16 tears produced by a puncturing needle, the sample mean tear length was 74.0 μm, whereas for the 14 brand B tears, the sample mean length was 61.0 μm (determined using light microscopy and scanning electron micrographs). Suppose the sample standard deviations are 14.8 and 12.5, respectively (consistent with the sample ranges given in the article). The authors commented that the thicker brand B condom displayed a smaller mean tear length than the thinner brand A condom. Is this difference in fact statistically significant? State the appropriate hypotheses and test at α = .05.

106. Information about hand posture and forces generated by the fingers during manipulation of various daily objects is needed for designing high-tech hand prosthetic devices. The article "Grip Posture and Forces During Holding Cylindrical Objects with Circular Grips" (Ergonomics 1996: 1163–1176) reported that for a sample of 11 females, the sample mean four-finger pinch strength (N) was 98.1 and the sample standard deviation was 14.2. For a sample of 15 males, the sample mean and sample standard deviation were 129.2 and 39.1, respectively.
a. A test carried out to see whether true average strengths for the two genders were different resulted in t = 2.51 and P-value = .019. Does the appropriate test procedure described in this chapter yield this value of t and the stated P-value? b. Is there substantial evidence for concluding that true average strength for males exceeds that for females by more than 25 N? State and test the relevant hypotheses. 107. After the Enron scandal in the fall of 2001, faculty in accounting began to incorporate ethics more into accounting courses. One study looked at the effectiveness of such educational interventions “pre-Enron” and “post-Enron.” The data below shows students’ improvement in score on the Accounting Ethical Dilemma Instrument (AEDI) across a one-semester accounting class in Spring 2001 (“pre-Enron”) and another in Spring 2002 (“post-Enron”). (From “A Note in Ethics Educational Interventions in an Undergraduate Auditing Course: Is There an ‘Enron Effect’?” Issues Account. Educ. 2004: 53–71.) Improvement in AEDI score Class
Class                 n    Mean   SD
2001 (pre-Enron)      37   5.48   13.83
2002 (post-Enron)     21   6.31   13.20
a. Test to see whether the 2001 class showed a statistically significant improvement in AEDI score across the semester. b. Test to see whether the 2002 class showed a statistically significant improvement in AEDI score across the semester. c. Test to see whether the 2002 class showed a statistically significantly greater improvement in AEDI score than the 2001 class. In this respect, does there appear to be an “Enron effect”? 108. Torsion during hip external rotation (ER) and extension may be responsible for certain kinds of injuries in golfers and other athletes. The article “Hip Rotational
Velocities during the Full Golf Swing” (J. Sport Sci. Med. 2009: 296–299) reported on a study in which peak ER velocity and peak IR (internal rotation) velocity (both in deg/s) were determined for a sample of 15 female collegiate golfers during their swings. The following data was supplied by the article’s authors. Golfer 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ER     −130.6  −125.1   −51.7  −179.7  −130.5  −101.0   −24.4  −231.1  −186.8   −58.5  −219.3  −113.1  −244.3  −184.4  −199.2
IR      −98.9  −115.9  −161.6  −196.9  −170.7  −274.9  −275.0  −275.7  −214.6  −117.8  −326.7  −272.9  −429.1  −140.6  −345.6
Diff.   −31.7    −9.2   109.9    17.2    40.2   173.9   250.6    44.6    27.8    59.3   107.4   159.8   184.8   −43.8   146.4
a. Is it plausible that the differences came from a normally distributed population?
b. The article reported that mean(sd) = −145.3(68.0) for ER velocity and −227.8(96.6) for IR velocity. Based just on this information, could a test of hypotheses about the difference between true average IR velocity and true average ER velocity be carried out? Explain.
c. Do an appropriate hypothesis test about the difference between true average IR velocity and true average ER velocity and interpret the result.

109. The accompanying summary data on the ratio of strength to cross-sectional area for knee extensors is taken from the article "Knee Extensor and Knee Flexor Strength: Cross-Sectional Area Ratios in Young and Elderly Men" (J. Gerontol. 1992: M204–M210).

Group         Sample size   Sample mean   Standard error
Young men     13            7.47          .22
Elderly men   12            6.71          .28
Does this data suggest that the true average ratio for young men exceeds that for elderly men? Carry out a test of appropriate hypotheses using α = .05. Be sure to state any assumptions necessary for your analysis.
110. The accompanying data on response time appeared in the article "The Extinguishment of Fires Using Low-Flow Water Hose Streams—Part II" (Fire Tech. 1991: 291–320). The samples are independent, not paired.

Good visibility:   .43  1.17   .37   .47   .68   .58   .50  2.75
Poor visibility:  1.47   .80  1.58  1.53  4.33  4.23  3.25  3.22

The authors analyzed the data with the pooled t test. Does the use of this test appear justified? [Hint: Check for normality.]

111. The accompanying data on the alcohol content of wine is representative of that reported in a study in which wines from the years 1999 and 2000 were randomly selected and the actual content was determined by laboratory analysis (London Times August 5, 2001).

Wine     1     2     3     4     5     6
Actual   14.2  14.5  14.0  14.9  13.6  12.6
Label    14.0  14.0  13.5  15.0  13.0  12.5

The two-sample t test gives a test statistic value of .62 and a two-tailed P-value of .55. Does this convince you that there is no significant difference between true average actual alcohol content and true average content stated on the label? Explain.

112. The article "The Accuracy of Stated Energy Contents of Reduced-Energy, Commercially Prepared Foods" (J. Am. Diet. Assoc. 2010: 116–123) presented the accompanying data on vendor-stated gross energy and measured value (both in kcal) for 10 different supermarket convenience meals:

Meal       1    2    3    4    5    6    7    8    9    10
Stated    180  220  190  230  200  370  250  240   80  180
Measured  212  319  231  306  211  431  288  265  145  228

Obtain a 95% confidence interval for the difference of population means. By roughly what percentage are the actual calories higher than the stated value? Note that the article calls this a convenience sample and suggests that therefore it should have limited value for inference. However, even if the ten meals were a random sample from their local store, there could still be a problem in drawing conclusions about a purchase at your store.
113. How does energy intake compare to energy expenditure? One aspect of this issue was considered in the article “Measurement of Total Energy Expenditure by the Doubly Labelled Water Method in Professional Soccer Players” (J. Sports Sci. 2002: 391– 397), which contained the accompanying data (MJ/day). Player Expenditure Intake
Player        1     2     3     4     5     6     7
Expenditure   14.4  12.1  14.3  14.2  15.2  15.5  17.8
Intake        14.6   9.2  11.8  11.6  12.7  15.0  16.3
Test to see whether there is a significant difference between intake and expenditure. Does the conclusion depend on whether a significance level of .05, .01, or .001 is used?

114. An experimenter wishes to obtain a CI for the difference between true average breaking strength for cables manufactured by company I and by company II. Suppose breaking strength is normally distributed for both types of cable with σ1 = 30 psi and σ2 = 20 psi.
a. If costs dictate that the sample size for the type I cable should be three times the sample size for the type II cable,
how many observations are required if the 99% CI is to be no wider than 20 psi?
b. Suppose a total of 400 observations is to be made. How many of the observations should be made on type I cable samples if the width of the resulting interval is to be a minimum?

115. To assess the tendency of people to rationalize poor performance, 246 college students were randomly assigned to one of two groups: a negative feedback group and a positive feedback group. All students took a test which asked them to identify people's emotions based on photographs of their faces. Those in the negative feedback group were all given D grades, while those in the positive feedback group received A's (regardless of how they actually performed). A follow-up questionnaire asked students to assess the validity of the test and the importance of being able to read people's faces. The results of these two follow-up surveys appear below.

                      Test validity rating    Face reading importance rating
Group                 n     x̄      s           x̄      s
Positive feedback     123   6.95   1.09        6.62   1.19
Negative feedback     123   5.51   0.79        5.36   1.00

a. Test the hypothesis that negative feedback is associated with a lower average validity rating than positive feedback at the α = .01 level.
b. Test the hypothesis that students receiving positive feedback rate face-reading as more important, on average, than do students receiving negative feedback. Again use a 1% significance level.
c. Is it reasonable to conclude that the results seen in parts (a) and (b) are attributable to the different types of feedback? Why or why not?

116. The insulin-binding capacity (pmol/mg protein) was measured for four different groups of rats: (1) nondiabetic, (2) untreated diabetic, (3) diabetic treated with a low dose of insulin, (4) diabetic treated with a high dose of insulin. The accompanying table gives sample sizes and sample standard deviations. Denote the sample size for the ith treatment by ni and the sample variance by Si² (i = 1, 2, 3, 4). Assuming that the true variance for each treatment is σ², construct a pooled estimator of σ² that is unbiased, and verify using rules of expected value that it is indeed unbiased. What is your estimate for the following actual data? [Hint: Modify the pooled estimator Sp² from Section 10.2.]

Treatment      1     2     3     4
Sample size    16    18    8     12
Sample SD      .64   .81   .51   .35
117. Suppose a level .05 test of H0: μ1 − μ2 = 0 versus Ha: μ1 − μ2 > 0 is to be performed, assuming σ1 = σ2 = 10 and normality of both distributions, using equal sample sizes (m = n). Evaluate the probability of a type II error when μ1 − μ2 = 1 and n = 25, 100, 2500, and 10,000. Can you think of real situations in which the difference μ1 − μ2 = 1 has little practical significance? Would sample sizes of n = 10,000 be desirable in such situations?

118. Are male college students more easily bored than their female counterparts? This question was examined in the article "Boredom in Young Adults—Gender and Cultural Comparisons" (J. Cross-Cult. Psych. 1991: 209–223). The authors administered a scale called the Boredom Proneness Scale to 97 male and 148 female U.S. college students. Does the accompanying data support the research hypothesis that the mean Boredom Proneness Rating is higher for men than for women? Test the appropriate hypotheses using a .05 significance level.
Sex      Sample size   Sample mean   Sample SD
Male      97           10.40         4.83
Female   148            9.26         4.68
119. Researchers sent 5000 resumes in response to job ads that appeared in the Boston Globe and Chicago Tribune. The resumes were identical except that 2500 of them had “white sounding” first names, such as Brett and Emily, whereas the other 2500 had “black sounding” names such as Tamika and Rasheed. The resumes of the first type elicited 250 responses and the resumes of the second type only 167 responses (these numbers are consistent with information that appeared in a January 15, 2003, report by the Associated Press). Does this data strongly suggest that a resume with a “black” name is less likely to result in a response than is a resume with a “white” name? 120. Is touching by a coworker sexual harassment? This question was included on a survey given to federal employees, who responded on a scale of 1–5, with 1 meaning a strong negative and 5 indicating a strong yes. The table summarizes the results. Sex Female Male
Sex      Sample size   Sample mean   Sample SD
Female   4343          4.6056         .8659
Male     3903          4.1709        1.2157
Of course, with 1–5 being the only possible values, the normal distribution does not apply here, but the sample sizes are sufficient that it does not matter. Obtain a two-sided confidence interval for the difference in population means. Does your interval suggest that females are more likely than males to regard touching as harassment? Explain your reasoning.

121. Let X1, …, Xm be a random sample from a Poisson distribution with parameter μ1, and let Y1, …, Yn be a random sample from another Poisson distribution with parameter μ2. We wish to test H0: μ1 − μ2 = 0 against one of the three standard alternatives. When m and n are large, the CLT justifies using a large-sample z test. However, the fact that V(X̄) = μ/n suggests that a different denominator should be used in
standardizing X̄ − Ȳ. Develop a large-sample test procedure appropriate to this problem, and then apply it to the following data to test whether the plant densities for a particular species are equal in two different regions (where each observation is the number of plants found in a randomly located square sampling quadrat having area 1 m², so for region 1, there were 40 quadrats in which one plant was observed, etc.):
Number of plants   0    1    2    3    4    5    6    7
Region 1           28   40   28   17    8   2    1    1     (m = 125)
Region 2           14   25   30   18   49   2    1    1     (n = 140)
122. Referring to the previous exercise, develop a large-sample confidence interval formula for μ1 − μ2. Calculate the interval for the data given there using a confidence level of 95%.

123. Refer back to the pooled t procedures described at the end of Section 10.2. The test statistic for testing H0: μ1 − μ2 = Δ0 is

Tp = (X̄ − Ȳ − Δ0) / √(Sp²(1/m + 1/n)) = (X̄ − Ȳ − Δ0) / (Sp√(1/m + 1/n))

Show that when μ1 − μ2 = Δ′ (some alternative value for the difference), then Tp has a noncentral t distribution with df = m + n − 2 and noncentrality parameter

δ = (Δ′ − Δ0) / (σ√(1/m + 1/n))

[Hint: Look back at Exercises 39–40, as well as Chapter 9 Exercise 38.]

124. Let R1 be a rejection region with significance level α for testing H01: θ ∈ Ω1 versus Ha1: θ ∉ Ω1, and let R2 be a level α rejection region for testing H02: θ ∈ Ω2 versus Ha2: θ ∉ Ω2, where Ω1 and Ω2 are two disjoint sets of possible values of θ. Now
consider testing H0: θ ∈ Ω1 ∪ Ω2 versus the alternative Ha: θ ∉ Ω1 ∪ Ω2. The proposed rejection region is R1 ∩ R2. That is, H0 is rejected only if both H01 and H02 can be rejected. This procedure is called a union–intersection test (UIT).
a. Show that the UIT is a level α test.
b. As an example, let μT denote the mean value of a particular variable for a generic (test) drug, and μR denote the mean value of this variable for a brand-name (reference) drug. In bioequivalence testing, the relevant hypotheses are H0: μT/μR ≤ δL or μT/μR ≥ δU (the two aren't bioequivalent) versus Ha: δL < μT/μR < δU (bioequivalent). The limits δL and δU are standards set by regulatory agencies; the FDA often uses .80 and 1.25 = 1/.8, respectively. By
taking logarithms and letting η = ln(μ), τ = ln(δ), the hypotheses become H0: either ηT − ηR ≤ τL or ηT − ηR ≥ τU versus Ha: τL < ηT − ηR < τU. With this setup, a type I error involves saying the drugs are bioequivalent when they are not. The FDA mandates α = .05. Let D be an estimator of ηT − ηR with standard error SD such that the standardized variable T = [D − (ηT − ηR)]/SD has a t distribution with ν df. The standard test procedure is referred to as TOST for "two one-sided tests" and is based on the two test statistics TU = (D − τU)/SD and TL = (D − τL)/SD. If ν = 20, state the appropriate conclusion in each of the following cases: (1) tL = 2.0, tU = −1.5; (2) tL = 1.5, tU = −2.0; (3) tL = 2.0, tU = −2.0.
11  The Analysis of Variance
Introduction In studying methods for the analysis of quantitative data, we first focused on problems involving a single sample of numbers and then turned to a comparative analysis of two different samples. Now we are ready for the analysis of several samples. The analysis of variance, or more briefly ANOVA, refers broadly to a collection of statistical procedures for the analysis of quantitative responses. The simplest ANOVA problem is referred to variously as a single-factor, single-classification, or one-way ANOVA and involves the analysis of data sampled from two or more numerical populations (distributions). The characteristic that labels the populations is called the factor under study, and the populations are referred to as the levels of the factor. Examples of such situations include the following: 1. An experiment to study the effects of five different brands of gasoline on automobile engine operating efficiency (mpg) 2. An experiment to study the effects of four different sugar solutions (glucose, sucrose, fructose, and a mixture of the three) on bacterial growth 3. An experiment to investigate whether hardwood concentration in pulp has an effect on tensile strength of bags made from the pulp 4. An experiment to decide whether the color density of fabric specimens depends on the amount of dye used In (1) the factor of interest is gasoline brand, and there are five different levels of the factor. The factor in (2) is sugar, with four levels (or five, if a control solution containing no sugar is used). The factor in both of these first two examples is categorical in nature, and the levels correspond to possible categories of the factor. In (3) and (4), the factors are concentration of hardwood and amount of dye, respectively; both these factors are quantitative in nature, so the levels identify different settings of the factor. When the factor of interest is quantitative, statistical techniques from regression analysis (discussed in Chapter 12) can also be used to analyze the data. Here we first introduce single-factor ANOVA. Section 11.1 presents the F test for testing the null hypothesis that the population means are identical. Section 11.2 considers further analysis of the data when H0 has been rejected. Section 11.3 covers some other aspects of single-factor ANOVA. Many experimental situations involve studying the simultaneous impact of more than one factor. Various aspects of two-factor ANOVA are considered in the last two sections of the chapter.
11.1  Single-Factor ANOVA
Single-factor ANOVA focuses on a comparison of two or more populations. For example, McDonalds may wish to compare the average revenue associated with three different advertising campaigns, or a team of animal nutritionists may carry out an experiment to compare the effect of five different diets on weight gain, or FedEx may want to compare the strengths of cardboard shipping boxes from four different vendors. Let

I = the number of populations/treatments being compared
μ1 = the mean of population 1 (or the true average response when treatment 1 is applied)
⋮
μI = the mean of population I (or the true average response when treatment I is applied)

Then the hypotheses of interest are

H0: μ1 = μ2 = ··· = μI   versus   Ha: at least two of the μi's are different

If I = 4, H0 is true only if all four μi's are identical. Ha would be true, for example, if μ1 = μ2 ≠ μ3 = μ4, if μ1 = μ3 = μ4 ≠ μ2, or if all four μi's differ from each other. A test of these hypotheses requires that we have available a random sample from each population or treatment.
Since ANOVA focuses on a comparison of means, you may wonder why the method is called analysis of variance (actually, analysis of variability would be a better name). The following example illustrates why it is appropriate to consider variability.

Example 11.1 The article "Compression of Single-Wall Corrugated Shipping Containers Using Fixed and Floating Test Platens" (J. Test. Eval. 1992: 318–320) describes an experiment in which several different types of boxes were compared with respect to compression strength (lb). Table 11.1 presents the results of an experiment involving I = 4 types of boxes (the sample means and standard deviations are in good agreement with values given in the article).

Table 11.1 The data and summary quantities for Example 11.1
Type of box   Compression strength (lb)                  Sample mean   Sample SD
1             655.5  788.3  734.3  721.4  679.1  699.4   713.00        46.55
2             789.2  772.5  786.9  686.1  732.1  774.8   756.93        40.34
3             737.1  639.0  696.3  671.7  717.2  727.1   698.07        37.20
4             535.1  628.7  542.4  559.0  586.9  520.0   562.02        39.87
                                           Grand mean =  682.50
With μi denoting the true average compression strength for boxes of type i (i = 1, 2, 3, 4), the null hypothesis is H0: μ1 = μ2 = μ3 = μ4. Figure 11.1a shows a comparative boxplot for the four samples. There is a substantial amount of overlap among observations on the first three box types, but compression strengths for the fourth type appear considerably smaller than for the others. This suggests that H0 is not true.
Figure 11.1 Boxplots for Example 11.1: (a) original data; (b) data with one mean altered; (c) data with standard deviations altered
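A comparative boxplot like the one in Figure 11.1a can be obtained in R from the data in Table 11.1. The following is a minimal sketch; the vector names strength and type are ours, not the book's.

strength <- c(655.5, 788.3, 734.3, 721.4, 679.1, 699.4,    # type 1
              789.2, 772.5, 786.9, 686.1, 732.1, 774.8,    # type 2
              737.1, 639.0, 696.3, 671.7, 717.2, 727.1,    # type 3
              535.1, 628.7, 542.4, 559.0, 586.9, 520.0)    # type 4
type <- factor(rep(1:4, each = 6))
boxplot(strength ~ type, horizontal = TRUE,
        xlab = "Compression strength (lb)", ylab = "Type of box")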
The comparative boxplot in Figure 11.1b is based on adding 120 to each observation in the fourth sample (giving a mean of 682.02 and the same standard deviation) and leaving the other samples unaltered. Because the sample means are now closer together, it is no longer obvious whether H0 should be rejected. Lastly, the comparative boxplot in Figure 11.1c is based on inflating the standard deviation of each sample while maintaining the values of the original sample means. Once again, it is unclear from the graph whether H0 is true, even though now the sample means are separated by the same amounts as they were in Figure 11.1a. These graphs suggest that it's insufficient to consider only how far apart the sample means are in assessing whether the population means are different; we must also account for the amount of variability within each of the I samples. ■

Notation and Assumptions
In two-sample problems, we used the letters X and Y to designate the observations in the two samples. Because this is cumbersome for three or more samples, it is customary to use a single letter with two subscripts. The first subscript identifies the sample number, corresponding to the population or treatment being sampled, and the second subscript denotes the position of the observation within that sample. Let

Xij = the random variable denoting the jth measurement from the ith population or treatment
xij = the observed value of Xij when the experiment is performed

The observed data is often displayed in a rectangular table, such as Table 11.1. There, samples from the different populations appear in different rows of the table, and xij is the jth number in the ith sample. For example, x23 = 786.9 (the third observation from the second population), and x41 = 535.1. When there is potential ambiguity, we will write xi,j rather than xij (e.g., if there were 15 observations on each of 12 treatments, x112 could mean x1,12 or x11,2).
It is assumed that the Xij's within any particular sample are independent—a random sample from the ith population or treatment distribution—and that different samples are independent of each other. In some studies, different samples contain different numbers of observations. However, the concepts and methods of single-factor ANOVA are most easily developed for the case of equal sample sizes, known as a balanced study design. Unequal sample sizes will be considered in Section 11.3. Restricting ourselves for the moment to balanced designs, let J denote the number of observations in each sample (J = 6 in Example 11.1). The data set consists of n = IJ observations. The sample means will be denoted by X̄1, X̄2, …, X̄I. That is,

X̄i = (1/J) Σj=1..J Xij    i = 1, 2, …, I

Similarly, the average of all IJ observations, called the grand mean, is

X̄·· = (1/(IJ)) Σi=1..I Σj=1..J Xij = (1/I) Σi=1..I X̄i

For the strength data in Table 11.1, x̄1 = 713.00, x̄2 = 756.93, x̄3 = 698.07, x̄4 = 562.02, and x̄·· = 682.50. Additionally, let S1², S2², …, SI² represent the sample variances:

Si² = (1/(J − 1)) Σj=1..J (Xij − X̄i)²    i = 1, 2, …, I

From Example 11.1, s1 = 46.55, s1² = 2166.90, and so on.

ANOVA ASSUMPTIONS
The I population or treatment distributions are all normal with the same variance σ². That is, the Xij's are independent and normally distributed with

E(Xij) = μi    V(Xij) = σ²
The plausibility of independence of the samples (and of the individual observations within the samples) stems from a study’s design. At the end of this section we discuss methods for checking the plausibility of the normality and equal variance assumptions. Sums of Squares and Mean Squares Example 11.1 suggests the need for two distinct measures of variation: between-samples variation (i.e., the disparity between the I sample means) and within-samples variation (assessing variation separately within each sample and then combining). The test procedure we will develop shortly is based on the following measures of variation in the data. DEFINITION
A measure of between-samples variation is the treatment sum of squares SSTr, given by

SSTr = Σi Σj (X̄i − X̄··)² = J Σi (X̄i − X̄··)² = J[(X̄1 − X̄··)² + ··· + (X̄I − X̄··)²]

A measure of within-samples variation is the error sum of squares SSE, given by

SSE = Σi Σj (Xij − X̄i)² = Σi (J − 1)Si² = (J − 1)[S1² + S2² + ··· + SI²]
Thus SSTr assesses variation in the means from the different samples, whereas SSE entails assessing variability within each sample separately (via the sample variance) and then combining these assessments.
Example 11.2 (Example 11.1 continued) For the box compression data displayed in Figure 11.1a,

SSTr = 6(713.00 − 682.50)² + 6(756.93 − 682.50)² + 6(698.07 − 682.50)² + 6(562.02 − 682.50)² = 127,374.7

while

SSE = (6 − 1)(46.55)² + (6 − 1)(40.34)² + (6 − 1)(37.20)² + (6 − 1)(39.87)² = 33,838.4

Similar computations can be applied to the two modified versions of the original data; the various sums of squares are summarized below.
        Figure 11.1a   Figure 11.1b   Figure 11.1c
SSTr    127,374.7       14,004.6      127,374.7
SSE      33,838.4       33,838.4      211,488.0
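These sums of squares can be reproduced directly from the summary statistics in Table 11.1. A short R sketch (under the balanced design, the grand mean is just the average of the four sample means):

J     <- 6
xbar  <- c(713.00, 756.93, 698.07, 562.02)     # sample means
s     <- c(46.55, 40.34, 37.20, 39.87)         # sample standard deviations
grand <- mean(xbar)                            # 682.50
SSTr  <- J * sum((xbar - grand)^2)             # approximately 127,375
SSE   <- (J - 1) * sum(s^2)                    # approximately 33,838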
For the altered data on which Figure 11.1b is based, x̄4 = 682.02 and the revised grand mean is x̄·· = 712.50. This greatly reduces SSTr, reflecting the fact that the sample means are less far apart in Figure 11.1b than in Figure 11.1a. Since the standard deviations of the samples were not changed, SSE for the data displayed in Figure 11.1a, b are identical. For the data used to construct Figure 11.1c, the value of SSTr is unchanged from Figure 11.1a—the x̄i's were not altered, so this measure of between-samples variation stays the same. On the other hand, SSE for Figure 11.1c is much larger than for the actual data. Since the altered data exhibits much greater within-samples variation than does the actual data, the corresponding SSE should be correspondingly greater. ■

A descriptive understanding of the treatment and error sums of squares is provided by the following fundamental identity.

THEOREM (Fundamental ANOVA Identity)
SSTr + SSE = SST

where SST is the total sum of squares given by

SST = Σi Σj (xij − x̄··)²

The proof of the identity follows from squaring both sides of the relationship

xij − x̄·· = (xij − x̄i) + (x̄i − x̄··)        (11.1)
and summing over all i and j. This gives SST on the left and SSTr and SSE as the two extreme terms on the right; the cross-product term is easily seen to be zero (Exercise 13). The interpretation of the fundamental identity is an important aid to understanding ANOVA. SST is a measure of the total variation in the data—the sum of all squared deviations about the grand mean. The identity says that this total variation can be partitioned into two pieces. SSTr is the amount of variation (between samples) that can be explained by possible differences in the μi's: when the μi's
differ substantially, the individual sample means should be further from the grand mean than when the μi's are identical or close to one another. SSE measures variation that would be present (within samples) even if H0 were true; thus, SSE is the part of total variation that is unexplained by the truth or falsity of H0. If explained variation is large relative to unexplained variation, then H0 should be rejected in favor of Ha.
Formal inference will require the sampling distributions of both the statistics SSTr and SSE. Recall a result from Section 6.4: if X1, …, Xn is a random sample from a normal distribution, then the sample mean X̄ and the sample variance S² are independent. Also, X̄ is normally distributed, and (n − 1)S²/σ² has a chi-squared distribution with n − 1 df. Similar results hold in our ANOVA situation.

THEOREM
When the ANOVA assumptions are satisfied,
1. SSE and SSTr are independent random variables.
2. SSE/σ² has a chi-squared distribution with IJ − I df.
Furthermore, when H0: μ1 = ··· = μI is true,
3. SSTr/σ² has a chi-squared distribution with I − 1 df.
Proof Independence of SSTr and SSE follows from the fact that SSTr is based on the individual sample means whereas SSE is based on the sample variances, and X̄i is independent of Si² for each i. Next, SSE/σ² can be expressed as the sum of chi-squared rvs:

SSE/σ² = (J − 1)S1²/σ² + ··· + (J − 1)SI²/σ²

Each term in the sum has a χ² distribution with J − 1 df, and dfs add because the samples are independent. Thus SSE/σ² also has a chi-squared distribution, with df = (J − 1) + ··· + (J − 1) = I(J − 1) = IJ − I.
Now suppose H0 is true and let Yi = X̄i for i = 1, …, I. Then Y1, Y2, …, YI are independent and normally distributed with the same mean and with variance σ²/J. Thus, by the key result from Section 6.4, (I − 1)SY²/(σ²/J) has a chi-squared distribution with I − 1 df when H0 is true. Furthermore,

(I − 1)SY²/(σ²/J) = (J/σ²) Σi (Yi − Ȳ)² = (J/σ²) Σi (X̄i − X̄··)² = SSTr/σ²

so under H0, SSTr/σ² has a chi-squared distribution with I − 1 df.
■
SSTr and SSE provide measures of between- and within-samples variability, respectively, but they are not (yet) directly comparable. Analogous to the definition of sample variance in Chapter 1, wherein a sum of squares was divided by its degrees of freedom, we make the following definitions. DEFINITION
The mean square for treatments MSTr and the mean square for error MSE are

MSTr = SSTr/(I − 1)        MSE = SSE/(IJ − I)
The word “mean” is again being used in the sense of average; a mean square is a sum of squares divided by its associated degrees of freedom. The next proposition sets the stage for our ultimate ANOVA hypothesis test. PROPOSITION
When the ANOVA assumptions are satisfied, E(MSE) = σ²; that is, MSE is an unbiased estimator of σ². Moreover, when H0: μ1 = ··· = μI is true, E(MSTr) = σ²; in this case, MSTr is also an unbiased estimator of σ².
Proof The expected value of a chi-squared variable with ν df is just ν. Thus, from the previous theorem,

E(SSE/σ²) = IJ − I  ⇒  E(MSE) = E[SSE/(IJ − I)] = σ²

and

H0 true  ⇒  E(SSTr/σ²) = I − 1  ⇒  E(MSTr) = E[SSTr/(I − 1)] = σ²
■
MSTr is unbiased for σ² when H0 is true, but what about when H0 is false? It can be shown (Exercise 14) that in this case, E(MSTr) > σ². This is because the X̄i's tend to differ more from each other, and therefore from the grand mean, when the μi's are not identical than when they are the same.

The F Test
It follows from the preceding discussion that when H0 is true the values of MSTr and MSE should be close to each other. Equivalently, the ratio of these two quantities should be relatively near 1. On the other hand, if H0 is false then MSTr ought to exceed MSE, so their ratio will tend to exceed 1. This suggests that a sensible test statistic for ANOVA is MSTr/MSE, but how large must this be to provide convincing evidence against H0? Answering this question requires knowing the sampling distribution of this ratio.
In Section 6.3 we introduced a family of probability distributions called F distributions, which became the basis for inference on the ratio of two variances in Section 10.5. If Y1 and Y2 are two independent chi-squared random variables with ν1 and ν2 df, respectively, then the ratio F = (Y1/ν1)/(Y2/ν2) has an F distribution with ν1 numerator df and ν2 denominator df. Appendix Table A.8 gives F critical values for α = .10, .05, .01, and .001. Values of ν1 are identified with different columns of the table and the rows are labeled with various values of ν2. For example, the F critical value that captures upper-tail area .05 under the F curve with ν1 = 4 and ν2 = 6 is F.05,4,6 = 4.53, whereas F.05,6,4 = 6.16 (so don't accidentally switch numerator and denominator df!). The key theoretical result that justifies the ANOVA test procedure is that the test statistic MSTr/MSE has an F distribution when H0 is true.
PROPOSITION
When the ANOVA assumptions are satisfied and H0: μ1 = ··· = μI is true, the test statistic F = MSTr/MSE has an F distribution with I − 1 numerator df and IJ − I denominator df.
This theorem follows immediately from rewriting F as

F = [(SSTr/σ²)/(I − 1)] / [(SSE/σ²)/(IJ − I)]

and then applying the definition of the F distribution along with properties of SSTr and SSE established earlier in this section. Here, finally, is the test procedure.

F TEST FOR SINGLE-FACTOR ANOVA
Null hypothesis: H0: μ1 = ··· = μI
Alternative hypothesis: Ha: not all of the μi's are equal
Test statistic value: f = MSTr/MSE = [SSTr/(I − 1)] / [SSE/(IJ − I)]
Rejection region for level α test: f > Fα,I−1,IJ−I
P-value calculation: area under the FI−1,IJ−I curve to the right of f
Refer to Section 10.5 to see how P-value information for F tests can be obtained from the table of F critical values. Alternatively, statistical software packages will automatically include the P-value with ANOVA output. The computations building up to the test statistic value f are often summarized in a tabular format, called an ANOVA table, as displayed in Table 11.2. Tables produced by statistical software customarily include a P-value column to the right of the f column.

Table 11.2 An ANOVA table
Source of variation   df       Sum of squares   Mean square            f
Treatments            I − 1    SSTr             MSTr = SSTr/(I − 1)    MSTr/MSE
Error                 IJ − I   SSE              MSE = SSE/(IJ − I)
Total                 IJ − 1   SST
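With the raw data available, statistical software will produce the entire ANOVA table. As a sketch, R's built-in aov function applied to the box-compression data (using the strength and type vectors defined in the earlier boxplot sketch) gives Table 11.2 directly:

fit <- aov(strength ~ type)     # single-factor ANOVA
anova(fit)                      # df, sums of squares, mean squares, F statistic, and P-value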
Example 11.3 With the ever-increasing power demand driven by everyone’s electronic devices, engineers have begun exploring ways to tap into the energy discharge from household items—a hot stove pan, a candle flame, or even a hot soup bowl. The article “Low-Power Energy Harvesting of Thermoelectric Battery Charger with Step-Up DC–DC Converter: Applicable Case Study for Personal Electronic Gadgets” (J. Energy Engr. 2017) describes an experiment to compare the charging characteristics of five thermoelectric modules under certain conditions. Consider the accompanying data on the maximum power per unit area (mW/cm2) for J = 4 replications of the experiment on each of the I = 5 modules.
Module   Max. power per cm²             x̄i      si
1        98.8   93.1   96.6   91.8      95.08   3.21
2        82.5   87.5   88.8   91.3      87.53   3.70
3        77.7   74.7   76.2   78.8      76.85   1.79
4        82.6   80.5   82.8   84.1      82.50   1.49
5        91.9   87.5   86.9   90.0      89.08   2.31
Let μi denote the true mean max power per unit area when module i is used (i = 1, 2, 3, 4, 5). The null hypothesis H0: μ1 = μ2 = μ3 = μ4 = μ5 states that the true average is identical for the five modules. Let's carry out a test at significance level .01 to see whether H0 should be rejected in favor of the assertion that true average max power per unit area is not the same for all modules. (At this point the plausibility of the normality and equal variance assumptions should be checked; we defer those tasks to later in this section.) Since I − 1 = 5 − 1 = 4 and IJ − I = 20 − 5 = 15, the F critical value for the rejection region is F.01,4,15 = 4.89. The grand mean for the 20 observations is x̄·· = 86.20 mW/cm². The treatment and error sums of squares are

SSTr = 4[(95.08 − 86.20)² + ··· + (89.08 − 86.20)²] = 759.6
SSE = (4 − 1)[(3.21)² + ··· + (2.31)²] = 104.2

The remaining computations are summarized in the accompanying ANOVA table. Because f = 27.33 > F.01,4,15 = 4.89, H0 is rejected at significance level .01. The P-value is the area under the F4,15 curve to the right of 27.33, which is 0 to several decimal places (and, in particular, far less than α = .01). The modules clearly do not yield the same mean maximum power per cm² in size.
Source of variation   df   Sum of squares   Mean square   f       P-value
Treatments            4    759.6            189.90        27.33   ≈ 0
Error                 15   104.2              6.95
Total                 19   863.8
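The F ratio and P-value in this table can be checked with a few lines of R, starting from the sums of squares computed above:

I <- 5; J <- 4
SSTr <- 759.6; SSE <- 104.2
MSTr <- SSTr / (I - 1)                                    # 189.9
MSE  <- SSE / (I * J - I)                                 # about 6.95
f    <- MSTr / MSE                                        # about 27.3
pf(f, df1 = I - 1, df2 = I * J - I, lower.tail = FALSE)   # P-value, essentially 0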
F.001,4,40 = 5.70, H0 is decisively rejected. We now use Tukey's procedure to look for significant differences among the μi's. From Appendix Table A.9, Q.05,5,40 = 4.04 (the second subscript on Q is 5 and not 5 − 1 as in F). Applying Equation (11.2), the CI for each μi − μj is
(x̄i − x̄j) ± 4.04√(.088/9) = x̄i − x̄j ± 0.4
Due to the balanced design, the margin of error is the same on all (5 choose 2) = 10 Tukey CIs; if the sample
sizes differed, this would not be the case (see Section 11.3). The resulting CIs are displayed in Table 11.5; those marked with an asterisk do not include zero:

Table 11.5 Tukey's simultaneous confidence intervals for Example 11.4

i   j   CI for μi − μj     i   j   CI for μi − μj
1   2   (0.3, 1.1)*        2   4   (−0.9, −0.1)*
1   3   (0.8, 1.6)*        2   5   (0.3, 1.1)*
1   4   (−0.2, 0.6)        3   4   (−1.4, −0.6)*
1   5   (1.0, 1.8)*        3   5   (−0.2, 0.6)
2   3   (0.1, 0.9)*        4   5   (0.8, 1.6)*
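The Studentized range critical value and the common margin of error behind these intervals can be verified in R; the values below (MSE = .088, J = 9, 40 error df) are taken from the example. With the raw data, TukeyHSD applied to an aov fit would return all ten intervals directly.

Q <- qtukey(0.95, nmeans = 5, df = 40)    # approximately 4.04
Q * sqrt(0.088 / 9)                       # margin of error, approximately 0.4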
Thus brands 1 and 4 are not significantly different from one another, but they are significantly higher than the other three brands in their true average contents. Brand 2 is significantly better than 3 and 5 but worse than 1 and 4, and brands 3 and 5 do not differ significantly. ■
While the CIs in Table 11.5 correctly indicate which means are believed to be significantly different, this display is rather unwieldy. It is preferable to list out the observed sample means (say, from smallest to largest) and somehow indicate which are "close enough" that the corresponding population means are not judged to be significantly different. The following box describes how nonsignificant differences can be identified visually using an "underscoring pattern."

Tukey's Method for Identifying Significantly Different μi's
1. List the sample means in increasing order (make sure to identify the corresponding population/treatment for each x̄i).
2. Starting at the far left, use the Tukey intervals to determine which means are not significantly different from the first one in the list. Underscore that set of means with a single line segment.
3. Continue in this fashion for the second mean, third mean, etc., always underscoring in the rightward direction. Duplicate underscorings should only be drawn once.
Any pair of sample means not underscored by the same line correspond to a pair of population or treatment means that are judged significantly different.

In fact, it is not necessary to construct the Tukey intervals in order to perform step 2. Rather, the CI for μi − μj will include 0 if and only if x̄i and x̄j differ by less than Qα,I,I(J−1)√(MSE/J), the margin of error of the confidence interval (11.2). This margin of error is sometimes referred to as Tukey's honestly significant difference (HSD).
As an example, consider I = 5 with x̄2 < x̄5 < x̄4 < x̄1 < x̄3. Suppose the Tukey confidence intervals for μ2 − μ5 and μ2 − μ4 include zero, but those for μ2 − μ1 and μ2 − μ3 do not. Then we draw a line segment starting from 2 and extending to 4:

Group:         2     5     4     1     3
Sample mean:   x̄2    x̄5    x̄4    x̄1    x̄3
               ---------------

Next, suppose the mean for group 5 is not significantly different from that of group 4. Since we have already accounted for that set, no duplicate line is drawn. Finally, if groups 4 and 1 are not significantly different, that pair is underscored:

Group:         2     5     4     1     3
Sample mean:   x̄2    x̄5    x̄4    x̄1    x̄3
               ---------------
                           ---------
The fact that x3 isn’t underlined at all indicates that l3 is statistically significantly different from all other group means. Example 11.6 (Example 11.5 continued) The five sample means, arranged in order, are Brand of filter: Sample mean:
5 13:1
3 13:3
2 13:8
4 14:3
1 14:5
Only two of the Tukey CIs included zero: the interval for l1 – l4 and that for l3 – l5. Equivalently, pffiffiffiffiffiffiffiffiffiffiffiffiffiffi only those two pairs of means differ by less than the honestly significant difference 4:04 :088=9 ¼ :4 The resulting underscoring pattern is Brand of filter: Sample mean:
5 13:1
3 13:3
2 13:8
4 14:3
1 14:5
The mean for brand 2 is not underlined at all, since l2 was judged to be significantly different from all other means. If x2 ¼ 13:6 rather than 13.8 with HSD = .4, the underscoring configuration would be Brand of filter: Sample mean:
5 13:1
3 13:3
2 13:6
4 14:3
1 14:5
The interpretation of this underscoring must be done with care, since we seem to have concluded that brands 5 and 3 do not differ significantly, 3 and 2 also do not, yet 5 and 2 do differ. One could say here that although evidence allows us to conclude that brands 5 and 2 differ from each other, neither has been shown to be significantly different from brand 3. ■ Example 11.7 Almost anyone who attends physical therapy receives transcutaneous electrical nerve stimulation (TENS), possibly combined with a cold compress, to address pain. The article “Effect of Burst TENS and Conventional TENS Combined with Cryotherapy on Pressure Pain Threshold” (Physiotherapy 2015: 155–160) summarizes an experiment in which 112 healthy women were each
randomly assigned to one of seven treatments listed in the accompanying table (so, J = 16 for each treatment). After the treatment, researchers measured each woman's pain threshold, in kg of force applied to the top of the humerus, resulting in the accompanying data.

Pain threshold (kg of force)

Treatment                              x̄i    si
(1) Control                            2.8   0.7
(2) Placebo TENS                       2.3   0.9
(3) Conventional TENS                  3.2   1.0
(4) Burst TENS                         4.3   0.9
(5) Cryotherapy                        4.4   0.9
(6) Cryotherapy + burst TENS           5.7   0.8
(7) Cryotherapy + conventional TENS    3.0   0.7
Let μi = the true mean pain threshold after the ith treatment (i = 1, …, 7). We wish to test the null hypothesis H0: μ1 = ··· = μ7 against the alternative that not all μi's are equal. From the sample means and standard deviations, SSTr = 133.67 and SSE = 75.75, giving the ANOVA table in Table 11.6. Since f = 30.88 > F.01,6,105 = 2.98, H0 is rejected at the .01 level; in fact, P-value ≈ 0. We conclude that the true mean pain threshold differs across the seven treatments.

Table 11.6 ANOVA table for Example 11.7
Source of variation   df    Sum of squares   Mean square   f
Treatments            6     133.67           22.28         30.88
Error                 105    75.75            0.72
Total                 111   209.42
Next, we apply Tukey’s method. There are I = 7 treatments and 105 df for error, so Q:01;7;105 pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 5:02 (interpolating from Table A.9) and Tukey’s HSD ¼ 5:02 0:72=16 ¼ 1:06. Ordering the means and underscoring yields ð2Þ 2:3
ð1Þ 2:8
ð7Þ 3:0
ð3Þ 3:2
ð4Þ 4:3
ð5Þ 4:4
ð6Þ 5:7
Higher pain thresholds are considered better. In that respect, treatment 6 (cryotherapy plus burst TENS) is the clear winner, since the mean pain threshold under that treatment is highest and significantly different from all others. Treatments 4 and 5 (just burst TENS or just cryotherapy) are next-best and are not significantly different from each other. Finally, the other four treatments are not honestly significantly different—and, in particular, treatments 3 and 7 are comparable to the control and placebo groups. Many research journals and some statistical software packages express the underscoring scheme with letter groupings. Figure 11.3 (p. 659) shows Minitab output from this analysis. The A, B, C letter groupings correspond to the three nonoverlapping sets identified above: treatment 6, treatments 4 and 5, and the rest.
Grouping Information Using the Tukey Method and 99% Confidence

Treatment    N    Mean    Grouping
6            16   5.700   A
5            16   4.400   B
4            16   4.300   B
3            16   3.200   C
7            16   3.000   C
1            16   2.800   C
2            16   2.300   C
Means that do not share a letter are significantly different.
Figure 11.3 Tukey’s method using Minitab
■
The Interpretation of α in Tukey's Procedure
We stated previously that the simultaneous confidence level is controlled by Tukey's method. So what does "simultaneous" mean here? Consider calculating a 95% CI for a population mean μ based on a sample from that population and then a 95% CI for a population proportion p based on another sample selected independently of the first one. Prior to obtaining data, the probability that the first interval will include μ is .95, and this is also the probability that the second interval will include p. Because the two samples are selected independently of each other, the probability that both intervals will include the values of the respective parameters is (.95)(.95) = (.95)² = .9025. Thus the simultaneous or joint confidence level for the two intervals is roughly 90%—if pairs of intervals are calculated over and over again from independent samples, in the long run 90.25% of the time the first interval will capture μ and the second will include p. Similarly, if three CIs are calculated based on independent samples, the simultaneous confidence level will be 100(.95)³% ≈ 86%. Clearly, as the number of intervals increases, the simultaneous confidence level that all intervals capture their respective parameters will decrease.
Now suppose that we want to maintain the simultaneous confidence level at 95%. Then for two independent samples, the individual confidence level for each would have to be 100√.95 % ≈ 97.5%. The larger the number of intervals, the higher the individual confidence level would have to be to maintain the 95% simultaneous level.
The tricky thing about the Tukey intervals is that they are not based on independent samples—MSE appears in every one, and various intervals share the same x̄i's (e.g., in the case I = 4, three different intervals all use x̄1). This implies that there is no straightforward probability argument for ascertaining the simultaneous confidence level from the individual confidence levels. Nevertheless, if Q.05 is used, the simultaneous confidence level is controlled at 95%, whereas using Q.01 gives a simultaneous 99% level. To obtain a 95% simultaneous level, the individual level for each interval must be considerably larger than 95%. Said in a slightly different way, to obtain a 5% experimentwise or family error rate, the individual or per-comparison error rate for each interval must be considerably smaller than .05.

Confidence Intervals for Other Parametric Functions
In some situations, a CI is desired for a function of the μi's more complicated than a difference μi − μj. One such function is Σ ci μi, where the ci's are constants. Let θ = ½(μ1 + μ2) − ⅓(μ3 + μ4 + μ5), which in the context of Example 11.5 measures the difference between the group consisting of the first two brands and that of the last three brands. Because the Xij's are normally
distributed with E(Xij) = μi and V(Xij) = σ², the natural estimator θ̂ = Σi ci X̄i is normally distributed, unbiased for θ, and

V(θ̂) = V(Σi ci X̄i) = Σi ci² V(X̄i) = (σ²/J) Σi ci²

Estimating σ² by MSE and forming σ̂_θ̂ results in a t variable (θ̂ − θ)/σ̂_θ̂, which can be manipulated to obtain the following 100(1 − α)% confidence interval for Σ ci μi:

Σ ci x̄i ± t_{α/2, I(J−1)} √(MSE Σ ci²/J)    (11.3)
Example 11.8 (Example 11.5 continued) The parametric function for comparing the first two (store) brands of oil filter with the last three (national) brands is θ = (1/2)(μ1 + μ2) − (1/3)(μ3 + μ4 + μ5), from which

Σ ci² = (1/2)² + (1/2)² + (−1/3)² + (−1/3)² + (−1/3)² = 5/6

With θ̂ = (1/2)(x̄1 + x̄2) − (1/3)(x̄3 + x̄4 + x̄5) = .583 and MSE = .088, a 95% CI for θ is

.583 ± 2.021√((.088)(5/6)/9) = .583 ± .182 = (.401, .765)
■
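Interval (11.3) is easy to compute from summary quantities. The following R lines are a minimal sketch (the function name ci_contrast and its arguments are ours, not from the text); with the values of Example 11.8—θ̂ = .583, MSE = .088, J = 9 observations per brand, and error df I(J − 1) = 40—it reproduces the interval above.

# Sketch: 100(1 - alpha)% CI for a contrast sum(c_i * mu_i), Eq. (11.3),
# computed from summary quantities when each treatment has J observations.
ci_contrast <- function(theta_hat, c_vec, MSE, J, df_error, conf = 0.95) {
  half <- qt(1 - (1 - conf)/2, df_error) * sqrt(MSE * sum(c_vec^2) / J)
  c(lower = theta_hat - half, upper = theta_hat + half)
}
ci_contrast(0.583, c(1/2, 1/2, -1/3, -1/3, -1/3), MSE = 0.088, J = 9, df_error = 40)
# returns approximately (.401, .765)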
Notice that in the foregoing example the coefficients c1, …, c5 satisfy Σ ci = 1/2 + 1/2 − 1/3 − 1/3 − 1/3 = 0. When the coefficients sum to 0, the linear combination θ = Σ ci μi is called a contrast among the means, and the analysis is available in a number of statistical software programs. Sometimes an experiment is carried out to compare each of several "new" treatments to a control treatment. In such situations, a multiple comparisons technique called Dunnett's method is appropriate.

Exercises: Section 11.2 (15–26)

15. An experiment to compare the spreading rates of five different brands of yellow interior latex paint available in a particular area used 4 gallons (J = 4) of each paint. The sample average spreading rates (ft²/gal) for the five brands were x̄1 = 462.0, x̄2 = 512.8, x̄3 = 437.5, x̄4 = 469.3, and x̄5 = 532.1. The computed value of F was found to be significant at level α = .05. With MSE = 272.8, use Tukey's procedure to investigate significant differences in the true average spreading rates between brands.

16. In the previous exercise, suppose x̄3 = 427.5. Now which true average spreading rates differ significantly from each other? Be sure to use the method of underscoring to illustrate your conclusions, and write a paragraph summarizing your results.

17. Repeat the previous exercise supposing that x̄2 = 502.8 in addition to x̄3 = 427.5.

18. Consider the data on maximum load for cracked wood beams presented in Exercise 4. Would it make sense to apply Tukey's method to this data? Why or why not? [Hint: The P-value from the analysis of variance is .169.]

19. Use Tukey's procedure on the data of Example 11.1 to identify differences in true average compression strength among the four box types. Is your answer consistent with the boxplot in Figure 11.1a?

20. Use Tukey's procedure on the data of Example 11.3 to identify differences in true mean maximum power per unit area among the five modules.

21. Of the five modules in Example 11.3, the first two are existing commercial devices and the last three are prototypes constructed by the study's authors. Compute a 95% t CI for the contrast θ = (1/2)(μ1 + μ2) − (1/3)(μ3 + μ4 + μ5).

22. The article "Iron and Manganese Present in Underground Water Promote Biochemical, Genotoxic, and Behavioral Alterations in Zebrafish" (Environ. Sci. Pollut. Res. 2019: 23555–23570) reports the following data on micronucleus frequency in the muscle tissues for zebrafish exposed to varying concentrations of iron and manganese. Micronuclei are a warning sign of possible DNA damage. There were J = 10 zebrafish in each group.

Treatment              x̄i       si
1. Control            53.72     4.30
2. Fe 0.8            139.11     6.41
3. Fe 1.3             93.62     4.49
4. Mn 0.2            134.66    13.20
5. Mn 0.4            141.12     8.32
6. Fe 0.8/Mn 0.2     101.25    15.41
7. Fe 1.3/Mn 0.4     124.50     9.60

a. Perform an analysis of variance at the .01 significance level. [Note: Though unequal variances are a concern here, the balanced study design should at least partially mitigate that issue.]
b. Apply Tukey's method to determine which treatments result in significantly different mean micronucleus frequencies.
c. Suppose that 100(1 − α)% CIs for k different parametric functions are computed from the same ANOVA data set. Then it is easily verified that the simultaneous confidence level is at least 100(1 − kα)%. Compute CIs with simultaneous confidence level at least 98% for the contrasts μ1 − (1/6)(μ2 + μ3 + μ4 + μ5 + μ6 + μ7) and (1/4)(μ2 + μ3 + μ6 + μ7) − (1/2)(μ4 + μ5).

23. Refer back to the bike helmet data of Exercise 5.
a. Apply Tukey's procedure at the .05 level to determine which bike helmets have honestly significantly different mean peak linear acceleration under the specified experimental conditions. Use SSE = 3900.
b. Seven of the 10 brands are considered road helmets (elongated shape with aerodynamic venting), while three brands—BMIPS (1), GMIPS (5), and NW (7)—are nonroad helmets. Compute a 95% CI for the contrast θ = (1/7)(μ2 + ⋯ + μ10) − (1/3)(μ1 + μ5 + μ7), where the first sum spans across the seven road helmet brands.

24. Consider the accompanying data on plant growth after the application of different types of growth hormone.

Hormone    Growth
1          13  17   7  14
2          21  13  20  17
3          18  15  20  17
4           7  11  18  10
5           6  11  15   8

a. Perform an F test at level α = .05.
b. What happens when Tukey's procedure is applied?

25. Consider a single-factor ANOVA experiment in which I = 3, J = 5, x̄1 = 10, x̄2 = 12, and x̄3 = 20. Determine a value of SSE for which f > F.05,2,12, so that H0: μ1 = μ2 = μ3 is rejected, yet when Tukey's procedure is applied none of the μi's differ significantly from each other.

26. Refer to the previous exercise and suppose x̄1 = 10, x̄2 = 15, and x̄3 = 20. Can you now find a value of SSE that produces such a contradiction between the F test and Tukey's procedure?
11.3 More on Single-Factor ANOVA
In this section, we consider some additional issues relating to single-factor ANOVA. These include an alternative description of the model parameters, power and β for the F test, the relationship of the test to procedures previously considered, data transformation, a random effects model, and formulas for the case of unequal sample sizes.

An Alternative Description of the ANOVA Model

The assumptions of single-factor ANOVA can be described succinctly through the model equation

Xij = μi + εij

where εij represents a random deviation from the population or true treatment mean μi. The εij's are assumed to be independent, normally distributed rvs (implying that the Xij's are also) with E(εij) = 0 [so that E(Xij) = μi] and V(εij) = σ² [from which V(Xij) = σ² for every i and j].

An alternative description of single-factor ANOVA will give added insight and suggest appropriate generalizations to models involving more than one factor. Define a parameter μ by

μ = (1/I) Σi μi

and parameters α1, …, αI by

αi = μi − μ    (i = 1, …, I)

Then the treatment mean μi can be written as μ + αi, where μ represents the true average overall response across all populations/treatments, and αi is the effect, measured as a departure from μ, due to the ith treatment. Whereas we initially had I parameters, we now have I + 1: μ, α1, …, αI. However, because Σ αi = 0 (the average departure from the overall mean response is zero), only I of these new parameters are independently determined, so there are as many independent parameters as there were before. In terms of μ and the αi's, the model equation becomes

Xij = μ + αi + εij    (i = 1, …, I;  j = 1, …, J)

In the next two sections, we will develop analogous models for two-factor ANOVA.

The claim that the μi's are identical is equivalent to the equality of the αi's, and because Σ αi = 0, the null hypothesis becomes

H0: α1 = α2 = ⋯ = αI = 0

In Section 11.1, it was stated that MSTr is an unbiased estimator of σ² when H0 is true but otherwise tends to overestimate σ². More precisely, Exercise 14 established that

E(MSTr) = σ² + [J/(I − 1)] Σ (μi − μ)² = σ² + [J/(I − 1)] Σ αi²

When H0 is true, Σ αi² = 0 so E(MSTr) = σ² (MSE is unbiased whether or not H0 is true). If Σ αi² is used as a measure of the extent to which H0 is false, then a larger value of Σ αi² will result in a greater tendency for MSTr to overestimate σ². More generally, formulas for expected mean squares for multifactor models are used to suggest how to form F ratios to test various hypotheses.

Power and β for the F Test

Consider a set of parameter values α1, α2, …, αI for which H0 is not true. The power of the ANOVA F test is the probability that H0 is rejected when that set is the set of true values, and the probability of a type II error is β = 1 – power. One might think that power and β would have to be determined separately for each different configuration of αi's. Fortunately, power for the F test depends on the αi's only through Σ αi², and so it can be simultaneously evaluated for many different alternatives. For example, Σ αi² = 4 for each of the following sets of αi's, so power is identical for all three alternatives:

1. α1 = −1, α2 = −1, α3 = 1, α4 = 1
2. α1 = −√2, α2 = √2, α3 = 0, α4 = 0
3. α1 = −√3, α2 = √(1/3), α3 = √(1/3), α4 = √(1/3)

When H0 is false, the test statistic MSTr/MSE has a noncentral F distribution, a three-parameter family. For one-way ANOVA, the first two parameters are still ν1 = numerator df = I – 1 and ν2 = denominator df = IJ – I, while the noncentrality parameter λ is given by

λ = (J/σ²) Σ αi²

Power is an increasing function of the noncentrality parameter λ (and β is a decreasing function of λ). Thus, for fixed values of σ² and J, the null hypothesis is more likely to be rejected for alternatives far from H0 (large Σ αi²) than for alternatives close to H0. For a fixed value of Σ αi², power increases and β decreases as the sample size J on each treatment increases, whereas power decreases and β increases as the error variance σ² increases (since greater underlying variability makes it more difficult to detect any given departure from H0).

Because hand computation of power, β, and sample size for the F test is quite difficult (as in the case of t tests), software is typically required. For one-way ANOVA, and with the noncentral F parameters specified as above,

β = P(H0 is not rejected when H0 is false) = P(F < F_{α, I−1, IJ−I} when F ~ noncentral F) = noncentral cdf evaluated at F_{α, I−1, IJ−I}

and power = 1 – β. Many statistical packages (including SAS, JMP, and R) have a function that calculates the cumulative area under a noncentral F curve (required inputs are the critical value Fα, the numerator df, the denominator df, and λ), and this area is β.

Example 11.9 The effects of four different heat treatments on yield point (tons/in²) of steel ingots are to be investigated. A total of J = 8 ingots will be cast using each treatment. Suppose the true standard deviation of yield point for any of the four treatments is σ = 1. How likely is it that H0 will not be rejected at level .05 if three of the treatments have the same expected yield point and the other treatment has an expected yield point that is 1 ton/in² greater than the common value of the other three (i.e., the fourth yield is on average 1 standard deviation above those for the first three treatments)?
Suppose that μ1 = μ2 = μ3 and μ4 = μ1 + 1, so μ = (Σ μi)/4 = μ1 + 1/4. Then α1 = μ1 − μ = −1/4, α2 = −1/4, α3 = −1/4, α4 = 3/4, so

λ = (8/1²)[(−1/4)² + (−1/4)² + (−1/4)² + (3/4)²] = 6
The degrees of freedom are ν1 = I – 1 = 3 and ν2 = IJ – I = 28, and the F critical value for the .05 test is F.05,3,28 = 2.947. The probability of a type II error here is β ≈ .54, obtained through the R command pf(2.947, df1=3, df2=28, ncp=6), and power ≈ .46. This power is rather low, so we might decide to increase the value of J. How many ingots of each type would be required to yield β ≤ .05 (about 95% power) for the alternative under consideration? By trying different values of J, it can be verified that J = 24 will meet the requirement, but any smaller J will not. ■

In lieu of directly accessing the noncentral F cdf, some software packages will calculate power when the user specifies all the necessary information. For example, R has a function that allows specification of all I of the means, along with any three among J, σ², α, and power. The function calculates whichever quantity is unspecified. For example, we might wish to calculate the power of the test with α = .05, σ = 1, I = 4, J = 2, μ1 = 100, μ2 = 101, μ3 = 102, and μ4 = 106. The R function calculates power = .904 (and so β = .096).

Minitab v.19 does something rather different. The user is asked to specify the maximum difference between μi's rather than the individual means. For example, in the previous scenario the maximum difference is 106 – 100 = 6. However, power depends not only on this maximum difference but on the values of all the μi's. In this situation Minitab calculates the smallest possible value of power subject to μ1 = 100 and μ4 = 106, which occurs when the two other μi's are both halfway between 100 and 106. This power is .86, so we can say that the power is at least .86 and β is at most .14 when the two most extreme μi's are separated by 6. The software will also determine the necessary common sample size if maximum difference and minimum power are specified.

Relationship of the F Test to the t Test

When the number of populations is just I = 2, the ANOVA F test is testing H0: μ1 = μ2 versus Ha: μ1 ≠ μ2. In this case, a two-tailed, two-sample t test could also be used. In Section 10.2, we mentioned the pooled t test, which requires equal variances, as an alternative to the two-sample t procedure. With a little algebra, it can be shown that the single-factor ANOVA F test and the two-tailed pooled t test are equivalent: for any given data set, the P-values for the two tests will be identical, so the same conclusion will be reached by either test. (The test statistic values are related by f = t².)

The two-sample t test is more flexible than the F test when I = 2 for two reasons. First, it is not based on the assumption that σ1 = σ2; second, it can be used to test Ha: μ1 > μ2 (an upper-tailed t test) or Ha: μ1 < μ2 (a lower-tailed test) as well as Ha: μ1 ≠ μ2.

Single-Factor ANOVA When Sample Sizes Are Unequal

When the sample sizes from each population or treatment are not equal (i.e., an unbalanced study design), let J1, J2, …, JI denote the I sample sizes and let n = Σi Ji denote the total number of observations. The accompanying box gives ANOVA formulas and the test procedure.
Grand mean: X̄·· = (1/n) Σi Σj Xij = (1/n) Σi Ji X̄i· (a weighted average of the sample means)

Fundamental ANOVA Identity: SST = SSTr + SSE, where the three sums of squares and associated dfs are

SSTr = Σi Σj (X̄i· − X̄··)² = Σi Ji (X̄i· − X̄··)²    df = I − 1

SSE = Σi Σj (Xij − X̄i·)² = Σi (Ji − 1)Si²    df = Σi (Ji − 1) = n − I

SST = Σi Σj (Xij − X̄··)² = SSTr + SSE    df = n − 1

Test statistic value: f = MSTr/MSE, where MSTr = SSTr/(I − 1) and MSE = SSE/(n − I)
Rejection region: f ≥ F_{α, I−1, n−I}
P-value: area under the F_{I−1, n−I} curve to the right of f

Verification of the fundamental ANOVA identity proceeds as in the case of equal sample sizes. However, it is somewhat trickier here to show that MSTr/MSE has the F distribution under H0. Validity of the test procedure requires assuming, as before, that the population distributions are all normal with the same variance. The methods described at the end of Section 11.1 for assessing these with the residuals eij = xij − x̄i· can also be applied here.

Example 11.10 The article "On the Development of a New Approach for the Determination of Yield Strength in Mg-Based Alloys" (Light Metal Age, Oct. 1998: 51–53) presented the following data on elastic modulus (GPa) obtained by a new ultrasonic method for specimens of an alloy produced using three different casting processes.

Process             Observations                                      Ji    x̄i·     si
Permanent molding   45.5  45.3  45.4  44.4  44.6  43.9  44.6  44.0     8   44.71   .624
Die casting         44.2  43.9  44.7  44.2  44.0  43.8  44.6  43.1     8   44.06   .501
Plaster molding     46.0  45.9  44.8  46.2  45.1  45.5                 6   45.58   .549
                                                                   n = 22
Let μ1, μ2, and μ3 denote the true average elastic moduli for the three different processes under the given circumstances. The relevant hypotheses are H0: μ1 = μ2 = μ3 versus Ha: at least two of the μi's are different. The test statistic is, of course, F = MSTr/MSE, based on I – 1 = 2 numerator df and n – I = 22 – 3 = 19 denominator df. Relevant quantities include

x̄·· = (1/22)[8(44.71) + 8(44.06) + 6(45.58)] = (1/22)(357.7 + 352.5 + 273.5) = 44.71
SSTr = 8(44.71 − 44.71)² + 8(44.06 − 44.71)² + 6(45.58 − 44.71)² = 7.93
SSE = (8 − 1)(.624)² + (8 − 1)(.501)² + (6 − 1)(.549)² = 6.00

The remaining computations are displayed in the accompanying ANOVA table. Since f = 12.56 > F.001,2,19 = 10.16, the P-value is smaller than .001. Thus the null hypothesis should be rejected at any reasonable significance level; there is compelling evidence for concluding that true average elastic modulus somehow depends on which casting process is used.

Source of variation    df    Sum of squares    Mean square    f
Treatments              2         7.93            3.965      12.56
Error                  19         6.00            .3158
Total                  21        13.93
■
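The table above can be reproduced directly from the summary statistics. The following R lines are a minimal sketch under that setup (the object names are ours, not from the article); they recover SSTr, SSE, f, and the P-value for Example 11.10.

# Sketch: one-way ANOVA from summary statistics with unequal sample sizes.
J    <- c(8, 8, 6)                               # sample sizes
xbar <- c(44.71, 44.06, 45.58)                   # sample means
s    <- c(0.624, 0.501, 0.549)                   # sample SDs
n <- sum(J); I <- length(J)
grand <- sum(J * xbar) / n                       # weighted grand mean, 44.71
SSTr  <- sum(J * (xbar - grand)^2)               # about 7.93
SSE   <- sum((J - 1) * s^2)                      # about 6.00
f     <- (SSTr / (I - 1)) / (SSE / (n - I))      # about 12.56
pf(f, I - 1, n - I, lower.tail = FALSE)          # P-value, well below .001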
Multiple Comparisons When Sample Sizes Are Unequal

There is more controversy among statisticians regarding which multiple comparisons procedure to use when sample sizes are unequal than there is in the case of equal sample sizes. The procedure that we present here is called the Tukey–Kramer procedure for use when the I sample sizes J1, J2, …, JI are reasonably close to each other ("mild imbalance"). It modifies Tukey's method [Equation (11.2)] by using averages of pairs of 1/Ji's in place of 1/J. Let

dij = Q_{α, I, n−I} √[(MSE/2)(1/Ji + 1/Jj)]

Then the probability is approximately 1 – α that

X̄i· − X̄j· − dij ≤ μi − μj ≤ X̄i· − X̄j· + dij

for every i and j (i = 1, …, I and j = 1, …, I) with i ≠ j. The simultaneous confidence level 100(1 – α)% is now only approximate. The underscoring method can still be used, but now the honestly significant difference dij used to decide whether x̄i· and x̄j· can be connected by a line segment will depend on Ji and Jj.

Example 11.11 (Example 11.10 continued) The sample sizes for the elastic modulus data were J1 = 8, J2 = 8, J3 = 6, and I = 3, n – I = 19, MSE = .316. A simultaneous confidence level of approximately 95% requires Q.05,3,19 = 3.59, from which

d12 = 3.59√[(.316/2)(1/8 + 1/8)] = .713        d13 = .771        d23 = .771

Since x̄1· − x̄2· = 44.71 − 44.06 = .65 < d12, μ1 and μ2 are judged not significantly different. The accompanying underscoring scheme shows that μ1 and μ3 differ significantly, as do μ2 and μ3.

2. Die        1. Permanent        3. Plaster
 44.06           44.71               45.58
―――――――――――――――――――――
■
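The Tukey–Kramer cutoffs dij are easy to obtain with R's qtukey function. The following lines are a minimal sketch for Example 11.11 (object names are ours); they reproduce the values .713 and .771 quoted above.

# Sketch: Tukey-Kramer honestly significant differences for Example 11.11.
J <- c(8, 8, 6); I <- 3; n <- sum(J); MSE <- 0.316
Q <- qtukey(0.95, nmeans = I, df = n - I)        # Q_{.05,3,19}, about 3.59
d12 <- Q * sqrt(MSE/2 * (1/J[1] + 1/J[2]))       # about .713
d13 <- Q * sqrt(MSE/2 * (1/J[1] + 1/J[3]))       # about .771
d23 <- Q * sqrt(MSE/2 * (1/J[2] + 1/J[3]))       # about .771
c(d12 = d12, d13 = d13, d23 = d23)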
Data Transformation

The use of ANOVA methods can be invalidated by substantial differences in the variances σ1², …, σI², which until now have been assumed equal with common value σ². It sometimes happens that V(Xij) = σi² = g(μi), a known function of μi (so that when H0 is false, the variances are not equal). For example, if Xij has a Poisson distribution with parameter μi (approximately normal if μi ≥ 10), then σi² = μi, so g(μi) = μi is the known function. In such cases, one can often transform the Xij's to h(Xij)'s so that they will have approximately equal variances (while hopefully leaving the transformed variables approximately normal), and then the F test can be used on the transformed observations.

The basic idea is that, if h(·) is a smooth function, then we can express it approximately using the first terms of a Taylor series: h(Xij) ≈ h(μi) + h′(μi)(Xij – μi). Then V[h(Xij)] ≈ V(Xij)·[h′(μi)]² = g(μi)·[h′(μi)]². We now wish to find the function h(·) for which g(μi)·[h′(μi)]² = c (a constant) for every i. Solving this for h′(μi) and integrating give the following result.

PROPOSITION   If V(Xij) = g(μi), a known function of μi, then a transformation h(Xij) that "stabilizes the variance" so that V[h(Xij)] is approximately the same for each i is given by h(x) ∝ ∫ [g(x)]^(−1/2) dx.

In the Poisson case, g(x) = x, so h(x) should be proportional to ∫ x^(−1/2) dx = 2x^(1/2). Thus Poisson data should be transformed to h(xij) = √xij before the analysis.
A Random Effects Model

The single-factor problems considered so far have all been assumed to be examples of a fixed effects ANOVA model. By this we mean that the chosen levels of the factor under study are the only ones considered relevant by the experimenter. The single-factor fixed effects model is

Xij = μ + αi + εij   with   Σ αi = 0    (11.4)

where the εij's are random and both μ and the αi's are fixed parameters whose values are unknown.

In some single-factor problems, the particular levels studied by the experimenter are chosen, either by design or through sampling, from a large population of levels. For example, to study the effect of using different operators on task performance time for a particular machine, a sample of five operators might be chosen from a large pool of operators. Similarly, the effect of soil pH on the yield of soybean plants might be studied by using soils with four specific pH values chosen from among the many possible pH levels. When the levels used are selected at random from a larger population of possible levels, the factor is said to be random rather than fixed, and the fixed effects model (11.4) is no longer appropriate. An analogous random effects model is obtained by replacing the fixed αi's in (11.4) by random variables. The resulting model description is

Xij = μ + Ai + εij   with   E(Ai) = E(εij) = 0,  V(εij) = σ²,  V(Ai) = σ²_A    (11.5)

with all Ai's and εij's normally distributed and independent of each other.

The condition E(Ai) = 0 in (11.5) is similar to the condition Σ αi = 0 in (11.4); it states that the expected or average effect of the ith level measured as a departure from μ is zero. For the random effects model (11.5), the hypothesis of no effects due to different levels is H0: σ²_A = 0, which says that different levels of the factor contribute nothing to variability of the response. Critically, although the hypotheses in the single-factor fixed and random effects models are different, they are tested in exactly the same way: by forming F = MSTr/MSE and rejecting H0 if f ≥ F_{α, I–1, n–I}. This can be justified intuitively by noting in the random effects model that E(MSE) = σ² (as for fixed effects), whereas

E(MSTr) = σ² + [1/(I − 1)](n − Σ Ji²/n) σ²_A    (11.6)

where again J1, J2, …, JI are the sample sizes and n = Σ Ji. The factor in parentheses on the right side of (11.6) is nonnegative, so once again E(MSTr) = σ² if H0 is true (i.e., if σ²_A = 0) and E(MSTr) > σ² if H0 is false.

Example 11.12 When items are machined out of metal (or plastic or wood) sheets by drills, undesirable burrs form along the edge. The article "Observation of Drilling Burr and Finding out the Condition for Minimum Burr Formation" (Int. J. Manuf. Engr. 2014) reports on a study of the effect that cutting speed has on burr size. Eighteen measurements were made at each of three speeds (20, 25, and 31 m/min) that were randomly selected from the range of possible speeds for the particular equipment used in the experiment. Each measurement is the burr height (mm) from drilling into a low-alloy steel specimen. The data is summarized in the accompanying table along with the derived ANOVA table. The very small f statistic and correspondingly large P-value indicate that H0: σ²_A = 0 should not be rejected. The data does not indicate that cutting speed impacts burr size.
Speed (m/min)     x̄i      si
20               1.558    2.018
25               1.998    2.415
31               1.867    2.148

Source of variation    df       SS        MS        f     P-value
Cutting speed           2      1.837     0.9186    0.19     .828
Error                  51    246.795     4.8931
Total                  53    248.632
■
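Because the random effects F ratio is computed exactly as in the fixed effects case, the table above can be checked from the group summaries. The following R lines are a minimal sketch (object names are ours); mean(xbar) serves as the grand mean here only because the design is balanced.

# Sketch: F test of H0: sigma^2_A = 0 from the summary values of Example 11.12.
J <- 18; I <- 3; n <- I * J
xbar <- c(1.558, 1.998, 1.867)
s    <- c(2.018, 2.415, 2.148)
SSTr <- J * sum((xbar - mean(xbar))^2)           # about 1.84
SSE  <- (J - 1) * sum(s^2)                       # about 246.8
f    <- (SSTr / (I - 1)) / (SSE / (n - I))       # about 0.19
pf(f, I - 1, n - I, lower.tail = FALSE)          # P-value about .83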
Exercises: Section 11.3 (27–44)

27. The following data refers to yield of tomatoes (kg/plot) for four different levels of salinity; salinity level here refers to electrical conductivity (EC), where the chosen levels were EC = 1.6, 3.8, 6.0, and 10.2 nmhos/cm:

EC        Yield
1.6       59.5  53.3  56.8  63.1  58.7
3.8       55.2  59.1  52.8  54.5
6.0       51.7  48.8  53.9  49.0
10.2      44.6  48.5  41.0  47.3  46.1

Use the F test at level α = .05 to test for any differences in true average yield due to the different salinity levels.

28. Apply the modified Tukey's method to the data in the previous exercise to identify significant differences among the μi's.

29. A study at Bentley College, a large business school in the eastern United States, examined students' anxiety levels toward the subject of accounting ("Determinants of Accounting Anxiety in Business Students," J. College Teach. Learn. 2004). A representative sample of 1020 students completed the Accounting Anxiety Rating Scale (AARS) questionnaire; higher scores (out of 100) indicate greater anxiety. Summary data broken down by grade level appears in the accompanying table.

Class level     n      x̄       s
Freshman       86    48.95    9.13
Sophomore     224    51.45   11.29
Junior        225    52.89   11.32
Senior        198    52.92   11.32
Graduate      287    45.55   10.10

a. Comment on the plausibility (or necessity) of the normality and equal variance assumptions for this example.
b. Test at α = .05 the hypothesis that the mean accounting anxiety level for business students varies by class level.
c. Apply the Tukey–Kramer method to identify significant differences among the μi's.

30. The article "From Dark to Light: Skin Color and Wages among African Americans" (J. Human Res. 2007: 701–738) includes the following information on hourly wages ($) for representative samples of the indicated populations.

Skin color       n      x̄       s
White           513   15.94    7.73
Light Black      51   14.42    6.05
Medium Black    177   13.23    6.64
Dark Black      207   11.72    5.60

a. Does population mean hourly wage appear to depend on skin color? Carry out an appropriate test of hypotheses.
b. Identify significant differences among the μi's at the .05 significance level.

31. The authors of the article "Exploring the Impact of Social Media Practices on Wine Sales in US Wineries" (J. Direct Data, Digital Market. Pract. 2016: 272–283) interviewed 361 winery managers. Each manager was asked to report, as a percentage of sales, the impact of social media use on their wine sales. Each winery's social media presence was then categorized by the number of social media platforms it used: 0–2, 3–5, or 6 or more. Summary information appears in the accompanying table. Test to see whether an association exists between social media presence and sales at the .01 significance level.

No. of platforms    n      Mean     SD
Two or fewer       107    12.76    13.00
Three to five      164    17.23    14.07
Six or more         90    21.56    17.35
32. The article "Can You Test Me Now? Equivalence of GMA Tests on Mobile and Non-Mobile Devices" (Int. J. Select Assess. 2017: 61–71) describes a study in which 1769 people were recruited through Amazon Mechanical Turk to take a general mental ability test. (Researchers used the WPT-R test, a variation on the Wonderlic Personnel Test used by the NFL to evaluate players.) Participants were randomly assigned to take the test on one of three electronic devices; the data is summarized below.

Device        n     Mean score     SD
Computer     724       33.98      7.65
Tablet       476       34.40      7.75
Smartphone   569       34.26      7.59

The goal of the study was to determine whether the three devices can be considered "equivalent" for the purpose of administering mental ability tests. Perform an ANOVA at the .05 significance level, and explain what you discover.

33. Lipids provide much of the dietary energy in the bodies of infants and young children. There is a growing interest in the quality of the dietary lipid supply during infancy as a major determinant of growth, visual and neural development, and long-term health. The article "Essential Fat Requirements of Preterm Infants" (Amer. J. Clin. Nutrit. 2000: 245S–250S) reported the following data on total polyunsaturated fats (%) for infants who were randomized to four different feeding regimens: breast milk, corn-oil-based formula, soy-oil-based formula, or soy-and-marine-oil-based formula:

Regimen        Sample size    Sample mean    Sample SD
Breast milk         8             43.0          1.5
CO                 13             42.4          1.3
SO                 17             43.1          1.2
SMO                14             43.5          1.2

a. What assumptions must be made about the four total polyunsaturated fat distributions before carrying out a single-factor ANOVA to decide whether there are any differences in true average fat content?
b. Carry out the test suggested in part (a). What can be said about the P-value?

34. Samples of six different brands of diet/imitation margarine were analyzed to determine the level of physiologically active polyunsaturated fatty acids (PAPFUA, in percentages). The data below is consistent with a study carried out by Consumer Reports:

Brand            PAPFUA
Imperial         14.1  13.6  14.4  14.3
Parkay           12.8  12.5  13.4  13.0  12.3
Blue Bonnet      13.5  13.4  14.1  14.3
Chiffon          13.2  12.7  12.6  13.9
Mazola           16.8  17.2  16.4  17.3  18.0
Fleischmann's    18.1  17.2  18.7  18.4

a. Use ANOVA to test for differences among the true average PAPFUA percentages for the different brands.
b. Compute CIs for all (μi – μj)'s.
c. Mazola and Fleischmann's are corn-based, whereas the others are soybean-based. Compute a CI for

(μ1 + μ2 + μ3 + μ4)/4 − (μ5 + μ6)/2

[Hint: Modify the expression for V(θ̂) that led to (11.3) in the previous section.]

35. Subacromial impingement syndrome (SIS) refers to shoulder pain resulting from a particular impingement of the rotator cuff tendon. The article "Evaluation of the Effectiveness of Three Physiotheraputic Treatments for SIS" (Physiotherapy 2016: 57–63) reports a study in which 99 SIS sufferers were randomly assigned to receive one of three treatments across 20 sessions. The Constant–Murley score (CMS), a
standard measure of shoulder functionality and pain, was administered to each subject before the experiment and again one month post-treatment. The accompanying table summarizes the change in CMS (higher numbers are better) for the subjects.

Treatment         n     Mean     SD
Ultrasound       32     5.1     10.3
Phonophoresis    33     6.4      9.3
Iontophoresis    34     6.5     18.1

Test to see whether the true mean increase in CMS differs across the three different treatments, at the α = .05 significance level.

36. In single-factor ANOVA with sample sizes Ji (i = 1, …, I), show that SSTr = Σi Ji(X̄i· − X̄··)² = Σi Ji X̄i·² − nX̄··², where n = Σi Ji.

37. When sample sizes are equal (Ji = J), the parameters α1, α2, …, αI of the alternative parameterization to the model equation are restricted by Σ αi = 0. For unequal sample sizes, the most natural restriction is Σ Ji αi = 0. Use this to show that

E(MSTr) = σ² + [1/(I − 1)] Σ Ji αi²

What is E(MSTr) when H0 is true? [This expectation is correct if Σ Ji αi = 0 is replaced by the restriction Σ αi = 0, or any other single linear restriction on the αi's used to reduce the model to I independent parameters, but Σ Ji αi = 0 simplifies the algebra and yields natural estimates for the model parameters.]

38. Reconsider Example 11.9 involving an investigation of the effects of different heat treatments on the yield point of steel ingots.
a. If J = 8 and σ = 1, what is β for a level .05 F test when μ1 = μ2, μ3 = μ1 – 1, and μ4 = μ1 + 1?
b. For the alternative μi's of part (a), what value of J is necessary to obtain β = .05?
c. If there are I = 5 heat treatments, J = 10, and σ = 1, what is β for the level .05 F test when four of the μi's are equal and the fifth differs by 1 from the other four?

39. For unequal sample sizes, the noncentrality parameter for F test power calculations is λ = Σ Ji αi²/σ². Referring to Exercise 27, what is the power of the test when μ2 = μ3, μ1 = μ2 – σ, and μ4 = μ2 + σ?

40. The following data on number of cycles to failure (×10⁶) appears in the article "An Experimental Study of Influence of Lubrication Methods on Efficiency and Contact Fatigue Life of Spur Gears" (J. Tribol. 2018).

Lubrication condition    Cycles to failure (×10⁶)
D1     18.79  10.44  14.62  12.53   8.35  14.62
J1     16.70  18.79  12.53  26.10  10.44  18.27
J2     10.44  14.62  16.70  29.23  22.97  18.50
J4      6.26   6.26   6.26   6.26   4.70   5.22

a. Calculate the standard deviation of each sample. Why should we be reluctant to proceed with an analysis of variance?
b. Take the logarithm of the observations. Do these transformed values adhere better to the conditions for ANOVA? Explain.
c. Perform one-way ANOVA on these transformed values.

41. Many studies have been conducted to measure mercury (Hg) levels in fish, but little information exists on Hg concentrations in marine mammals. The article "Factors Influencing Exposure of North American River Otters…to Mercury Relative to a Large-Scale Reservoir in Northern British Columbia, Canada" (Ecotoxicology 2019: 343–353) describes a study in which river otters living in five Canadian reservoirs were tested for Hg concentration (mg/kg). The accompanying summary information was extracted from a graph in the article.

                   Hg concentration     ln(Concentration)
Reservoir    n      Mean      SD         Mean      SD
PW          15       4.1      3.0        1.30      .41
MK           7       9.8      4.5        2.28      .21
NW          20      14.2      6.7        2.62      .18
50K         20      16.1     10.4        2.75      .20
FR          27      18.2      7.1        2.82      .38

Hg concentration distributions at all five reservoirs were heavily right-skewed.
a. Why should we hesitate to perform one-way ANOVA on the Hg concentration data?
b. Consider the transformation y = ln(concentration). Estimated summary information appears above, and taking the logarithm greatly reduces skewness. Apply the ANOVA F test to the transformed data, and report your findings at the .05 significance level.
c. Apply Tukey's method to the log-transformed data, if appropriate.
42. Simplify E(MSTr) for the random effects model when J1 = J2 = ⋯ = JI = J.

43. Suppose that Xij is a binomial variable with parameters n and pi (so it is approximately normal when npi ≥ 10 and nqi ≥ 10). Then since μi = npi, V(Xij) = σi² = npi(1 − pi) = μi(1 − μi/n). How should the Xij's be transformed so as to stabilize the variance? [Hint: g(μi) = μi(1 − μi/n).]

44. In an experiment to compare the quality of four different brands of magnetic tape (A–D), five 5000-foot reels of each brand were selected and the number of cosmetic flaws on each reel was determined.

Brand    No. of flaws
A        10   5  12  14   8
B        14  12  17   9   8
C        13  18  10  15  18
D        17  16  12  22  14

It is believed that the number of flaws has approximately a Poisson distribution for each brand. Analyze the data at level .01 to see whether the expected number of flaws per reel is the same for each brand.
11.4 Two-Factor ANOVA without Replication
In many situations there are two factors of simultaneous interest. For example, a baker might experiment with I = 3 temperatures (325, 350, 375 °F) and J = 2 baking times (45 min, 60 min) to optimize a new cake recipe. Or, an industrial engineer might wish to study the surface roughness resulting from a certain machining process; she might carry out an experiment at various combinations of cutting speed and feed rate. Call the two factors of interest A and B. When factor A has I levels and factor B has J levels, there are IJ different combinations (pairs) of levels of the two factors, each called a treatment. If the data includes multiple observations for each treatment, the study is said to include replication. With Kij = the number of observations on the treatment (factor A, factor B) = (i, j), we focus in this section on the case Kij = 1 (i.e., no replication), so that the data consists of IJ observations. We will first discuss the fixed effects model, in which the only levels of interest for the two factors are those actually represented in the study. The case in which at least one factor is random is considered briefly at the end of the section.
Example 11.13 Is it really as easy to remove marks on fabrics from erasable pens as the word "erasable" might imply? Consider the following data from an experiment to compare three different brands of pens and four different wash treatments with respect to their ability to remove marks on a particular type of fabric (based on "An Assessment of the Effects of Treatment, Time, and Heat on the Removal of Erasable Pen Marks from Cotton and Cotton/Polyester Blend Fabrics," J. Test. Eval. 1991: 394–397). The response variable is a quantitative indicator of overall specimen color change; the lower this value, the more marks were removed.

                          Washing treatment
Brand of pen       1       2       3       4      Average
1                 .97     .48     .48     .46      .598
2                 .77     .14     .22     .25      .345
3                 .67     .39     .57     .19      .455
Average           .803    .337    .423    .300
Is there any difference in the true average amount of color change due either to the different brands of pen or to the different washing treatments? ■

As in single-factor ANOVA, double subscripts are used to identify random variables and observed values. Let

Xij = the random variable denoting the measurement when (factor A, factor B) = (i, j)
xij = the observed value of Xij

The xij's are usually presented in a two-way table in which the ith row contains the observed values when factor A is held at level i and the jth column contains the observed values when factor B is held at level j. In the erasable-pen experiment of Example 11.13, the number of levels of factor A (pen brand) is I = 3, the number of levels of factor B (washing treatment) is J = 4; x13 = .48, x22 = .14, and so on. Whereas in single-factor ANOVA we were interested only in row means and the grand mean, here we are interested also in column means. Let

X̄i· = the average of data obtained when factor A is held at level i = (1/J) Σj Xij
X̄·j = the average of data obtained when factor B is held at level j = (1/I) Σi Xij
X̄·· = the grand mean = (1/IJ) Σi Σj Xij

with observed values x̄i·, x̄·j, and x̄··. Intuitively, to see whether there is any effect due to the levels of factor A, we should compare the observed x̄i·'s with each other, and information about the different levels of factor B should come from the x̄·j's.
A Two-Factor Fixed Effects Model

Proceeding by analogy to single-factor ANOVA, one's first inclination in specifying a model is to let μij = the true average response when (factor A, factor B) = (i, j), giving IJ mean parameters. Then let

Xij = μij + εij

where εij is the random amount by which the observed value differs from its expectation, and the εij's are assumed normal and independent with common variance σ². Unfortunately, there is no valid test procedure for this choice of parameters. The reason is that under the alternative hypothesis of interest, the μij's are free to take on any values whatsoever and σ² can be any value greater than zero, so that there are IJ + 1 freely varying parameters. But there are only IJ observations, so after using each xij as an estimate of μij, there is no way to estimate σ².

To rectify this problem of a model having more parameters than observed values, we must specify a model that is realistic yet involves relatively few parameters. For the no-replication (Kij = 1) scenario, we assume the existence of a parameter μ, I parameters α1, α2, …, αI, and J parameters β1, β2, …, βJ such that

Xij = μ + αi + βj + εij    (i = 1, …, I;  j = 1, …, J)    (11.7)

Taking expectations on both sides of (11.7) yields

μij = μ + αi + βj    (11.8)

The model specified in (11.7) and (11.8) is called an additive model, because each mean response μij is the sum of a true grand mean (μ), an effect due to factor A at level i (αi), and an effect due to factor B at level j (βj). The difference between mean responses for factor A at levels i and i′ when B is held at level j is μij – μi′j. Critically, when the model is additive,

μij – μi′j = (μ + αi + βj) – (μ + αi′ + βj) = αi – αi′

which is independent of the level j of the factor B. A similar result holds for μij – μij′. Thus additivity means that the difference in mean responses for two levels of one of the factors is the same for all levels of the other factor. Figure 11.4a shows a set of mean responses that satisfy the condition of additivity (which implies parallel lines), and Figure 11.4b shows a nonadditive configuration of mean responses.
Figure 11.4 Mean responses (plotted against levels of A, one line per level of B) for two types of model: (a) additive; (b) nonadditive
If additivity does not hold, we say that interaction is present. Factors A and B have an interaction effect on the response variable if the effect of factor A on the (mean) response value depends upon the level of factor B, and vice versa. The foregoing discussion implies that an additive model assumes there is no interaction effect. The graphs in Figure 11.4 are called interaction plots; Figure 11.4a displays data consistent with no interaction effect, whereas Figure 11.4b indicates a potentially strong interaction. When Kij = 1, there is insufficient data to estimate any potential interaction effects, and so the additive model specified by (11.7) and (11.8) must be used. In Section 11.5, where Kij > 1, we will consider models that include interaction effects. Example 11.14 (Example 11.13 continued) When the observed xij’s are plotted in a manner analogous to that of Figure 11.4, we get the result shown in Figure 11.5. Although there is some “crossing over” in the observed xij’s, the configuration is reasonably representative of what would be expected under additivity with just one observation per treatment.
Figure 11.5 Plot of data from Example 11.13 (color change versus washing treatment, one line per pen brand)
■
Expression (11.8) is still not quite our final model description, because the αi's and βj's are not uniquely determined. Following are two different configurations of the αi's and βj's (with μ = 0 for convenience) that yield the same additive μij's.

α1 = 1, α2 = 2, β1 = 1, β2 = 4:    μ11 = 2, μ12 = 5, μ21 = 3, μ22 = 6
α1 = 0, α2 = 1, β1 = 2, β2 = 5:    μ11 = 2, μ12 = 5, μ21 = 3, μ22 = 6
By subtracting any constant c from all αi's and adding c to all βj's, other configurations corresponding to the same additive model are obtained. This nonuniqueness is eliminated by use of the following model, which imposes an extra constraint on the αi's and βj's.

TWO-FACTOR ANOVA ADDITIVE MODEL EQUATION

Xij = μ + αi + βj + εij    (11.9)

where Σi αi = 0, Σj βj = 0, and the εij's are assumed to be independent normal rvs with mean 0 and variance σ².

This is analogous to the alternative choice of parameters for single-factor ANOVA discussed in Section 11.3. It is not difficult to verify that (11.9) is an additive model in which the parameters are uniquely determined. Notice that there are now only I – 1 independently determined αi's and J – 1 independently determined βj's, so including μ Expression (11.9) specifies (I – 1) + (J – 1) + 1 = I + J – 1 parameters.

The interpretation of the parameters of (11.9) is straightforward: μ is the true grand mean response over all levels of both factors; αi is the effect of factor A at level i measured as a deviation from μ; and βj is the effect of factor B at level j. Unbiased (and maximum likelihood) estimators for these parameters are

μ̂ = X̄··        α̂i = X̄i· − X̄··        β̂j = X̄·j − X̄··
Test Procedures

There are two different hypotheses of interest in a two-factor experiment with Kij = 1. The first, denoted by H0A, states that the different levels of factor A have no effect on true average response. The second, denoted by H0B, asserts that there is no factor B effect.

H0A: α1 = α2 = ⋯ = αI = 0   versus   HaA: at least one αi ≠ 0
H0B: β1 = β2 = ⋯ = βJ = 0   versus   HaB: at least one βj ≠ 0

No factor A effect implies that all αi's are equal, so they must all be 0 since they sum to 0, and similarly for the βj's. The analysis now follows closely that for single-factor ANOVA. The relevant sums of squares and associated dfs are given in the accompanying box.
SSA = Σi Σj (X̄i· − X̄··)² = J Σi α̂i²    df = I − 1

SSB = Σi Σj (X̄·j − X̄··)² = I Σj β̂j²    df = J − 1

SSE = Σi Σj (Xij − X̄i· − X̄·j + X̄··)²    df = (I − 1)(J − 1)

SST = Σi Σj (Xij − X̄··)²    df = IJ − 1
The fundamental ANOVA identity is

SST = SSA + SSB + SSE

SSA and SSB take the place of SSTr from single-factor ANOVA. The unwieldy expression for SSE results from replacing μ, αi, and βj in Σ[Xij − (μ + αi + βj)]² by their respective estimators. Error df is IJ – [number of mean parameters estimated] = IJ – (I + J – 1) = (I – 1)(J – 1). Analogous to single-factor ANOVA, total variation is split into a part (SSE) that cannot be attributed to the truth or falsity of H0A and H0B (i.e., unexplained variation) and two parts that can be explained by possible falsity of the two null hypotheses.

Forming F ratios as in single-factor ANOVA, it can be shown as in Section 11.1 that if H0A is true, the corresponding ratio has an F distribution with numerator df = I – 1 and denominator df = (I – 1)(J – 1); an analogous result applies when testing H0B.

Hypotheses          Test Statistic Value     Rejection Region
H0A versus HaA      fA = MSA/MSE             fA ≥ F_{α, I−1, (I−1)(J−1)}
H0B versus HaB      fB = MSB/MSE             fB ≥ F_{α, J−1, (I−1)(J−1)}
The corresponding P-values for the two tests are the areas under the associated F curves to the right of fA and fB, respectively.

Example 11.15 (Example 11.13 continued) The x̄i·'s (row means) and x̄·j's (column means) for the color change data are displayed along the right and bottom margins of the data table in Example 11.13. In addition, the grand mean is x̄·· = .466. Table 11.7 summarizes further calculations.

Table 11.7 ANOVA table for Example 11.15
Source of Variation          df                     Sum of squares    Mean square      f            P-value
Factor A (pen brand)         I − 1 = 2              SSA = .1282       MSA = .0641      fA = 4.43     .066
Factor B (wash treatment)    J − 1 = 3              SSB = .4797       MSB = .1599      fB = 11.05    .007
Error                        (I − 1)(J − 1) = 6     SSE = .0868       MSE = .01447
Total                        IJ − 1 = 11            SST = .6947
The critical value for testing H0A at level of significance .05 is F.05,2,6 = 5.14. Since 4.43 < 5.14, H0A cannot be rejected at significance level .05. Based on this (small) data set, we cannot conclude that true average color change depends on brand of pen. Because F.05,3,6 = 4.76 and 11.05 ≥ 4.76, H0B is rejected at significance level .05 in favor of the assertion that color change varies with washing treatment. The same conclusions result from consideration of the P-values: .066 > .05 and .007 ≤ .05.

The plausibility of the normality and constant variance assumptions can be investigated graphically by first calculating the predicted values (also called fitted values) x̂ij and the residuals (the differences between the observations and predicted values) eij:

x̂ij = μ̂ + α̂i + β̂j = x̄·· + (x̄i· − x̄··) + (x̄·j − x̄··) = x̄i· + x̄·j − x̄··
eij = xij − x̂ij = xij − x̄i· − x̄·j + x̄··

We can check the normality assumption with a normal plot of the residuals, Figure 11.6a, and then the constant variance assumption with a plot of the residuals against the fitted values, Figure 11.6b.
(a) Normal probability plot of the residuals    (b) Residuals versus the fitted values
Figure 11.6 Residual plots for Example 11.15
The normal probability plot is reasonably straight, so there is no reason to question normality for this data set. In the plot of the residuals against the fitted values, look for differences in vertical spread as we move horizontally across the graph. For example, if there were a narrow range for small fitted values and a wide range for high fitted values, this would suggest that the variance is higher for larger responses (this happens often, and it can sometimes be cured by transforming via logarithms). No such problem occurs here, so there is no evidence against the constant variance assumption, either. ■
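The entire analysis of Examples 11.13–11.15 can be carried out with a few lines of R. The following sketch uses object and factor names of our own choosing; the summary reproduces Table 11.7, and the plot call produces diagnostics analogous to Figure 11.6.

# Sketch: additive two-factor ANOVA for the erasable-pen data (Example 11.13).
pens <- data.frame(
  change    = c(.97, .48, .48, .46,     # brand 1, washing treatments 1-4
                .77, .14, .22, .25,     # brand 2
                .67, .39, .57, .19),    # brand 3
  brand     = factor(rep(1:3, each = 4)),
  treatment = factor(rep(1:4, times = 3))
)
fit <- aov(change ~ brand + treatment, data = pens)
summary(fit)            # reproduces Table 11.7: fA = 4.43 (P = .066), fB = 11.05 (P = .007)
plot(fit, which = 1:2)  # residuals vs fitted values and a normal Q-Q plot of the residuals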
Expected Mean Squares

The plausibility of using the F tests just described is demonstrated by determining the expected mean squares. After some tedious algebra,

E(MSE) = σ²   (when the model is additive)
E(MSA) = σ² + [J/(I − 1)] Σi αi²
E(MSB) = σ² + [I/(J − 1)] Σj βj²

When H0A is true, MSA is an unbiased estimator of σ², so FA is a ratio of two unbiased estimators of σ². When H0A is false, MSA tends to overestimate σ², so H0A should be rejected when the ratio FA is too large. Similar comments apply to MSB and H0B.

Multiple Comparisons

When either H0A or H0B has been rejected, Tukey's procedure can be used to identify significant differences between the levels of the factor under investigation. The steps in the analysis are identical to those for a single-factor ANOVA:

1. For comparing levels of factor A, obtain Q_{α, I, (I−1)(J−1)}. For comparing levels of factor B, obtain Q_{α, J, (I−1)(J−1)}.
2. Compute Tukey's honestly significant difference:
   d = Q · (estimated SD of the sample means being compared)
     = Q_{α, I, (I−1)(J−1)} √(MSE/J) for factor A comparisons
     = Q_{α, J, (I−1)(J−1)} √(MSE/I) for factor B comparisons
   (because, e.g., the standard deviation of X̄i· = (1/J) Σj Xij is σ/√J).
3. Arrange the sample means in increasing order, then underscore those pairs differing by less than d. Pairs not underscored by the same line correspond to significantly different levels of the given factor.

Example 11.16 (Example 11.15 continued) Identification of significant differences among the four washing treatments requires Q.05,4,6 = 4.90 and d = 4.90√(.01447/3) = .340. The four factor B sample means (column averages) are now listed in increasing order, and any pair differing by less than .340 is underscored by a line segment:

x̄·4       x̄·2       x̄·3       x̄·1
.300      .337      .423      .803
――――――――――――――――
Washing treatment 1 is significantly worse than the other three treatments, but no other significant differences are identified. In particular, it is not apparent which among treatments 2, 3, and 4 is best at removing marks. Notice that Tukey’s HSD is not required for comparing the levels of factor A, since the ANOVA F test for that factor did not reveal any statistically significant effect. ■
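The cutoff d used above is easy to obtain in R with qtukey. The following lines are a minimal sketch (names are ours); note that each column (factor B) mean averages I = 3 observations, which is why √(MSE/I) appears.

# Sketch: Tukey's HSD for the washing-treatment (factor B) comparisons of Example 11.16.
MSE <- 0.01447; I <- 3; J <- 4
Q <- qtukey(0.95, nmeans = J, df = (I - 1) * (J - 1))   # Q_{.05,4,6}, about 4.90
d <- Q * sqrt(MSE / I)                                  # about .340
d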
Randomized Block Experiments In using single-factor ANOVA to test for the presence of effects due to the I different treatments under study, once the IJ subjects or experimental units have been chosen, treatments should be allocated in a completely random fashion. That is, J subjects should be chosen at random for the first treatment, then another sample of J chosen at random from the remaining subjects for the second treatment, and so on. It frequently happens, though, that subjects or experimental units exhibit differences with respect to other characteristics that may affect the observed responses. For example, some patients might be healthier than others. When this is the case, the presence or absence of a significant F value may be due to these other differences rather than to the presence or absence of factor effects. This was the reason for introducing paired experiments in Chapter 10. The generalization of the paired experiment to I > 2 is called a randomized block design. An extraneous factor, “blocks,” is constructed by dividing the IJ units into J groups (with I units in each group) in such a way that within each block, the I units are homogeneous with respect to other factors thought to affect the responses. Then within each homogeneous block, the I treatments are randomly assigned to the I units or subjects in the block. Example 11.17 A consumer product-testing organization wished to compare the annual power consumption for five different brands of dehumidifier. Because power consumption depends on the prevailing humidity level, it was decided to monitor each brand at four different levels ranging from moderate to heavy humidity (thus blocking on humidity level). Within each humidity level, brands were randomly assigned to the five selected locations. The resulting amount of power consumption (annual kWh) appears in Table 11.8, and the ANOVA calculations are summarized in Table 11.9.
Table 11.8 Power consumption data for Example 11.17

                          Blocks (humidity level)
Treatments (brands)        1       2       3       4        x̄i·
1                         685     792     838     875      797.50
2                         722     806     893     953      843.50
3                         733     802     880     941      839.00
4                         811     888     952    1005      914.00
5                         828     920     978    1023      937.25
Table 11.9 ANOVA table for Example 11.17
Source of variation     df    Sum of squares    Mean square       f
Treatments (brands)      4      53,231.00        13,307.75     fA = 95.57
Blocks                   3     116,217.75        38,739.25     fB = 278.20
Error                   12        1671.00           139.25
Total                   19     171,119.75

Since fA = 95.57 ≥ F.05,4,12 = 3.26, H0 is rejected in favor of Ha. We conclude that power consumption does depend on the brand of dehumidifier. To identify significantly different brands, we use Tukey's procedure; Q.05,5,12 = 4.51 and d = 4.51√(139.25/4) = 26.6.
x̄1·        x̄3·        x̄2·        x̄4·        x̄5·
797.50     839.00     843.50     914.00     937.25
           ――――――――――            ――――――――――
The underscoring indicates that the brands can be divided into three groups with respect to power consumption. Because the blocking factor is of secondary interest, F.05,3,12 is not needed, though the computed value of FB is clearly highly significant. Figure 11.7 shows SAS output for this data. Notice that in the first part of the ANOVA table, the sums of squares (SS's) for treatments (brands) and blocks (humidity levels) are combined into a single "model" SS.

Analysis of Variance Procedure
Dependent Variable: POWERUSE

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               7      169448.750       24206.964      173.84    0.0001
Error              12        1671.000         139.250
Corrected Total    19      171119.750

R-Square    C.V.        Root MSE    POWERUSE Mean
0.990235    1.362242    11.8004     866.25000

Source      DF    Anova SS       Mean Square    F Value    Pr > F
BRAND        4    53231.000      13307.750        95.57    0.0001
HUMIDITY     3    116217.750     38739.250       278.20    0.0001

Alpha = 0.05   df = 12   MSE = 139.25
Critical Value of Studentized Range = 4.508
Minimum Significant Difference = 26.597
Means with the same letter are not significantly different.

Tukey Grouping    Mean       N    BRAND
     A            937.250    4    5
     A            914.000    4    4
     B            843.500    4    2
     B            839.000    4    3
     C            797.500    4    1

Figure 11.7 SAS output for consumption data
■
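The same analysis can be reproduced in R. The following lines are a minimal sketch (data frame and variable names are ours); summary reproduces Table 11.9, and TukeyHSD gives the pairwise brand comparisons underlying the grouping shown in Figure 11.7.

# Sketch: randomized block analysis of the dehumidifier data (Example 11.17).
power <- data.frame(
  kwh      = c(685, 792, 838, 875,     # brand 1 at humidity levels 1-4
               722, 806, 893, 953,     # brand 2
               733, 802, 880, 941,     # brand 3
               811, 888, 952, 1005,    # brand 4
               828, 920, 978, 1023),   # brand 5
  brand    = factor(rep(1:5, each = 4)),
  humidity = factor(rep(1:4, times = 5))
)
fit <- aov(kwh ~ brand + humidity, data = power)
summary(fit)                    # reproduces Table 11.9
TukeyHSD(fit, which = "brand")  # simultaneous CIs for all brand differences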
In some experimental situations in which treatments are to be applied to subjects, a single subject can receive all I of the treatments. Blocking is then often done on the subjects themselves to control for variability between subjects, typically in random order; each subject is then said to act as its own control. Social scientists sometimes refer to such experiments as repeated-measures designs. The “units” within a block are then the different “instances” of treatment application. Similarly, blocks are often taken as different time periods, locations, or observers.
In most randomized block experiments in which subjects serve as blocks, the subjects actually participating in the experiment are selected from a large population. The subjects then contribute random rather than fixed effects. This does not impact the procedure for comparing treatments when Kij = 1 (one observation per "cell," as in this section), but the procedure is altered if Kij > 1. We will shortly consider two-factor models in which effects are random.

More on Blocking

When I = 2, either the F test above or the paired differences t test can be used to analyze the data. The resulting conclusion will not depend on which procedure is used, since T² = F and t²_{α/2,ν} = F_{α,1,ν}. Just as with pairing, blocking entails both a potential gain and a potential loss in precision. If there is a great deal of heterogeneity in experimental units, the value of the variance parameter σ² in the one-way model will be large. The effect of blocking is to filter out the variation represented by σ² in the two-way model appropriate for a randomized block experiment. Other things being equal, a smaller value of σ² results in a test that is more likely to detect departures from H0 (i.e., a test with greater power). However, other things are not equal here, since the single-factor F test is based on I(J – 1) degrees of freedom (df) for error, whereas the two-factor F test is based on (I – 1)(J – 1) df for error. Fewer degrees of freedom for error results in a decrease in power, essentially because the denominator estimator of σ² is not as precise. This loss in degrees of freedom can be especially serious if the experimenter can afford only a small number of observations. Nevertheless, if it appears that blocking will significantly reduce variability, it is probably worth the loss in degrees of freedom.

Models for Random Effects

In many experiments, the actual levels of a factor used in the experiment, rather than being the only ones of interest to the experimenter, have been selected from a much larger population of possible levels of the factor. In a two-factor situation, when this is the case for both factors, a random effects model is appropriate. The case in which the levels of one factor are the only ones of interest while the levels of the other factor are selected from a population of levels leads to a mixed effects model. The two-factor random effects model when Kij = 1 is

Xij = μ + Ai + Bj + εij    (i = 1, …, I;  j = 1, …, J)

where the Ai's, Bj's, and εij's are all independent, normally distributed rvs with mean 0 and variances σ²_A, σ²_B, and σ², respectively. The hypotheses of interest are then H0A: σ²_A = 0 (level of factor A does not contribute to variation in the response) versus HaA: σ²_A > 0 and H0B: σ²_B = 0 versus HaB: σ²_B > 0. Whereas E(MSE) = σ² as before, the expected mean squares for factors A and B are now

E(MSA) = σ² + Jσ²_A   and   E(MSB) = σ² + Iσ²_B

Thus when H0A (H0B) is true, FA (FB) is still a ratio of two unbiased estimators of σ². It can be shown that a test with significance level α for H0A versus HaA still rejects H0A if fA ≥ F_{α, I−1, (I−1)(J−1)}, and, similarly, the same procedure as before is used to decide between H0B and HaB. For the case in which factor A is fixed and factor B is random, the mixed model is

Xij = μ + αi + Bj + εij    (i = 1, …, I;  j = 1, …, J)

where Σ αi = 0, and the Bj's and εij's are all independent, normally distributed rvs with mean 0 and variances σ²_B and σ², respectively. Now the two null hypotheses are

H0A: α1 = ⋯ = αI = 0   and   H0B: σ²_B = 0

Expected mean squares are

E(MSA) = σ² + [J/(I − 1)] Σ αi²   and   E(MSB) = σ² + Iσ²_B

The test procedures for H0A versus HaA and H0B versus HaB are exactly as before. For example, in the analysis of the color change data in Example 11.13, if the four wash treatments were randomly selected, then because fB = 11.05 > F.05,3,6 = 4.76, H0B: σ²_B = 0 is rejected in favor of HaB: σ²_B > 0. Summarizing, when Kij = 1, although the hypotheses and expected mean squares differ from the case of both effects fixed, the test procedures are identical.

Exercises: Section 11.4 (45–60)

45. The number of miles of useful tread wear (in 1000's) was determined for tires of each of five different makes of subcompact car (factor A, with I = 5) in combination with each of four different brands of radial tires (factor B, with J = 4), resulting in IJ = 20 observations. The values SSA = 30.6, SSB = 44.1, and SSE = 59.2 were then computed. Assume that an additive model is appropriate.
a. Test H0: α1 = α2 = α3 = α4 = α5 = 0 (no differences in true average tire lifetime due to makes of cars) versus Ha: at least one αi ≠ 0 using a level .05 test.
b. Test H0: β1 = β2 = β3 = β4 = 0 (no differences in true average tire lifetime due to brands of tires) versus Ha: at least one βj ≠ 0 using a level .05 test.

46. Four different coatings are being considered for corrosion protection of metal pipe. The pipe will be buried in three different types of soil. To investigate whether the amount of corrosion depends either on the coating or on the type of soil, 12 pieces of pipe are selected. Each piece is coated with one of the four coatings and buried in one of the three types of soil for a fixed time, after which the amount of corrosion (depth of
maximum pits, in .0001 in.) is determined. The depths are shown in this table: Soil type (B)
Coating (A)
1 2 3 4
1
2
3
64 53 47 51
49 51 45 43
50 48 50 52
a. Assuming the validity of the additive model, carry out the ANOVA analysis using an ANOVA table to see whether the amount of corrosion depends on either the type of coating used or the type of soil. Use a = .05. ^ ;b ^ ^ ^; ^ b. Compute l a1 ; ^ a2 ; ^ a3 ; ^ a4 ; b 1 2 and b3 . 47. The article “Step-Counting Accuracy of Activity Monitors in Persons with Down Syndrome” (J. Intellect. Disabil. Res. 2019: 21–30) describes a study in which 17 people with DS walked for a set time period with multiple step-counting devices attached to them. (Walking is a common form of exercise for people with DS, and clinicians want to insure that step counts for them are accurate.) The accompanying table summarizes the different step-counting methods and the number of steps recorded for each
684
11
participant. (LFE refers to a low frequency extension filter applied to the device.) Step-counting method Hand tally Pedometer Hip accelerometer Hip accelerometer + LFE Wrist accelerometer Wrist accelerometer + LFE
Mean
SD
668 557 466 606 449 579
70 202 159 93 89 85
Sums of squares consistent with this and other information in the article include SS(Method) = 596,748, SSE = 987,380, and SST = 2,113,228. a. Determine the sum of squares associated with the blocking variable (subject), and then construct an ANOVA table. b. Assuming that model assumptions are plausible, test the null hypothesis of “no method effect” at the .01 significance level. c. Apply Tukey’s procedure to these six stepcounting methods. Are any of the five device-based methods not significantly different from hand tally (considered by the researchers to be the most correct)? 48. In an experiment to see whether the amount of coverage of light-blue interior latex paint depends either on the brand of paint or on the brand of roller used, 1 gallon of each of four brands of paint was applied using each of three brands of roller, resulting in the following data (number of square feet covered). Roller brand
Paint brand
1 2 3 4
1
2
3
454 446 439 444
446 444 442 437
451 447 444 443
a. Construct the ANOVA table. [Hint: The computations can be expedited by subtracting 400 (or any other convenient number) from each observation. This does not affect the final results.] b. Check the normality and constant variance assumptions graphically.
The Analysis of Variance
c. State and test hypotheses appropriate for deciding whether paint brand has any effect on coverage. Use a = .05. d. Repeat part (c) for brand of roller. e. Use Tukey’s method to identify significant differences among brands. Is there one brand that seems clearly preferable to the others? 49. The following data is presented in the article “Influence of Cutting Parameters on Drill Bit Temperature” (Ind. Lubr. Tribol. 2007: 186–193); values in the table are temperatures in °C. Feed rate
Spindle speed
1 2 3
1
2
3
275 380 425
325 415 420
365 420 405
a. Construct an ANOVA table from this data. b. Test whether spindle speed impacts drill bit temperature at the .05 significance level. c. Test whether feed rate impacts drill bit temperature at the .05 significance level. 50. A particular county employs three assessors who are responsible for determining the value of residential property in the county. To see whether these assessors differ systematically in their assessments, 5 houses are selected, and each assessor is asked to determine the market value of each house. With factor A denoting assessors (I = 3) and factor B denoting houses (J = 5), suppose SSA = 11.7, SSB = 113.5, and SSE = 25.6. a. Test H0: a1 ¼ a2 ¼ a3 ¼ 0 at level .05. (H0 states that there are no systematic differences among assessors.) b. Explain why a randomized block experiment with only 5 houses was used rather than a one-way ANOVA experiment involving a total of 15 different houses with each assessor asked to assess 5 different houses (a different group of 5 for each assessor).
11.4
Two-Factor ANOVA without Replication
685
51. In a 2018 class activity, 54 students measured how much time (sec) it took to melt each of the following in their mouths: (1) a butterscotch chip, (2) a chocolate chip, (3) a white chip (yes, white is a chip flavor). Each student rolled a die to determine the order in which to melt the chips. a. Why was it important to randomize the chip order? b. Besides not having to recruit as many students to get the same number of observations, what is the advantage of using blocking here, versus randomly assigning one chip to each student? c. Summary quantities include x1 ¼ 88:15; x2 ¼ 60:49; x3 ¼ 72:35; SS(Subject) = 135,833, and SSE = 31,506. Construct an ANOVA table and test at significance level .01 to see whether mean melting time varies by chip flavor. d. Judging from the F ratio for subjects, do you think that blocking on subjects was effective in this experiment? Explain. 52. The efficiency (%) of a 0.7 L Daihatsu diesel engine at 3100 rpm was determined at various torques (N-m) and coolant temperatures (°C), resulting in the following data kindly provided by the study’s authors (“Performance of a Diesel Engine at High Coolant Temperatures, J. Energy Resour. Technol. 2017). Coolant Temp. (°C)
Torque
12 15 18 21 24
90
100
125
150
175
200
24.07 26.52 28.50 28.61 28.47
23.81 26.00 28.23 29.96 28.34
23.55 26.06 26.67 28.38 28.20
22.84 25.33 26.42 27.16 26.55
24.34 26.92 28.94 29.35 28.87
24.62 27.08 28.74 29.16 27.27
a. Test at the .01 significance level whether mean engine efficiency differs with torque. b. Test at the .01 significance level whether mean engine efficiency differs with coolant temperature. c. Apply Tukey’s procedure as appropriate to the results of (a) and (b).
53. An experiment was conducted to assess the effect of current and voltage on the tensile strength (ksi) of welds made using a tungsten inert gas (TIG) welding tool, which resulted in the following data. Voltage
Current
130 135 140
10
12
14
197.62 185.90 203.23
200.35 215.56 174.81
199.40 179.36 194.47
Use two-factor ANOVA to determine whether current or voltage impacts the tensile strength of welds under these experimental conditions at the .10 significance level. (Data is from “To Investigate the [E]ffect of Process Parameters on Mechanical Properties of TIG Welded 6351 Aluminum Alloy by ANOVA,” GE-Int. J. Engr. Res. 2014: 50–62.) 54. The article “Effect of Face Value on Product Valuation in Foreign Currencies” (J. Consum. Res. 2002) describes a classroom experiment involving 97 business students to see whether they could adjust for exchange rates when deciding how much to spend a product. The students acted as buyers in a mock World Garment Expo at which each would be purchasing silk ties from six different nations. They were provided pictures of the ties and the exchange rates for each national currency into US$; the pictures were randomly permuted to reduce any perceived quality differences. Students then had to report how much, in the foreign currencies, they would pay for one silk tie. The average prices students were willing to pay, converted back into dollars, appear in the accompanying table; exchange rates are the number of foreign currency units (e.g., Norwegian krone or Japanese yen) equaling $1. Country (Exch. rate)
Mean
SD
Norway (9.5) Luxembourg (48) Japan (110) Korea (1100) Romania (24,500) Turkey (685,000)
$15.85 $15.45 $15.91 $12.42 $11.33 $10.77
$15.36 $13.47 $11.26 $10.17 $12.05 $9.01
686
11
Relevant sums of squares include SS(Country) = 2752, SSE = 45,086, and SST = 86,653. a. What experimental design was used in this study? What was the advantage of using this method? b. Construct an ANOVA table from the information provided, then test the null hypothesis of “no currency exchange effect” at the .01 level. c. The exchange rates vary by orders of magnitude (this was deliberate). As the exchange rate increases, what happens to the average amount students are willing to pay for a silk tie in that currency? (The article’s authors note that “th[is] evidence is against the common wisdom that when the home currency is perceived to go a long way in foreign currency terms, the foreign currency will be treated as play money and that people will overspend.”) 55. The strength of concrete used in commercial construction tends to vary from one batch to another. Consequently, small test cylinders of concrete sampled from a batch are “cured” for periods up to about 28 days in temperature- and moisture-controlled environments before strength measurements are made. Concrete is then “bought and sold on the basis of strength test cylinders” (ASTM C 31 Standard Test Method for Making and Curing Concrete Test Specimens in the Field). The accompanying data resulted from an experiment carried out to compare three different curing methods with respect to compressive strength (MPa). Analyze this data. Batch 1 2 3 4 5 6
Method A
Method B
Method C
30.7 29.1 30.0 31.9 30.5 26.9
33.7 30.6 32.2 34.6 33.0 29.3
30.5 32.6 30.5 33.5 32.4 27.8 (continued)
Batch 7 8 9 10
The Analysis of Variance
Method A
Method B
Method C
28.2 32.4 26.6 28.6
28.4 32.4 29.5 29.4
30.7 33.6 29.2 33.2
56. Check the normality and constant variance assumptions graphically for the data of Example 11.17. 57. Suppose that in the experiment described in Exercise 50 the five houses had actually been selected at random from among those of a certain age and size, so that factor B is random rather than fixed. Test H0 : r2B ¼ 0 versus Ha : r2B [ 0 using a level .01 test. 58. a. Show that a constant d can be added to (or subtracted from) each xij without affecting any of the ANOVA sums of squares. b. Suppose that each xij is multiplied by a nonzero constant c. How does this affect the ANOVA sums of squares? How does this affect the values of the F statistics FA and FB ? What effect does “coding” the data by yij = cxij + d have on the conclusions resulting from the ANOVA procedures? 59. Use the fact that E Xij ¼ l þ ai þ bj P P with ai ¼ bj ¼ 0 to show that ai ¼ X i X is E X i X ¼ ai , so that ^ an unbiased estimator for ai. 60. Power for the F test in two-factor ANOVA is calculated using a similar method to the one shown in Section 11.3. For fixed values of a1, a2, …, aI, power calculations are based on the noncentral F distributions with parameters v1 = I – 1, v2 = (I – 1)(J – 1), and P noncentrality parameter k ¼ J a2i =r2 . a. For the corrosion experiment described in Exercise 46, determine power when a1 = 4, a2 = 0, a3 = a4 = −2, and r = 4. Repeat for a1 = 6, a2 = 0, a3 = a4 = −3, and r = 4. b. By symmetry, what is the power for the test of H0B versus HaB in Example 11.13 when b1 = .3, b2 = b3 = b4 = –.1, and r = .3?
11.5
11.5
Two-Factor ANOVA with Replication
687
Two-Factor ANOVA with Replication
In Section 11.4, we analyzed data from a two-factor experiment in which there was one observation for each of the IJ combinations of levels of the two factors (i.e., no replication). To obtain valid test procedures in that situation, the lij’s were assumed to have an additive structure, meaning that the difference in true average responses for any two levels of the factors is the same for each level of the other factor. This was shown in Figure 11.4a, in which the lines connecting true average responses are parallel. Figure 11.4b depicted a set of true average responses that does not have additive structure. The lines connecting these lij’s are not parallel, which means that the difference in true mean responses for different levels of one factor does depend on the level of the other factor—what’s known as an interaction. When Kij > 1 for at least one (i, j) pair, we can estimate the interaction effect and formally test for whether interaction is present. In specifying the appropriate model and deriving test procedures, we will focus on the case Kij = K > 1, so the number of observations per “cell” (i.e., for each combination of levels) is constant. That is, throughout this section we will assume a balanced study design. A Two-Factor Model With Interaction Again lij will denote the true mean response when factor A is at level i and factor B is at level j. Expressions (11.7)–(11.9) show the development of a model equation that assumes additivity (i.e., no interaction). To extend this model, first let l¼
1 XX l IJ i j ij
li ¼
1X l J j ij
lj ¼
1X l I i ij
ð11:10Þ
Thus l is the expected response averaged over all levels of both factors (the true grand mean), li is the expected response averaged over levels of factor B when factor A is held at level i, and similarly for lj . Now define three sets of parameters by ai ¼ li l ¼ the effect of factor A at level i bj ¼ lj l ¼ the effect of factor B at level j
ð11:11Þ
cij ¼ lij ðl þ ai þ bj Þ from which lij ¼ l þ ai þ bj þ cij The ai’s and bj’s are the same as those from Section 11.4. The ai’s are called the main effects for factor A, and the bj’s are the main effects for factor B. The new parameters, the cij’s, measure the difference between the true treatment means lij and the means assumed under the additive model (11.8). The cij’s are referred to as the interaction parameters, and the model is additive if and only if all cij’s = 0.
688
11
The Analysis of Variance
P P Although there are I ai’s, J bj’s, and IJ cij’s in addition to l, the conditions ai ¼ 0, bj ¼ 0, P P j cij ¼ 0 for every i, and i cij ¼ 0 for any j—all true by virtue of (11.10) and (11.11)—imply that only IJ of these new parameters are independently determined: l, I – 1 of the ai’s, J – 1 of the bj’s, and (I – 1)(J – 1) of the cij’s. We now must use triple subscripts for both random variables and observed values: Xijk and xijk denote the kth observation (replication) when factor A is at level i and factor B is at level j. TWO-FACTOR ANOVA GENERAL MODEL EQUATION
Xijk ¼ l þ ai þ bj þ cij þ eijk i ¼ 1; . . .; I;
j ¼ 1; . . .; J;
ð11:12Þ
k ¼ 1; . . .; K
where the eijk’s are independent and normally distributed, each with mean 0 and variance r2. Example 11.18 Three different varieties of tomato (Harvester, Ife No. 1, and Pusa Early Dwarf) and four different plant densities (10, 20, 30, and 40 thousand plants per hectare) are being considered for planting in a particular region. To see whether either variety or plant density affects yield, each combination of variety and plant density is used in three different plots, resulting in the data on yields in Table 11.10. Table 11.10 Yield data for Example 11.18 Planting density Variety H Ife P
10,000 10.5 8.1 16.1
9.2 8.6 15.3
20,000 7.9 10.1 17.5
12.8 12.7 16.6
11.2 13.7 19.2
30,000 13.3 11.5 18.5
12.1 14.4 20.8
12.6 15.4 18.0
40,000 14.0 13.7 21.0
10.8 11.3 18.4
9.1 12.5 18.9
12.5 14.5 17.2
Here, I = 3, J = 4, and K = 3, for a total of IJK = 36 observations. If we identify factor A = variety and B = density, then the observations across the first row of Table 11.10 are x111 = 10.5, x112 = 9.2, x113 = 7.9, x121 = 12.8, and so on. Some of the parameters specified in the model equation (11.12) include l = true average yield of all tomato plants in this population l1 = true average yield of all Harvester plants (i = 1) in this population b2 = the effect of 20,000 plants/hectare density (j = 2) on average yield To check the normality and constant variance assumptions, we can make plots similar to those of Section 11.4. Define the predicted/fitted values to be the cell means, ^xijk ¼ xij , so the residuals are eijk ¼ xijk ^xijk ¼ xijk xij . For example, the mean of the three observations in the top-left cell of Table 11.10 is x11 ¼ ð10:5 þ 9:2 þ 7:9Þ=3 ¼ 9:2, and the residual for the very first observation is e111 ¼ x111 x11 ¼ 10:59:2 ¼ 1:3. The normal probability plot of the 36 residuals is Figure 11.8a, and the plot of the residuals against the fitted values is Figure 11.8b. The normal plot is sufficiently straight that there should be no concern about the normality assumption. The plot of residuals against predicted values has a fairly uniform vertical spread, so there is no cause for concern about the constant variance assumption.
11.5
Two-Factor ANOVA with Replication
689
a
b Normal Probability Plot of the Residuals (response is Yield)
Residuals Versus the Fitted Values (response is Yield)
2
95 90 80 70 60 50 40 30 20 10 5
1 Residual
Percent
99
0 −1 −2
1 −3
−2
−1
0
1
2
3
10
Residual
12
14
16
18
20
Fitted Value
Figure 11.8 Plots from Minitab to verify assumptions for Example 11.18
■
Sums of Squares and Test Procedures There are now three relevant pairs of hypotheses: H0AB : cij ¼ 0 for all i; j versus
HaAB : at least one cij 6¼ 0
H0A : a1 ¼ a2 ¼ ¼ aI ¼ 0
versus HaA : at least one ai 6¼ 0
H0B : b1 ¼ b2 ¼ ¼ bJ ¼ 0
versus
HaB : at least one bj 6¼ 0
The no-interaction hypothesis H0AB is usually tested first. If H0AB is not rejected, then the other two hypotheses can be tested to see whether the main effects are significant. But once H0AB is rejected, we believe that the effect of factor A at any particular level depends on the level of B (and vice versa). It then does not make sense to test H0A or H0B. In this case, an interaction plot similar to that of Figure 11.4b is helpful in visualizing the way the factors interact. To test the hypotheses of interest, we again define sums of squares and indicate their corresponding degrees of freedom. Again, a dot in place of a subscript means that we have summed over all values of that subscript, and a horizontal bar denotes averaging. So, for example, X ij denotes the mean of the K observations in the (i, j)th cell of the data table, while X i represents the average of all JK values in the ith row. SSA ¼ SSB ¼ SSAB ¼
XXX i
j
k
i
j
k
XXX XXX i
SSE ¼
j
j
ðX j X Þ2
df ¼ J 1
ðX ij X i X j þ X Þ2
df ¼ ðI 1ÞðJ 1Þ
ðXijk X ij Þ2
df ¼ IJK IJ
ðXijk X Þ2
df ¼ IJK 1
k
XXX i
df ¼ I 1
k
XXX i
SST ¼
j
ðX i X Þ2
k
690
11
The Analysis of Variance
SSAB is called the interaction sum of squares. Mean squares are, as always, defined by (sum of squares)/df. The fundamental ANOVA identity is SST ¼ SSA þ SSB þ SSAB þ SSE
According to the fundamental identity, variation is partitioned into four pieces: unexplained (SSE—which would be present whether or not any of the three null hypotheses was true) and three pieces that may be explained by the truth or falsity of the three H0’s. The expected mean squares suggest how each set of hypotheses should be tested using the appropriate ratio of mean squares with MSE in the denominator: EðMSEÞ ¼ r2 EðMSAÞ ¼ r2 þ
I JK X a2 I 1 i¼1 i
EðMSBÞ ¼ r2 þ
J IK X b2 J 1 j¼1 j
EðMSABÞ ¼ r2 þ
I X J X K c2 ðI 1ÞðJ 1Þ i¼1 j¼1 ij
Each of the three mean-square ratios can be shown to have an F distribution when the associated H0 is true, which yields the following level a test procedures. Hypotheses
Test Statistic Value
Rejection Region
H0A versus HaA H0B versus HaB H0AB versus HaAB
fA ¼ MSA=MSE fA ¼ MSB=MSE fAB ¼ MSAB=MSE
fA Fa;I1;IJðK1Þ fB Fa;J1;IJðK1Þ fAB Fa;ðI1ÞðJ1Þ;IJðK1Þ
As before, the results of the analysis are summarized in an ANOVA table. Example 11.19 (Example 11.18 continued) The cell, row, column, and grand means for the given data are 10,000
20,000
30,000
40,000
xi
H Ife P
9.20 8.93 16.30
12.43 12.63 18.10
12.90 14.50 19.93
10.80 12.77 18.17
11.33 12.21 18.13
xj
11.48
14.39
15.78
13.91
x ¼ 13:89
Table 11.11 summarizes the resulting ANOVA computations.
11.5
Two-Factor ANOVA with Replication
691
Table 11.11 ANOVA table for Example 11.19 Source of variation
df
Sum of squares
Mean square
f
P-value
Varieties Density Interaction Error Total
2 3 6 24 35
327.60 86.69 8.03 38.04 460.36
163.80 28.90 1.34 1.59
fA = 103.02 fB = 18.18 .84 fAB =
50 when x = 25) = .0301
50 41 35 True regression line E(Y|x) = 65 – 1.2x
x 20
25
Figure 12.5 Probabilities based on the simple linear regression model
Suppose that Y1 denotes an observation on time to fracture made with x = 25 and Y2 denotes an independent observation made with x = 24. Then the difference Y1 − Y2 is normally distributed with mean value E(Y1 − Y2) = b1 = −1.2, variance V(Y1 − Y2) = r2 + r2 = 128, and standard deviation pffiffiffiffiffiffiffiffi 128 ¼ 11:314. The probability that Y1 exceeds Y2 is
0 ð1:2Þ PðY1 Y2 [ 0Þ ¼ P Z [ 11:314
¼ PðZ [ :11Þ ¼ :4562
That is, even though we expect Y to decrease when x increases by one unit, the probability is fairly high (but less than .5) that the observed Y at x + 1 will be larger than the observed Y at x. ■ Our discussion thus far has presumed that the explanatory variable is under the control of the investigator, so that only the response variable Y is random. This will not always be the case: if we take a random sample of college students and record the height and weight of each, neither variable is preselected, so both x and y could be considered random. Methods and conclusions of the next several sections can be applied both when the values of the explanatory variable are fixed in advance and when they are random, but because the derivations and interpretations are more straightforward in the former case, we will continue to work explicitly with it. For more commentary, see the excellent book by Michael Kutner et al. listed in the bibliography.
710
12
Regression and Correlation
Exercises: Section 12.1 (1–12) a. Construct a scatterplot of solubility versus pH, and describe what you see. Does it appear that a linear model would be appropriate here? b. Hydrogen ion concentration [H+] is related to pH by pH = –log10([H+]). Use this to calculate the hydrogen ion concentrations for each observation, then make a scatterplot of solubility versus [H+]. Does it appear that a linear model would fit this data well? c. Would a linear function fit the data in part (b) perfectly? That is, is it reasonable to assume a completely deterministic relationship here? Explain your reasoning.
1. Obesity is associated with higher foot load that can potentially increase pain and discomfort, but little research has been done on this relationship in children and its possible effects. A graph in the article “Childhood Obesity is Associated with Altered Plantar Pressure Distribution During Running” (Gait and Posture 2018: 202–205) gave the accompanying data on x = body mass index (kg/m2) and y = peak foot pressure (kPa) while running for a sample of 42 children. x 12.8
13.0 13.0
13.5 13.8
13.8 14.2
14.4 14.5
y
346
572
334
538
340
641
360
366
360
x 14.6
14.6 14.8
14.9 15.0
15.0 15.0
15.5 15.6
y
627
552
575
546
417
609
414
578
314
x 15.9
16.6 16.9
17.0 17.1
17.1 17.5
17.6 18.6
y
572
454
368
322
466
x 18.7 y
494
18.7 20.0
589
305
20.1 20.5
351
362
474
24.2 24.7
26.5
y
893
850
376
815
494
21.1 21.2
22.4 23.1 741
368
494
20.6 21.0
x 21.6 491
664
305
486
3. Bivariate data often arises from the use of two different techniques to measure the same quantity. As an example, the accompanying observations on x = hydrogen concentration (ppm) using a gas chromatography method and y = concentration using a new sensor method were read from a graph in the article “A New Method to Measure the Diffusible Hydrogen Content in Steel Weldments Using a Polymer Electrolyte-Based Hydrogen Sensor” (Welding Res., July 1997: 251s–256s).
382
a. Construct stem-and-leaf displays of both BMI and peak foot pressure, and comment on interesting features. b. Is the value of peak foot pressure completely and uniquely determined by BMI? Explain your reasoning. c. Construct a scatterplot of the data. Does it appear that peak foot pressure could be predicted by a child’s body mass index? Explain your reasoning. 2. Verapamil is used to treat certain heart conditions, including high blood pressure and arrhythmia. Studies continue on the factors that affect the drug’s absorption into the body. The article “The Effect of Nonionic Surfactant Brij 35 on Solubility and Acid-Base Equilibria of Verapamil” (J. Chem. Engr. Data 2017: 1776–1781) includes the following data on x = pH and y = Verapamil solubility (10−5 mol/L) at 25 °C for one such study. pH
8.12
8.32
8.41
8.62
8.70
8.84
8.88
9.09
sol.
53.4
32.3
22.2
14.6
13.9
8.76
5.06
5.57
x
47
62
65
70
70
78
95
100
114
118
y
38
62
53
67
84
79
93
106
117
116
x
124
127
140
140
140
150
152
164
198
221
y
127
114
134
139
142
170
149
154
200
215
Construct a scatterplot. Does there appear to be a very strong relationship between the two types of concentration measurements? Do the two methods appear to be measuring roughly the same quantity? Explain your reasoning. 4. A study to assess the capability of subsurface flow wetland systems to remove biochemical oxygen demand (BOD, a measure of organic matter in sewage) and various other chemical constituents resulted in the accompanying data on x = BOD mass loading (kg/ha/d) and y = BOD mass removal (kg/ha/d) (“Subsurface Flow
12.1
The Simple Linear Regression Model
711
Wetlands—A Performance Evaluation,” Water Environ. Res. 1995: 244–247). x
3
8
10
11
13
16
27
y
4
7
8
8
10
11
16
x
30
35
37
38
44
103
142
y
26
21
9
31
30
75
90
a. Construct boxplots of both mass loading and mass removal, and comment on any interesting features. b. Construct a scatterplot of the data, and comment on any interesting features. 5. The article “Objective Measurement of the Stretchability of Mozzarella Cheese” (J. Texture Stud. 1992: 185–194) reported on an experiment to investigate how the behavior of mozzarella cheese varied with temperature. Consider the accompanying data on x = temperature and y = elongation (%) at failure of the cheese. [Note: The researchers were Italian and used real mozzarella cheese, not the poor cousin widely available in the USA.] x
59
63
68
72
74
78
83
y
118
182
247
208
197
135
132
a. Construct a scatterplot in which the axes intersect at (0, 0). Mark 0, 20, 40, 60, 80, and 100 on the horizontal axis and 0, 50, 100, 150, 200, and 250 on the vertical axis. b. Construct a scatterplot in which the axes intersect at (55, 100), as was done in the cited article. Does this plot seem preferable to the one in part (a)? Explain your reasoning. c. What do the plots of parts (a) and (b) suggest about the nature of the relationship between the two variables? 6. One factor in the development of tennis elbow, a malady that strikes fear in the hearts of all serious tennis players, is the impact-induced vibration of the racket-andarm system at ball contact. It is well known that the likelihood of getting tennis elbow depends on various properties of the racket
used. Consider the scatterplot of x = racket resonance frequency (Hz) and y = sum of peak-to-peak acceleration (a characteristic of arm vibration, in m/s/s) for n = 23 different rackets (“Transfer of Tennis Racket Vibrations into the Human Forearm,” Med. Sci. Sports Exercise 1992: 1134–1140). Discuss interesting features of the data and scatterplot. y
38 36 34 32 30 28 26 24 22 100 110 120 130 140 150 160 170 180 190
x
7. Data from the EPA’s Fuel Efficiency Guide suggests an approximate linear relationship between y = highway fuel efficiency (mpg) and x = weight (lbs) for midsize cars. Suppose the equation of the true regression line is f(x) = 70 − .0085x. a. What is the expected value of highway fuel efficiency when weight = 2500 lbs? b. By how much can we expect highway fuel efficiency to change when weight increases by 1 lb? c. Answer part (b) for an increase of 500 lbs. d. Answer part (b) for a decrease of 500 lbs. 8. Referring to the previous exercise, suppose that the random deviation e is normally distributed with standard deviation 4.6 mpg. a. What is the probability that the observed value of highway fuel efficiency will exceed 30 mpg when the car’s weight is 4000 lbs? b. Repeat part (a) with 5000 in place of 4000.
712
12
c. Consider making two independent observations on highway fuel efficiency, the first for a car weighing 4000 lbs and the second for x = 5000. What is the probability that the first observation will exceed the second by more than 5 mpg? d. Let Y1 and Y2 denote observations on highway fuel efficiency when x = x1 and x = x2, respectively. By how much would x2 have to exceed x1 in order that P(Y1 > Y2) = .95? 9. The flow rate y (m3/min) in a device used for air-quality measurement depends on the pressure drop x (in. of water) across the device’s filter. Suppose that for x values between 5 and 20, the two variables are related according to the simple linear regression model with true regression line E(Y|x) = −.12 + .095x. a. What is the expected change in flow rate associated with a 1-in. increase in pressure drop? Explain. b. What change in flow rate can be expected when pressure drop decreases by 5 in.? c. What is the expected flow rate for a pressure drop of 10 in.? A drop of 15 in.? d. Suppose r = .025 and consider a pressure drop of 10 in. What is the probability that the observed value of flow rate will exceed .835? That observed flow rate will exceed .840? e. What is the probability that an observation on flow rate when pressure drop is 10 in. will exceed an observation on flow rate made when pressure drop is 11 in.? 10. Suppose the expected cost of a production run is related to the size of the run by the equation E(Y|x) = 4000 + 10x. Let Y denote an observation on the cost of a run. Assuming that the variables size and cost are related according to the simple linear regression model, could it be the case that
Regression and Correlation
P(Y > 5500 when x = 100) = .05 and P(Y > 6500 when x = 200) = .10? Explain. 11. Suppose that in a certain chemical process the reaction time y (hr) is related to the temperature (°F) in the chamber in which the reaction takes place according to the simple linear regression model with equation E(Y|x) = 5.00 −.01x and r = .075. a. What is the expected change in reaction time for a 1 °F increase in temperature? For a 10 °F increase in temperature? b. What is the expected reaction time when temperature is 200 °F? When temperature is 250 °F? c. Suppose five observations are made independently on reaction time, each one for a temperature of 250 °F. What is the probability that all five times are between 2.4 and 2.6 h? d. What is the probability that two independently observed reaction times for temperatures 1° apart are such that the time at the higher temperature exceeds the time at the lower temperature? 12. The article “On the Theoretical Velocity Distribution and Flow Resistance in Natural Channels” (J. Hydrol. 2017: 777–785) suggests a quadratic relationship between x = flow depth (m) and y = water surface slope at certain points along the Tiber River in Italy. Suppose the variables are related by Equation (12.1) with f ðxÞ ¼ 0:6x2 þ 5x þ 1 (similar to the equation suggested in the article). a. What is the expected water surface slope when the flow depth is 2.0 m? 2.5 m? 3.0 m? b. Does the expected water surface slope change by a fixed amount for each 1-m increase in flow depth? Explain. c. Determine a flow depth for which the expected surface slope is the same as the expectation for x = 2.0 m obtained in part (a). [Note: Your answer should be something other than 2.0.]
12.1
The Simple Linear Regression Model
d. For what depth is expected water surface slope maximized? e. Assume the rv e in Equation (12.1) has a standard normal distribution; this is consistent with information in the
12.2
713
article. At a flow depth of 3.0 m, what’s the probability the water surface slope Y is greater than 10? Less than 6?
Estimating Model Parameters
We will assume in this and the next several sections that the variables x and y are related according to the simple linear regression model. The values of the parameters b0, b1, and r will almost never be known to an investigator. Instead, sample data consisting of n observed pairs (x1, y1), …, (xn, yn) will be available, from which the model parameters and the true regression line itself can be estimated. These observations are assumed to have been obtained independently of each other. According to the model, the observed points will be distributed about the true regression line f ðxÞ ¼ b0 þ b1 x in a random manner. Figure 12.6 shows a scatterplot of observed pairs along with two candidates for the estimated regression line, y ¼ a0 þ a1 x and y ¼ b0 þ b1 x. Intuitively, the line y ¼ a0 þ a1 x is not a reasonable estimate of the true line because, if y ¼ a0 þ a1 x were the true line, the observed points would almost surely have been closer to this line. The line y ¼ b0 þ b1 x is a more plausible estimate because the observed points are scattered rather closely about this line. y y = b0 + b1x
y = a0 + a1x
x
Figure 12.6 Two different estimates of the true regression line: one good and one bad
Figure 12.6 and the foregoing discussion suggest that our estimate of b0 þ b1 x should be a line that provides, in some sense, a “best fit” to the observed data points. This is what motivates the principle of least squares, which can be traced back to the mathematicians Gauss and Legendre around the year 1800. According to this principle, a line provides a good fit to the data if the vertical distances or deviations from the observed points to the line (see Figure 12.7) are small. The proposed measure of the goodness-of-fit is the sum of the squares of these deviations; the best-fit line is then the one having the smallest possible sum of squared deviations.
714
12
Regression and Correlation
Time to fracture (hr)
y 80
y = b0 + b1x
60 40 20 x 10
20
30
40
Applied stress (kg/mm2)
Figure 12.7 Deviations of observed data from line y = b0 + b1x
PRINCIPLE OF LEAST SQUARES
The vertical deviation of the point (xi, yi) from a line y ¼ b0 þ b1 x is height of point height of line ¼ yi ðb0 þ b1 xi Þ The sum of squared vertical deviations from the points (x1, y1), …, (xn, yn) to the line is then gðb0 ; b1 Þ ¼
n X
½yi ðb0 þ b1 xi Þ2
i¼1
The point estimates of b0 and b1, denoted by b^0 and b^1 and called the least squares estimates, are those values that minimize gðb0 ; b1 Þ. The estimated regression line or least squares regression line (LSRL) is then the line whose equation is y ¼ b^0 þ b^1 x. The minimizing values of b0 and b1 are found by taking partial derivatives of gðb0 ; b1 Þ with respect to both b0 and b1, equating them both to zero, and solving the equations @gðb0 ; b1 Þ X ¼ 2ðyi b0 b1 xi Þð1Þ ¼ 0 @b0 @gðb0 ; b1 Þ X ¼ 2ðyi b0 b1 xi Þðxi Þ ¼ 0 @b1 Cancellation of the factor 2 and re-arrangement gives the following system of equations, called the normal equations: X X nb0 þ x i b1 ¼ yi X X X x i b0 þ x2i b1 ¼ xi yi
12.2
Estimating Model Parameters
715
The normal equations are linear in the two unknowns b0 and b1. Provided that at least two of the xi’s are different, the least squares estimates are the unique solution to this linear system. The least squares estimate of the slope coefficient b1 of the true regression line is
PROPOSITION
b1 ¼ b^1 ¼
P
ðxi xÞðyi yÞ P ðxi xÞ2
ð12:2Þ
The least squares estimate of the intercept b0 of the true regression line is b0 ¼ b^0 ¼
P
P yi b^1 xi ¼ y b^1 x n
ð12:3Þ
Moreover, under the normality assumption of the simple linear regression model, b^0 and b^1 are also the maximum likelihood estimates (see Exercise 23). Because they will feature prominently here and in subsequent sections, we define the following notation for certain sums: Sxy ¼
n X
ðxi xÞðyi yÞ Sxx ¼
i¼1
n X
ðxi xÞ2
Syy ¼
i¼1
n X
ðyi yÞ2
i¼1
The Sxx formula was presented in Chapter 1 in connection with the sample variance: s2x ¼ Sxx =ðn 1Þ and similarly for y. The least squares estimates of the regression coefficients can then be written as Sxy b^1 ¼ Sxx
and b^0 ¼ y b^1 x
We emphasize that before b^1 and b^0 are computed, a scatterplot should be examined to see whether a linear probabilistic model is plausible. If the points do not tend to cluster about a straight line with roughly the same degree of spread for all x (e.g., Figure 12.2), then other models should be investigated. Example 12.4 As brick-and-mortar shops decline and online retailers like Amazon and Wayfair ascend, demand for warehouse storage space has steadily increased. Despite effectively being empty shells, warehouses still require professional appraisal. The following data on x = truss height (ft), which determines how high stored goods can be stacked, and y = sale price ($) per square foot appeared in the article “Challenges in Appraising ‘Simple’ Warehouse Properties” (The Appraisal J. 2001: 174–178).
Warehouse
1
2
3
4
5
6
7
8
9
10
x y
12 35.53
14 37.82
14 36.90
15 40.00
15 38.00
16 37.50
18 41.00
22 48.50
22 47.00
24 47.50
Warehouse x y
11 24 46.20
12 26 50.35
13 26 49.13
14 27 48.07
15 28 50.90
16 30 54.78
17 30 54.32
18 33 57.17
19 36 57.45
716
12
Regression and Correlation
From the sample data, 1 X 1 X xi ¼ 22:737 ft; yi ¼ 46:217 $=ft2 y¼ 19 19 X X Sxx ¼ ðxi 22:737Þ2 ¼ 913:684 Syy ¼ ðyi 46:217Þ2 ¼ 924:436 X Sxy ¼ ðxi 22:737Þðyi 46:217Þ ¼ 901:944 x¼
Sxy 901:944 ¼ 0:987 b^1 ¼ ¼ Sxx 913:684 b^ ¼ y b^ x ¼ 46:217 0:987ð22:737Þ ¼ 23:8 0
1
The equation of the LSRL is y = 23.8 + .987x. We estimate that the change in expected sale price associated with a 1-ft increase in truss height is .987, or about 99 cents per square foot. The intercept of 23.8, while important for correctly summarizing the data, does not have a direct interpretation— after all, it doesn’t make sense for a warehouse to have a truss height of x = 0 feet (how would you store anything?). Figure 12.8, generated by the statistical software package R, shows that the least squares line provides an excellent summary of the relationship between the two variables.
Figure 12.8 A scatterplot of the data in Example 12.4 with the LSRL superimposed, from R
■
The LSRL can immediately be used for two different purposes. For a fixed x value x , b^0 þ b^1 x (the height of the line above x ) gives both (1) a point estimate of the mean value of Y when x = x and (2) a point prediction of the Y value that will result from a single new observation made at x = x . The least squares line should not be used to make a prediction for an x value much beyond the range of the data, such as x = 5 or x = 45 in Example 12.4. The danger of extrapolation is that the fitted relationship (a line here) may not be valid for such x values. Example 12.5 (Example 12.4 continued) A point estimate for the true average price for all warehouses with 25-ft truss height is
12.2
Estimating Model Parameters
717
^Yj25 ¼ b^0 þ b^1 ð25Þ ¼ 23:8 þ :987ð25Þ ¼ $48:48=ft2 l This also represents a point prediction for the price of a single warehouse with 25-ft truss height. Notice that although no sample observations had x = 25, this value lies in the “middle” of the set of x values (see Figure 12.8). This is an example of interpolation: using the LSRL for x values that were unseen but are consistent with the sample data. A point estimate for the true average price for all warehouses with 50-ft truss height is ^Yj50 ¼ b^0 þ b^1 ð50Þ ¼ 23:8 þ :987ð50Þ ¼ $73:15=ft2 l However, because this calculation involves an extrapolation—the value x = 50 is well outside the bounds of the available data—we have much less faith that this estimated cost is accurate. ■ Residuals and Estimating r The parameter r determines the amount of variability inherent in the regression model. A large value of r will lead to observed (xi, yi)’s that are typically quite spread out about the true regression line, whereas when r is small the observed points will tend to fall very close to the true line (see Figure 12.9). An estimate of r will be used in confidence interval formulas and hypothesis-testing procedures presented in the next two sections. Because the equation of the true line is unknown, the estimate is based on the extent to which the sample observations deviate from the estimated line.
a y
b Elongation
y
Product sales 0
0
x
1x
1x
Tensile force
x
Advertising expenditure
Figure 12.9 Typical sample for r: (a) small; (b) large
DEFINITION
The fitted (or predicted) values ^y1 ; . . .; ^yn are obtained by successively substituting the x values x1, …, xn into the equation of the LSRL: the ith fitted value is ^yi ¼ b^0 þ b^1 xi ¼ y þ b^1 ðxi xÞ The residuals e1 ; . . .; en are the vertical deviations from the LSRL: the ith residual is ei ¼ yi ^yi ¼ yi ðb^0 þ b^1 xi Þ ¼ ðyi yÞ b^1 ðxi xÞ
ð12:4Þ
718
12
Regression and Correlation
In words, the predicted value ^yi is the value of y that we would predict or expect when using the estimated regression line with x ¼ xi ; ^yi is the height of the estimated regression line above xi. The residual ei is the difference between the observed yi and the predicted ^yi . Assuming that the line in Figure 12.7 is the least squares line, the residuals are identified by the vertical line segments from the observed points to the line. In fact, the principle of least squares is equivalent to determining the line for which the sum of squared residuals is minimized. If the residuals are all small in magnitude, then much of the variability in observed y values appears to be due to the linear relationship between x and y, whereas many large residuals suggest quite a bit of inherent variability in y relative to the amount due to the linear relation. The residuals from the LSRL P always satisfy ei ¼ 0 and so e ¼ 0 (see Exercise 24; in practice, the sum may deviate a bit from zero due to rounding). The ith residual ei may also be regarded as a proxy for the unobservable “true” error ei for the ith observation: true error: ei ¼ yi ðb0 þ b1 xi Þ estimated error: ei ¼ ^ei ¼ yi ðb^0 þ b^1 xi Þ The sum of squared residuals is used here to estimate the standard deviation r of the ei ’s in the same way that the sum of squares Sxx was previously used to estimate a population sd. DEFINITION
The error sum of squares (or residual sum of squares), denoted by SSE, is SSE ¼
X
ðei eÞ2 ¼
X
e2i ¼
X
ðyi ^yi Þ2
and the least squares estimate of r2 is P SSE ðyi ^yi Þ2 ^ ¼ ¼ ¼ r n2 n2 pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi The estimate se ¼ SSE=ðn 2Þ of r is called the residual standard deviation. 2
s2e
The divisor n − 2 in se is the number of degrees of freedom (df) associated with the estimate (or, equivalently, with the error sum of squares). This is because to obtain se, the two parameters b0 and b1 must first be estimated, which results in a loss of 2 df (just as l had to be estimated in one-sample problems, resulting in an estimated variance based on n − 1 df). Equivalently, the normal equations impose two constraints; as a result, if n – 2 of the residuals are known, then the remaining two are completely determined (so only n – 2 are freely determined; see Exercise 24). Replacing each yi in the formula for se by the rv Yi gives the estimator Se. It can be shown that S2e is an unbiased estimator for r2, although the estimator Se is biased for r. (The mle of r2 based on the normal model has divisor n rather than n − 2, so it is biased.) The interpretation of se here is similar to that of r given earlier. Roughly speaking, se is the size of a “typical” or “representative” deviation from the least squares line.
12.2
Estimating Model Parameters
719
Example 12.6 Japan’s high population density has resulted in a multitude of resource usage problems. One especially serious difficulty concerns waste removal. The article “Innovative Sludge Handling Through Pelletization Thickening” (Water Res. 1999: 3245–3252) reported the development of a new compression machine for processing sewage sludge. An important part of the investigation involved relating the moisture content of compressed pellets (y, in %) to the machine’s filtration rate (x, in kg-DS/m/h). The following data was read from a graph in the paper: x
125.3
98.2
201.4
147.3
145.9
124.7
112.2
120.2
161.2
178.9
y
77.9
76.8
81.5
79.8
78.2
78.3
77.5
77.0
80.1
80.2
x
159.5
145.8
75.1
151.4
144.2
125.0
198.8
132.5
159.6
110.7
y
79.9
79.0
76.7
78.2
79.5
78.1
81.5
77.0
79.0
78.6
Relevant summary quantities are x ¼ 140:895, y ¼ 78:74, Sxx ¼ 18;921:8295, and Sxy ¼ 776:434, from which b^1 ¼
776:434 ¼ :04103377 :041 18;921:8295
b^0 ¼ 78:74 ð:04103377Þð140:895Þ ¼ 72:958547 72:96 The equation of the least squares line is y ¼ 72:96 þ :041x. For numerical accuracy, the fitted values are calculated from ^yi ¼ 72:958547 þ :04103377xi : ^y1 ¼ 72:958547 þ :04103377ð125:3Þ 78:100;
e1 ¼ y1 ^y1 :200; etc:
A positive residual corresponds to a point in the scatterplot that lies above the graph of the least squares line, whereas a negative residual results from a point lying below the line. All predicted values (fits) and residuals appear in the accompanying table. Obs. (i) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Filtrate (xi)
Moist. Con. (yi)
Fit (ŷi)
Residual (ei)
125.3 98.2 201.4 147.3 145.9 124.7 112.2 120.2 161.2 178.9 159.5 145.8 75.1 151.4 144.2 125.0 198.8 132.5 159.6 110.7
77.9 76.8 81.5 79.8 78.2 78.3 77.5 77.0 80.1 80.2 79.9 79.0 76.7 78.2 79.5 78.1 81.5 77.0 79.0 78.6
78.100 76.988 81.223 79.003 78.945 78.075 77.563 77.891 79.573 80.299 79.503 78.941 76.040 79.171 78.876 78.088 81.116 78.396 79.508 77.501
−0.200 −0.188 0.277 0.797 −0.745 0.225 −0.063 −0.891 0.527 −0.099 0.397 0.059 0.660 −0.971 −0.624 0.012 0.384 −1.396 −0.508 1.099
720
12
Regression and Correlation
It can be verified that, rounding error notwithstanding, the residuals (the last column) sum to 0. The corresponding residual sum of squares is SSE ¼ ð:200Þ2 þ ð:188Þ2 þ þ ð1:099Þ2 ¼ 7:968 ^2 ¼ s2e ¼ 7:968=ð20 2Þ ¼ :4427, and the residual standard deviation is The estimate of r2 is then r pffiffiffiffiffiffiffiffiffiffiffi ^ ¼ se ¼ :4427 ¼ :665. Roughly speaking, .665 is the typical difference between the actual r moisture concentration of a specimen and its predicted moisture concentration based on the LSRL. ■ Computation of SSE from the defining formula involves much tedious arithmetic, because both the predicted values and residuals must first be calculated. Use of the following formula does not require P P 2 these quantities (again see Exercise 24), though yi and yi are needed. SSE ¼ Syy S2xy =Sxx The Coefficient of Determination Figure 12.10 shows three different scatterplots of bivariate data. In all three plots, the heights of the different points vary substantially, indicating that there is much variability in observed y values. The points in the first plot all fall exactly on a straight line. In this case, all (100%) of the sample variation in y can be attributed to the fact that x and y are linearly related in combination with variation in x. The points in Figure 12.10b do not fall exactly on a line, but compared to overall y variability, the deviations from the least squares line are small. It is reasonable to conclude in this case that much of the observed y variation can be attributed to the approximate linear relationship between the variables postulated by the simple linear regression model. When the scatterplot looks like that of Figure 12.10c, there is substantial variation about the least squares line relative to overall y variation, so the simple linear regression model fails to explain much of the variation in y by relating y to x.
a
b
c
y
y
y
x
x
x
Figure 12.10 Explaining y variation: (a) all variation explained; (b) most variation explained; (c) little variation explained
The error sum of squares SSE can be interpreted as a measure of how much variation in y is left unexplained by the model—that is, how much cannot be attributed to a linear relationship. In Figure 12.10a, SSE = 0 and there is no unexplained variation, whereas unexplained variation is small for the data of Figure 12.10b and much larger in Figure 12.10c. A quantitative measure of the total amount of variation in the observed y values is given by the total sum of squares
12.2
Estimating Model Parameters
721
SST ¼
X
ðyi yÞ2 ¼ Syy
Figure 12.11 illustrates the difference between these two sums of squares. The (x, y) points in the two scatterplots are identical. While SSE measures the deviation of the y values from the LSRL (a “model” that uses x as a predictor), SST measures the deviation of the y values from the horizontal line y ¼ y (essentially ignoring the presence of x). Since the least squares line is by definition the line having the smallest sum of squared vertical deviations, SSE can’t be any larger than SST, and usually it is much smaller.
a
b y
y
Horizontal line at height y Least squares line y
x
x
Figure 12.11 Sums of squares illustrated: (a) SSE = sum of squared deviations about the least squares line; (b) SST = sum of squared deviations about the horizontal line y ¼ y
Dividing SSE by SST gives the proportion of total variation that is not explained by the approximate linear relationship. Subtracting this ratio from 1 results in the proportion of total variation that is explained by the relationship. DEFINITION
The coefficient of determination, denoted by R2, is given by R2 ¼ 1
SSE SST
R2 is interpreted as the proportion of observed y variation that can be explained by the simple linear regression model (i.e., attributed to an approximate linear relationship between y and x). The closer R2 is to 1, the more successful the simple linear regression model is in explaining y variation. Multiplying R2 by 100 gives the percentage of total variation explained by the relationship; software often reports R2 this way. Said differently, R2 is the proportion by which the error sum of squares is reduced by the regression line compared to the horizontal line. For example, if SST = 20 and SSE = 2, then R2 ¼ 1 ð2=20Þ = .9, so the regression reduces the error sum of squares by 90%. Although it is common to have R2 values of .9 or more in engineering and the physical sciences, 2 R is likely to be much smaller in social sciences such as psychology and sociology, where values far less than .5 are common but still considered important.
722
12
Regression and Correlation
Example 12.7 (Example 12.5 continued) The scatterplot of the truss height-sale price data in Figure 12.8 indicates a fairly high R2 value. Previous computations showed that SST = Syy = 924.436, Sxy = 901.944, and Sxx = 913.684. Using the computational shortcut, SSE ¼ 924:436 ð901:944Þ2 =913:684 ¼ 34:081 The coefficient of determination is then R2 ¼ 1
34:081 ¼ 1 :037 ¼ :963 924:436
That is, 96.3% of the observed variation in warehouse price is attributable to (can be explained by) the approximate linear relationship between price and truss height, a fairly impressive result. The R2 can also be interpreted by saying that the error sum of squares using the regression line is 96.3% less than the error sum of squares using a horizontal line (i.e., ignoring truss height). Figure 12.12 shows partial Minitab output for the warehouse data; the package will also provide the predicted values and residuals upon request, as well as other information. The formats used by other packages differ slightly from that of Minitab, but the information content is very similar. Quantities in Figure 12.12 such as the standard deviations, t ratios, and the details of the ANOVA table are discussed in Section 12.3. The regression equation is Sales Price = 23.8 + 0.987 Truss Height Predictor Constant Truss Height
Coef
SE Coef
T
P
23.772← ˆ0 1.113 21.35 0.000 ˆ 0.98715← 1 0.04684 21.07 0.000 R-Sq = 96.3%←100R2
S = 1.41590←se
R-Sq(adj) = 96.1%
Analysis of Variance Source Regression Residual Error Total
DF 1 17 18
SS 890.36 34.08←SSE 924.44←SST
MS 890.36 2.00
F 444.11
P 0.000
■
Figure 12.12 Minitab output for the regression of Example 12.7
For regression there is an analysis of variance identity like the fundamental identity (11.1) in Chapter 11. Add and subtract ^yi in the total sum of squares: SST ¼
X
ðyi yÞ2 ¼
X
½ðyi ^yi Þ þ ð^yi yÞ2 ¼
X
ðyi ^yi Þ2 þ
X
ð^yi yÞ2
12.2
Estimating Model Parameters
723
Notice that the middle (cross-product) term is missing on the right; see Exercise 24 for the justification. P Of the two sums on the right, the first is SSE ¼ ðyi ^yi Þ2 and the second is something new, the P regression sum of squares, SSR ¼ ð^yi yÞ2 . The analysis of variance identity for regression is SST ¼ SSE þ SSR
ð12:5Þ
The coefficient of determination can now be written in a slightly different way: R2 ¼ 1
SSE SST SSE SSR ¼ ¼ SST SST SST
The ANOVA table in Figure 12.12 shows that SSR = 890.36, from which R2 = 890.36/924.44 = .936 as before. Hence we interpret the regression sum of squares SSR as the amount of total variation explained by the model, so that R2 is the ratio of explained variation to total variation. Exercises: Section 12.2 (13–30) Termination Metallurgy” (Plating and Surface Finishing, Jan. 1997: 38–40). Do you agree with the claim by the article’s author that “a linear relationship was obtained from the tin–lead rate of deposition as a function of current density”? Explain your reasoning.
13. Exercise 4 gave data on x = BOD mass loading and y = BOD mass removal. Values of relevant summary quantities are n ¼ 14 Sxy ¼ 13;048
P
xi ¼ 517 Sxx ¼ 20;003
P
yi ¼ 346 Syy ¼ 8903
a. Obtain the equation of the least squares line. b. Predict the value of BOD mass removal for a single observation made when BOD mass loading is 35, and calculate the value of the corresponding residual. c. Calculate SSE and then a point estimate of r. d. What proportion of observed variation in removal can be explained by the approximate linear relationship between the two variables? e. The last two x values, 103 and 142, are much larger than the others. How are the equation of the least squares line and the value of R2 affected by deletion of the two corresponding observations from the sample? Adjust the given values of the summary quantities, and use the fact that the new value of SSE is 311.79. 14. The accompanying data on x = current density (mA/cm2) and y = rate of deposition (mm/min) appeared in the article “Plating of 60/40 Tin/Lead Solder for Head
x
20
40
60
80
y
.24
1.20
1.71
2.22
15. The efficiency ratio for a steel specimen immersed in a phosphating tank is the weight of the phosphate coating divided by the metal loss (both in mg/ft²). The article "Statistical Process Control of a Phosphate Coating Line" (Wire J. Internat., May 1997: 78–81) gave the accompanying data on tank temperature (x) and efficiency ratio (y).

Temp.  | 170 | 172  | 173  | 174  | 174  | 175  | 176
Ratio  | .84 | 1.31 | 1.42 | 1.03 | 1.07 | 1.08 | 1.04

Temp.  | 177  | 180  | 180  | 180  | 180  | 180  | 181
Ratio  | 1.80 | 1.45 | 1.60 | 1.61 | 2.13 | 2.15 | .84

Temp.  | 181  | 182 | 182  | 182  | 182  | 184  | 184
Ratio  | 1.43 | .90 | 1.81 | 1.94 | 2.68 | 1.49 | 2.52

Temp.  | 185  | 186  | 188
Ratio  | 3.00 | 1.87 | 3.08
a. Determine the equation of the estimated regression line.
b. Calculate a point estimate for true average efficiency ratio when tank temperature is 182. c. Calculate the values of the residuals from the least squares line for the four observations for which temperature is 182. Why do they not all have the same sign? d. What proportion of the observed variation in efficiency ratio can be attributed to the simple linear regression relationship between the two variables?
16. The scientist Francis Galton, an early developer of regression methodology, used "midparent height," the average of the father's and mother's heights, in order to predict children's heights. Here are the heights of 11 female students along with their midparent heights in inches:

Midparent | 65.5 | 66.0 | 65.5 | 71.5 | 68.0 | 70.0
Daughter  | 65.0 | 64.0 | 63.0 | 69.0 | 69.0 | 69.0

Midparent | 67.0 | 70.5 | 69.5 | 64.5 | 67.5
Daughter  | 63.0 | 68.5 | 69.0 | 64.0 | 67.0

a. Construct a scatterplot of daughter's height against the midparent height and comment on the strength of the relationship.
b. Is the daughter's height completely and uniquely determined by the midparent height? Explain.
c. Use the accompanying Minitab output to obtain the equation of the least squares line for predicting daughter height from midparent height, and then predict the height of a daughter whose midparent height is 70 in. Would you feel comfortable using the least squares line to predict daughter height when midparent height is 74 in.? Explain.

Predictor    Coef     SE Coef   T      P
Constant     1.65     13.36     0.12   0.904
midparent    0.9555   0.1971    4.85   0.001

S = 1.45061    R-Sq = 72.3%    R-Sq(adj) = 69.2%

Analysis of Variance
Source            DF    SS       MS       F       P
Regression         1    49.471   49.471   23.51   0.001
Residual Error     9    18.938    2.104
Total             10    68.409

d. What are the values of SSE, SST, and the coefficient of determination? How well does the midparent height account for the variation in daughter height?

17. The article "Characterization of Highway Runoff in Austin, Texas, Area" (J. Environ. Engr. 1998: 131–137) gave a scatterplot, along with the least squares line, of x = rainfall volume (m³) and y = runoff volume (m³) for a particular location. The accompanying values were read from the plot.

x |  5 | 12 | 14 | 17 | 23 | 30 | 40 | 47
y |  4 | 10 | 13 | 15 | 15 | 25 | 27 | 46

x | 55 | 67 | 72 | 81 | 96 | 112 | 127
y | 38 | 46 | 53 | 70 | 82 |  99 | 100

a. Does a scatterplot of the data support the use of the simple linear regression model?
b. Calculate point estimates of the slope and intercept of the population regression line.
c. Calculate a point estimate of the true average runoff volume when rainfall volume is 50.
d. Calculate a point estimate of the standard deviation σ.
e. What proportion of the observed variation in runoff volume can be attributed to the simple linear regression relationship between runoff and rainfall?

18. A regression of y = calcium content (g/L) on x = dissolved material (mg/cm²) was reported in the article "Use of Fly Ash or Silica Fume to Increase the Resistance of Concrete to Feed Acids" (Mag. Concrete Res. 1997: 337–344). The equation of the estimated regression line was y = 3.678 + .144x, with R² = .860, based on n = 23.
a. Interpret the estimated slope .144 and the coefficient of determination .860.
b. Calculate a point estimate of the true average calcium content when the amount of dissolved material is 50 mg/cm².
c. The value of total sum of squares was SST = 320.398. Calculate an estimate of the error standard deviation σ in the simple linear regression model.

19. The cetane number is a critical property in specifying the ignition quality of a fuel used in a diesel engine. Determination of this number for a biodiesel fuel is expensive and time-consuming. The article "Relating the Cetane Number of Biodiesel Fuels to Their Fatty Acid Composition: A Critical Study" (J. Automobile Engr. 2009: 565–583) included the following data on x = iodine value (g) and y = cetane number for a sample of 14 biofuels. The iodine value is the amount of iodine necessary to saturate a sample of 100 g of oil. The article's authors fit the simple linear regression model to this data, so let's follow their lead.

x | 132.0 | 129.0 | 120.0 | 113.2 | 105.0 | 92.0 | 84.0
y |  46.0 |  48.0 |  51.0 |  52.1 |  54.0 | 52.0 | 59.0

x | 83.2 | 88.4 | 59.0 | 80.0 | 81.5 | 71.0 | 69.2
y | 58.7 | 61.6 | 64.0 | 61.4 | 54.6 | 58.8 | 58.0
a. Obtain the equation of the least squares line, and then calculate a point prediction of the cetane number that would result from a single observation with an iodine value of 100.
b. Calculate and interpret the coefficient of determination.
c. Calculate and interpret a point estimate of the model standard deviation σ.

20. A number of studies have shown lichens (certain plants composed of an alga and a fungus) to be excellent bioindicators of air pollution. The article "The Epiphytic Lichen Hypogymnia physodes as a Biomonitor of Atmospheric Nitrogen and Sulphur Deposition in Norway" (Environ. Monit. Assess. 1993: 27–47) gives the following data (read from a graph) on x = NO₃ wet deposition (gN/m²) and y = lichen N (% dry weight):

x | .05 | .10 | .11 | .12 | .31 | .37 |  .42
y | .48 | .55 | .48 | .50 | .58 | .52 | 1.02

x | .58 | .68 |  .68 | .73 |  .85 |  .92
y | .86 | .86 | 1.00 | .88 | 1.04 | 1.70

The author used simple linear regression to analyze the data. Use the accompanying Minitab output to answer the following questions:
a. What are the least squares estimates of β₀ and β₁?
b. Predict lichen N for an NO₃ deposition value of .5.
c. What is the estimate of σ?
d. What is the value of total variation, and how much of it can be explained by the model relationship?

The regression equation is
lichen N = 0.365 + 0.967 No3depo

Predictor   Coef      Stdev     t ratio   P
Constant    0.36510   0.09904   3.69      0.004
No3depo     0.9668    0.1829    5.29      0.000

S = 0.1932    R-sq = 71.7%    R-sq(adj) = 69.2%

Analysis of Variance
Source        DF    SS       MS       F       P
Regression     1    1.0427   1.0427   27.94   0.000
Error         11    0.4106   0.0373
Total         12    1.4533
21. Visual and musculoskeletal problems associated with the use of visual display terminals (VDTs) have become rather common in recent years. Some researchers have focused on vertical gaze direction as a source of eye strain and irritation. This direction is known to be closely related to ocular surface area (OSA), so a method of measuring OSA is needed. The accompanying representative data on y = OSA (cm2) and x = width of the palprebal fissure (i.e., the horizontal width of the eye
opening, in cm) is from the article "Analysis of Ocular Surface Area for Comfortable VDT Workstation Layout" (Ergonomics 1996: 877–884).

x |  .40 |  .42 | .48 | .51 |  .57 |  .60 |  .70 |  .75
y | 1.02 | 1.21 | .88 | .98 | 1.52 | 1.83 | 1.50 | 1.80

x |  .75 |  .78 |  .84 |  .95 |  .99 | 1.03 | 1.12
y | 1.74 | 1.63 | 2.00 | 2.80 | 2.48 | 2.47 | 3.05

x | 1.15 | 1.20 | 1.25 | 1.25 | 1.28 | 1.30 | 1.34 | 1.37
y | 3.18 | 3.76 | 3.68 | 3.82 | 3.21 | 4.27 | 3.12 | 3.99

x | 1.40 | 1.43 | 1.46 | 1.49 | 1.55 | 1.58 | 1.60
y | 3.75 | 4.10 | 4.18 | 3.77 | 4.34 | 4.21 | 4.92
a. Construct a scatterplot of this data. Describe what you find.
b. Calculate the equation of the LSRL.
c. Interpret the slope of the LSRL.
d. What OSA would you predict for a subject whose palprebal fissure width is 1.25 cm?
e. What would be the estimate of expected OSA for people with palprebal fissure width of 1.25 cm?

22. For many years, rubber powder has been used in asphalt cement to improve performance. The article "Experimental Study of Recycled Rubber-Filled High-Strength Concrete" (Mag. Concrete Res. 2009: 549–556) included a regression of y = axial strength (MPa) on x = cube strength (MPa) based on the following sample data:

x | 112.3 | 97.0 | 92.7 | 86.0 | 102.0
y |  75.0 | 71.0 | 57.7 | 48.7 |  74.3

x | 99.2 | 95.8 | 103.5 | 89.0 | 86.7
y | 73.3 | 68.0 |  59.3 | 57.8 | 48.5
a. Verify that a scatterplot supports the assumption that the two variables are related via the simple linear regression model.
b. Obtain the equation of the least squares line, and interpret its slope.
c. Calculate and interpret the coefficient of determination.
d. Calculate and interpret an estimate of the error standard deviation σ in the simple linear regression model.
e. The largest x value in the sample considerably exceeds the other x values. What is the effect on the equation of the least squares line of deleting the corresponding observation?

23. Show that under the assumptions of the simple linear regression model, the mles of β₀ and β₁ are identical to the least squares estimates. [Hints: (1) The pdf of Yᵢ is normal with mean μᵢ = β₀ + β₁xᵢ and variance σ²; the likelihood function is the product of the n pdfs. (2) You don't need to differentiate the likelihood function; instead, find the correspondence between that function and the least squares expression g(b₀, b₁).]

24. a. Show that the residuals e₁, …, eₙ satisfy both Σeᵢ = 0 and Σ(xᵢ − x̄)eᵢ = 0. [Hint: Use the last expression for eᵢ in (12.4), along with the fact that for any numbers a₁, …, aₙ, Σ(aᵢ − ā) = 0.]
b. Show that ŷᵢ − ȳ = β̂₁(xᵢ − x̄).
c. Use (a) and (b) to derive the analysis of variance identity for regression, Equation (12.5), by showing that the cross-product term is 0.
d. Use (b) and Equation (12.5) to verify the computational formula for SSE.

25. A regression analysis is carried out with y = temperature, expressed in °C. How do the resulting values of β̂₀ and β̂₁ relate to those obtained if y is re-expressed in °F? Justify your assertion. [Hint: (new yᵢ) = 1.8yᵢ + 32.]

26. Show that b₁ and b₀ of Expressions (12.2) and (12.3) satisfy the normal equations.

27. Show that the "point of averages" (x̄, ȳ) lies on the estimated regression line.

28. Suppose an investigator has data on the amount of shelf space x devoted to display of a particular product and sales revenue y for that product. The investigator may wish to fit a model for which the true
regression line passes through (0, 0). The appropriate model is Y = β₁x + ε. Assume that (x₁, y₁), …, (xₙ, yₙ) are observed pairs generated from this model, and derive the least squares estimator of β₁. [Hint: Write the sum of squared deviations as a function of b₁, a trial value, and use calculus to find the minimizing value of b₁.]

29. a. Consider the data in Exercise 20. Suppose that instead of the least squares line that passes through the points (x₁, y₁), …, (xₙ, yₙ), we wish the least squares line that passes through (x₁ − x̄, y₁), …, (xₙ − x̄, yₙ). Construct a scatterplot of the (xᵢ, yᵢ) points and then of the (xᵢ − x̄, yᵢ) points. Use the plots to explain intuitively how the two least squares lines are related.
b. Suppose that instead of the model Yᵢ = β₀ + β₁xᵢ + εᵢ (i = 1, …, n), we wish to fit a model of the form Yᵢ = β₀* + β₁*(xᵢ − x̄) + εᵢ (i = 1, …, n). What are the least squares estimators of β₀* and β₁*, and how do they relate to β̂₀ and β̂₁?
30. Consider the following three data sets, in which the variables of interest are x = commuting distance and y = commuting time. Based on a scatterplot and the values of se and R², in which situation would simple linear regression be most (least) effective, and why?

        Data set 1        Data set 2        Data set 3
         x     y           x     y           x     y
        15    42           5    16           5     8
        16    35          10    32          10    16
        17    45          15    44          15    22
        18    42          20    45          20    23
        19    49          25    63          25    31
        20    46          50   115          50    60

Sxx      17.50         1270.8333     1270.8333
Sxy      29.50         2722.5        1431.6667
β̂₁        1.685714        2.142295      1.126557
β̂₀       13.666672        7.868852      3.196729
SST     114.83         5897.5        1627.33
SSE      65.10           65.10          14.48
12.3 Inferences About the Regression Coefficient β1
In virtually all of our inferential work thus far, the notion of sampling variability has been pervasive. In particular, properties of sampling distributions of various statistics (X̄, P̂, and so on) have been the basis for developing confidence interval formulas and hypothesis-testing methods. The key idea is that the value of virtually any quantity calculated from sample data (i.e., any statistic) is going to vary from one sample to another.

Example 12.8 Reconsider the data on x = truss height and y = sale price per square foot from n = 19 warehouses in Example 12.4 of the previous section. Suppose the simple linear regression model applies here, with parameter values β1 = 1, β0 = 25, and σ = 1.4 (consistent with the estimates computed previously). To understand the sampling variability of the statistics β̂0 and β̂1, we performed the following simulation 250 times in R:

• Generate random errors ε1, …, ε19 from a normal distribution with mean 0 and standard deviation σ = 1.4.
• Using the 19 xi's from the original data set, generate response values according to the model equation:
yᵢ = β0 + β1xᵢ + εᵢ = 25 + 1·xᵢ + εᵢ,   i = 1, 2, …, 19
• Perform least squares regression on the simulated (xᵢ, yᵢ) pairs to obtain the estimated slope and intercept.

Figure 12.13 shows histograms of the β̂0 and β̂1 values resulting from this simulation. There is clearly variation in values of the estimated slope and estimated intercept. The equation of the LSRL thus also varies from one sample to the next. Note, though, that the estimates are centered close to the true values, an indication of unbiasedness.

Figure 12.13 Histograms approximating the sampling distributions of β̂0 and β̂1
■
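The simulation just described is easy to reproduce. Below is a minimal R sketch of the procedure; it assumes a vector x holding the 19 observed truss heights, and since those raw values are not repeated here, a stand-in sequence (clearly a placeholder) is used instead.

    # Stand-in predictor values; in the example these would be the 19 observed truss heights
    x <- seq(12, 36, length.out = 19)
    reps <- 250
    b0.hat <- numeric(reps)
    b1.hat <- numeric(reps)
    for (j in 1:reps) {
      eps <- rnorm(19, mean = 0, sd = 1.4)   # random errors with sigma = 1.4
      y <- 25 + 1 * x + eps                  # responses from the true model
      fit <- lm(y ~ x)                       # least squares fit to the simulated pairs
      b0.hat[j] <- coef(fit)[1]
      b1.hat[j] <- coef(fit)[2]
    }
    hist(b0.hat); hist(b1.hat)               # compare with Figure 12.13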
The slope β1 of the population regression line is the true change in the mean of the response variable Y associated with a one-unit increase in the explanatory variable x. The slope of the least squares line, β̂1, gives a point estimate of β1. In the same way that a confidence interval for μ and procedures for testing hypotheses about μ were based on properties of the sampling distribution of X̄, inferences about β1 are based on the sampling distribution of β̂1.

The values of the xi's are assumed to be chosen before the study is performed, so only the Yi's are random. The estimators for β0 and β1 are obtained by replacing yi with Yi in (12.2) and (12.3):

$$\hat\beta_1 = \frac{\sum (x_i - \bar x)(Y_i - \bar Y)}{\sum (x_i - \bar x)^2} = \frac{\sum (x_i - \bar x)(Y_i - \bar Y)}{S_{xx}}, \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar x$$

Similarly, the estimator for σ results from replacing each yi in the formula for se by the rv Yi:

$$\hat\sigma = S_e = \sqrt{\frac{\sum (Y_i - \hat Y_i)^2}{n-2}} = \sqrt{\frac{\sum \left(Y_i - [\hat\beta_0 + \hat\beta_1 x_i]\right)^2}{n-2}}$$

The denominator of β̂1, Sxx = Σ(xi − x̄)², depends only on the xi's and not on the Yi's, so it is a constant. Then, because Σ(xi − x̄)Ȳ = Ȳ·Σ(xi − x̄) = Ȳ·0 = 0, the slope estimator can be re-written as

$$\hat\beta_1 = \frac{\sum (x_i - \bar x)Y_i}{S_{xx}} = \sum c_i Y_i \qquad \text{where } c_i = (x_i - \bar x)/S_{xx}$$

That is, the estimator β̂1 is a linear function of the independent rvs Y1, Y2, …, Yn, each of which is normally distributed. Invoking properties of a linear function of random variables discussed in Section 5.3 leads to the following results (Exercise 40).

PROPERTIES OF THE ESTIMATED SLOPE

1. The mean value of β̂1 is E(β̂1) = μ_β̂1 = β1, so β̂1 is an unbiased estimator of β1 (i.e., the distribution of β̂1 is always centered at the true value of β1).
2. The variance and standard deviation of β̂1 are

$$V(\hat\beta_1) = \sigma^2_{\hat\beta_1} = \frac{\sigma^2}{S_{xx}} \qquad\qquad \sigma_{\hat\beta_1} = \frac{\sigma}{\sqrt{S_{xx}}}$$

Replacing σ by its estimate se gives an estimate for σ_β̂1:

$$\hat\sigma_{\hat\beta_1} = s_{\hat\beta_1} = \frac{s_e}{\sqrt{S_{xx}}} = \frac{s_e}{s_x\sqrt{n-1}}$$

3. The estimator β̂1 has a normal distribution, because it is a linear function of independent normal rvs.

Properties 1 and 3 manifest themselves in the β̂1 histogram of Figure 12.13. According to Property 2, the standard deviation of β̂1 equals the standard deviation σ of the random error term—or, equivalently, of any Yi—divided by sx√(n − 1). Because the sample standard deviation sx is a measure of how spread out the xi's are about x̄, we conclude that making observations at xi values that are quite spread out results in a more precise estimator of the slope parameter (smaller variance of β̂1), whereas values of xi all close to each other imply a highly variable estimator. Of course, if the xi's are spread out too far, a linear model may not be appropriate throughout the range of observation. Finally, the presence of n in the denominator of s_β̂1 implies that the estimated slope varies less for larger samples than for smaller ones. We have seen this feature previously in other statistics such as X̄ and P̂: as sample size n increases, the distribution of the statistic "collapses onto" the true value of the corresponding parameter.

Many inferential procedures discussed previously were based on standardizing an estimator by first subtracting its mean value and then dividing by its estimated standard deviation. In particular, test procedures and a CI for the mean μ of a normal population utilized the fact that the standardized variable (X̄ − μ)/(S/√n) has a t distribution with n − 1 df. A similar result here provides the key to further inferences concerning β1.
THEOREM   The assumptions of the simple linear regression model imply that the standardized variable

$$T = \frac{\hat\beta_1 - \beta_1}{S_{\hat\beta_1}} = \frac{\hat\beta_1 - \beta_1}{S_e/\sqrt{S_{xx}}}$$

has a t distribution with n − 2 df.

The T ratio can be written as

$$T = \frac{(\hat\beta_1 - \beta_1)\big/\left(\sigma/\sqrt{S_{xx}}\right)}{\sqrt{\dfrac{(n-2)S_e^2/\sigma^2}{n-2}}}$$

The theorem is a consequence of the following facts: (1) (β̂1 − β1)/(σ/√Sxx) ~ N(0, 1), (2) (n − 2)Se²/σ² ~ χ²ₙ₋₂, and (3) β̂1 is independent of Se. That is, T is a standard normal rv divided by the square root of an independent chi-squared rv over its df, and so has the specified t distribution.

A Confidence Interval for β1

As in the derivation of previous CIs, we begin with a probability statement:

$$P\!\left(-t_{\alpha/2,n-2} < \frac{\hat\beta_1 - \beta_1}{S_{\hat\beta_1}} < t_{\alpha/2,n-2}\right) = 1 - \alpha$$

Manipulation of the inequalities inside the parentheses to isolate β1 and substitution of estimates in place of the estimators gives the following CI formula.

A 100(1 − α)% CI for the slope β1 of the true regression line has endpoints

$$\hat\beta_1 \pm t_{\alpha/2,n-2}\, s_{\hat\beta_1} = \hat\beta_1 \pm t_{\alpha/2,n-2}\, \frac{s_e}{\sqrt{S_{xx}}} = \hat\beta_1 \pm t_{\alpha/2,n-2}\, \frac{s_e}{s_x\sqrt{n-1}}$$

This interval has the same general form as did many of our previous intervals. It is centered at the point estimate of the parameter, and the amount it extends out to each side of the estimate depends on the desired confidence level (through the t critical value) and on the amount of variability in the estimator β̂1 (through s_β̂1, which will tend to be small when there is little variability in the distribution of β̂1 and large otherwise).
Example 12.9 The scatterplot in Figure 12.14 shows the size (x, in square feet) and monthly rent (y, in dollars) for a random sample of n = 77 two-bedroom apartments in Omaha, NE (courtesy of www.zillow.com/omaha-ne/rentals). The plot suggests, not surprisingly, that rent generally increases with apartment size, and that for any fixed apartment size there is variability in monthly rents.

Figure 12.14 Scatterplot of the data from Example 12.9: y = Rent ($) versus x = Size (sq. ft.)

Summary quantities include

x̄ = 1023.5    sx = 161.9    Sxx = 1,991,569    Sxy = 1,042,971
ȳ = 1006.6    sy = 113.3    Syy = 975,840

from which β̂1 = .5237, β̂0 = 470.6, SST = Syy = 975,840, SSE = 429,642, and R² = .5597. Roughly 56% of the observed variation in monthly rent can be attributed to the simple linear regression model relationship between rent and apartment size. The remaining 44% of rent variation is due to other apartment features, such as neighborhood, nicer appliances, or dedicated parking. Error df is n − 2 = 77 − 2 = 75, giving s²e = 429,642/75 = 5728.56 and se = 75.69. The estimated standard deviation of β̂1 is

$$s_{\hat\beta_1} = \frac{s_e}{\sqrt{S_{xx}}} = \frac{75.69}{\sqrt{1{,}991{,}569}} = .0536$$

The t critical value for a confidence level of 95% is t.025,75 = 1.992, so a 95% CI for β1 is

.5237 ± 1.992(.0536) = (.4169, .6305)

With a high degree of confidence, we estimate that a one square foot increase in an apartment's size is associated with an increase between $.4169 and $.6305 in the expected monthly rent. This applies to the population of all two-bedroom apartments in Omaha. Multiplying by 100 gives $41.69 to $63.05 as the increase in expected rent associated with a 100 ft² size increase.

Looking at the R output of Figure 12.15, we find the value of s_β̂1 under Coefficients as the second number in the standard error column, while the value of se is displayed as the residual standard error. There is also an estimated standard error for the statistic β̂0. For all of the statistics, compare the values in the R output with the values calculated above.
Call:
lm(formula = Rent ~ Size, data = Omaha)

Residuals:
     Min       1Q   Median       3Q      Max
-144.314  -52.306    1.658   40.635  189.477

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 470.62001   55.56499   8.470 1.52e-12 ***
Size          0.52369    0.05363   9.765 5.30e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 75.69 on 75 degrees of freedom
Multiple R-squared: 0.5597,    Adjusted R-squared: 0.5539
F-statistic: 95.35 on 1 and 75 DF,  p-value: 5.302e-15

Figure 12.15 R output for the data of Example 12.9
■
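As a quick check on the hand computation above, the slope CI can also be requested directly from R. A minimal sketch, assuming the fitted object of Figure 12.15 (Omaha is the data frame named in that output):

    fit <- lm(Rent ~ Size, data = Omaha)   # the fit summarized in Figure 12.15
    confint(fit, level = 0.95)             # 95% CIs for the intercept and the slope
    # The "Size" row should reproduce (.4169, .6305) up to rounding.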
Hypothesis-Testing Procedures

As before, the null hypothesis in a test about β1 will be an equality statement. The null value of β1 claimed true by the null hypothesis will be denoted by β10 (read "beta one naught," not "beta ten"). The test statistic results from replacing β1 in the standardized variable T by the null value β10—that is, from standardizing β̂1 under the assumption that H0 is true. The test statistic thus has a t distribution with n − 2 df when H0 is true, so the type I error probability is controlled at the desired level α by using an appropriate t critical value.

The most commonly encountered pair of hypotheses about β1 is H0: β1 = 0 versus Ha: β1 ≠ 0, in which case the test statistic value is the t ratio t = β̂1/s_β̂1. When this null hypothesis is true, E(Y|x) = β0 + 0·x = β0, independent of x, so knowledge of x gives no information about the value of the response variable. A test of these two hypotheses is often referred to as the model utility test in simple linear regression. Unless n is quite small, H0 will be rejected and the utility of the model confirmed precisely when R² is reasonably large. The simple linear regression model should not be used for further inferences, such as estimates of mean value or predictions of future values (the topics of Section 12.4), unless the model utility test results in rejection of H0 for a suitably small α.
Example 12.10 How is the perceived risk of an investment related to its expected return? Intuitively it would seem as though riskier investments would be associated with higher expected returns. The article "Affect in a Behavioral Asset-Pricing Model" (Fin. Anal. J., March/April 2008: 20–29) reported on an experiment in which each member of a group of investors rated the risk of a company's stock on a 10-point scale ranging from low to high, and members of a different group rated the future return of the stock on the same scale. This was done for a total of 210 companies, and for each one both a risk score x and an expected return score y resulted from averaging responses from the individual raters. The following data is from a subset of ten of these companies (listed for convenience in increasing order of risk):

x | 4.3 | 4.6 | 5.2 | 5.3 | 5.5 | 5.7 | 6.1 | 6.3 | 6.8 | 7.5
y | 7.7 | 5.2 | 7.9 | 5.8 | 7.2 | 7.0 | 5.3 | 6.8 | 6.6 | 4.7
The scatterplot of the data for these ten companies in Figure 12.16 shows a weak (R² ≈ .18) but also surprising negative relationship between the two variables. Let's carry out the model utility test at significance level α = .05 (the scatterplot does not bode well for the model, but stay tuned for the rest of the story).

Figure 12.16 Scatterplot for the data in Example 12.10 (expected return score versus risk score)
The parameter of interest is β1 = true change in expected return score associated with a one-point increase in risk score. The null hypothesis H0: β1 = 0 will be rejected in favor of Ha: β1 ≠ 0 if the observed test statistic t satisfies either t ≥ t_{α/2,n−2} = t.025,8 = 2.306 or t ≤ −2.306. Partial Excel output (software not favorably regarded in the statistical community) for this example appears in Figure 12.17. In the output, β̂1 = −.4913 and s_β̂1 = .3614, so the test statistic is

t = (−.4913 − 0)/.3614 ≈ −1.36   (also on output)
SUMMARY OUTPUT

             df   SS        MS       F        Significance F
Regression    1    2.0714   2.0714   1.8485   0.2110
Residual      8    8.9646   1.1206
Total         9   11.0360

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    9.2353        2.0975            4.4029   0.0023     4.3983     14.0722
Risk        -0.4913        0.3614           -1.3596   0.2110    -1.3246      0.3420

Figure 12.17 Partial Excel output for Example 12.10
Since −1.36 does not fall in the rejection region, H0 is not rejected at the .05 level. Equivalently, the two-sided P-value is double the area under the t8 curve to the left of −1.36, which Excel reports as roughly .211; since .211 > .05, again H0 is not rejected. Excel also provides a 95% confidence interval of (−1.3246, 0.3420) for β1. This is consistent with the results of the hypothesis test: since the value 0 lies in the interval, we have no reason to reject the claim H0: β1 = 0.

Is there truly no relationship, or have we committed a type II error here? With just n = 10 observations, it is quite possible we failed to detect a relationship, because hypothesis tests do not have much power when the sample size is small. In fact, the authors of the original study examined 210 companies on these same two variables, resulting in an estimated slope of β̂1 = −0.4 (similar to our sample) but with an estimated standard error of roughly .0556. The resulting test statistic value is t = −7.2 at 208 df, which is highly statistically significant. The authors concluded, based on their larger sample, that risk is a useful predictor of future return—although, contrary to intuition, the association between the two appears to be negative. Even though the relationship was statistically significant, note that—even in the full sample—risk only accounted for R² = .185 = 18.5% of the variation in future returns. As in the case of previous test procedures, a large sample size can result in rejection of H0 even though the data suggests that the departure from H0 is of little practical significance. ■

Regression and ANOVA

The splitting of the total sum of squares SST = Syy = Σ(yᵢ − ȳ)² into a part SSE, which measures unexplained variation, and a part SSR, which measures variation explained by the linear relationship, is strongly reminiscent of one-way ANOVA. In fact, H0: β1 = 0 can alternatively be tested against Ha: β1 ≠ 0 by constructing an ANOVA table (Table 12.1) and rejecting H0 if f ≥ F_{α,1,n−2}.

Table 12.1 ANOVA table for simple linear regression

Source of variation   df      Sum of squares   Mean square         F ratio
Regression             1      SSR              MSR = SSR/1         f = MSR/MSE
Error                 n − 2   SSE              MSE = SSE/(n − 2)
Total                 n − 1   SST
The square root of the mean squared error (MSE) is se, the residual standard deviation. The F test gives exactly the same result as the model utility t test because t² = f and t²_{α/2,n−2} = F_{α,1,n−2}. Virtually all computer packages that have regression options include such an ANOVA table in the output. For example, Figure 12.17 shows ANOVA output for the data of Example 12.10. The ANOVA table at the top of the output has f = 1.8485 with a P-value of .211 for the model utility test. The table of parameter estimates gives t = −1.3596, again with P = .211, and t² = (−1.3596)² = 1.8485 = f. Note that this F test is only for H0: β1 = 0 versus Ha: β1 ≠ 0; if the alternative hypothesis is one-sided (> or <), the t test must be used.
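The equivalence t² = f is easy to verify with software. A minimal R sketch, again assuming the Omaha fit of Figure 12.15 (any fitted simple linear regression object would serve):

    fit <- lm(Rent ~ Size, data = Omaha)
    anova(fit)                      # F ratio for the model utility test
    summary(fit)$coefficients       # t ratio for the slope
    # For these data the slope t ratio is 9.765 and the F ratio is 95.35;
    # squaring gives 9.765^2 = 95.35, illustrating t^2 = f.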
12.4 Inferences for the (Mean) Response
Throughout this section we will let Ŷ denote the statistic Ŷ = β̂0 + β̂1x* with observed value ŷ, where x* denotes a specified value of the explanatory variable x. Once the estimates β̂0 and β̂1 have been calculated, ŷ can be regarded either as a point estimate of μ_{Y|x*} (the expected or true average value of Y when x = x*) or as a prediction of the Y value that will result from a single new observation made when x = x*. The point estimate or prediction by itself gives no information concerning how precisely μ_{Y|x*} has been estimated or Y has been predicted. This can be remedied by developing a CI for μ_{Y|x*} and a prediction interval (PI) for a single future Y value.
Before we obtain sample data, both β̂0 and β̂1 are subject to sampling variability—that is, they are both statistics whose values will vary from sample to sample. This variability was shown in Figure 12.13 at the beginning of Section 12.3. Suppose, for example, that β0 = 50 and β1 = 2. Then a first sample of (x, y) pairs might give β̂0 = 52.35 and β̂1 = 1.895, a second sample might result in β̂0 = 46.52 and β̂1 = 2.056, and so on. It follows that Ŷ itself varies in value from sample to sample. If the intercept and slope of the population line are the aforementioned values 50 and 2, respectively, and x* = 10, then this statistic is trying to estimate the value μ_{Y|10} = 50 + 2(10) = 70. The estimate from a first sample might be ŷ = 52.35 + 1.895(10) = 71.30, from a second sample might be ŷ = 46.52 + 2.056(10) = 67.08, etc. In the same way that a confidence interval for β1 was based on properties of the sampling distribution of β̂1, a confidence interval for a mean y value in regression is based on properties of the sampling distribution of the statistic Ŷ.

Substitution of the expressions for β̂0 and β̂1 into Ŷ, followed by some algebraic manipulation, leads to the representation of Ŷ as a linear function of the Yi's:

$$\hat Y = \hat\beta_0 + \hat\beta_1 x^* = \sum_{i=1}^n \left[\frac{1}{n} + \frac{(x^* - \bar x)(x_i - \bar x)}{S_{xx}}\right] Y_i = \sum_{i=1}^n d_i Y_i
\qquad \text{where } d_i = \frac{1}{n} + \frac{(x^* - \bar x)(x_i - \bar x)}{S_{xx}}$$

The coefficients d1, d2, …, dn in this linear function involve the xi's and x*, all of which are fixed. Application of the rules of Section 5.3 to this linear function gives the following properties. (Exercise 52 requests a derivation of Property 2.)

SAMPLING DISTRIBUTION OF Ŷ

Let Ŷ = β̂0 + β̂1x*, where x* is some fixed value of x. Then

1. The mean value of Ŷ is E(Ŷ) = E(β̂0 + β̂1x*) = β0 + β1x* = μ_{Y|x*}. Thus β̂0 + β̂1x* is an unbiased estimator for β0 + β1x* (i.e., for μ_{Y|x*}).
2. The variance of Ŷ is

$$V(\hat Y) = \sigma^2_{\hat Y} = \sigma^2\left[\frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}\right]$$

and the standard deviation σ_Ŷ is the square root of this expression. The estimated standard deviation of Ŷ, denoted by s_Ŷ or s_{β̂0+β̂1x*}, results from replacing σ by its estimate se:

$$s_{\hat Y} = s_{\hat\beta_0 + \hat\beta_1 x^*} = s_e\sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}}$$

3. Ŷ has a normal distribution, because it is a linear function of the Yi's, which are normally distributed and independent.
The variance of Ŷ is smallest when x* = x̄ and increases as x* moves away from x̄ in either direction. Thus the estimator of μ_{Y|x*} is more precise when x* is near the center of the xi's than when it is far from the x values where observations have been made. This implies that both the CI and PI are narrower for an x* near x̄ than for an x* far from x̄. Most statistical computer packages provide both ŷ and s_Ŷ for any specified x* upon request.

Inferences Concerning the Mean Response

Just as inferential procedures for β1 were based on the t variable obtained by standardizing, a t variable obtained by standardizing Ŷ = β̂0 + β̂1x* leads to a CI and test procedures here.

THEOREM   The variable

$$T = \frac{\hat Y - \mu_{Y|x^*}}{S_{\hat Y}} = \frac{\hat\beta_0 + \hat\beta_1 x^* - (\beta_0 + \beta_1 x^*)}{S_{\hat\beta_0 + \hat\beta_1 x^*}} \qquad (12.6)$$

has a t distribution with n − 2 df.

As was the case for β1 in the previous section, a probability statement involving this standardized variable can be manipulated to yield a confidence interval for μ_{Y|x*}.

A 100(1 − α)% CI for μ_{Y|x*}, the mean/expected value of Y when x = x*, has endpoints

$$\hat y \pm t_{\alpha/2,n-2}\, s_{\hat Y} = (\hat\beta_0 + \hat\beta_1 x^*) \pm t_{\alpha/2,n-2}\, s_e\sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}} \qquad (12.7)$$

This CI is centered at the point estimate for μ_{Y|x*} and extends out to each side by an amount that depends on the confidence level and on the extent of variability in the estimator on which the point estimate is based.

Example 12.11 Refer back to the Omaha apartment data of Example 12.9, where the response variable was monthly rent and the predictor was square footage. Let's now calculate a confidence interval, using a 95% confidence level, for the mean rent for all 1200 ft² two-bedroom apartments in Omaha—that is, a confidence interval for μ_{Y|1200} = β0 + β1(1200). The interval is centered at

ŷ = β̂0 + β̂1(1200) = 470.6 + .5237(1200) = $1099.04

The estimated standard deviation of the statistic Ŷ at x = x* = 1200 is

$$s_{\hat Y} = s_e\sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}} = 75.69\sqrt{\frac{1}{77} + \frac{(1200 - 1023.5)^2}{1{,}991{,}569}} = 12.807$$

The 75 df t critical value for a 95% confidence level is 1.992, from which we determine the desired interval to be
1099.04 ± 1.992(12.807) = (1073.54, 1124.57)

At the 95% confidence level, we estimate that the average monthly rent of all 1200 square foot, two-bedroom apartments in Omaha is between $1073.54 and $1124.57. Remember that if we recalculated this interval for sample after sample, in the long run about 95% of the calculated intervals would include β0 + β1(1200). We hope that this true mean value lies in the single interval that we have calculated.

For the population of all two-bedroom apartments in Omaha of size 1050 square feet, similar calculations result in ŷ = 1020.50, s_Ŷ = 8.742, and 95% CI = (1003.08, 1037.91). Notice that not only is the expected rent lower for 1050 ft² apartments than for 1200 ft² apartments (no surprise there), but the estimated standard error is also smaller. That's because x* = 1050 is closer to the sample mean of x̄ = 1023.5 square feet than is x* = 1200. Figure 12.18 shows a JMP scatterplot with the LSRL and curves corresponding to the confidence limits for each different x value.
Figure 12.18 JMP scatterplot with confidence limits for the data of Example 12.11
■
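In R, these intervals require no hand computation; predict() evaluates (12.7) at any requested x*. A minimal sketch, assuming the Omaha fit of Figure 12.15:

    fit <- lm(Rent ~ Size, data = Omaha)
    new.sizes <- data.frame(Size = c(1200, 1050))
    predict(fit, newdata = new.sizes, interval = "confidence", level = 0.95)
    # Each row gives the point estimate and 95% CI for mean rent at that size;
    # the rows should match (1073.54, 1124.57) and (1003.08, 1037.91) up to rounding.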
In some situations, a CI is desired not just for a single x value but for two or more x values, and we must proceed with caution in interpreting the confidence levels of our intervals. For example, in Example 12.11 two 95% CIs were constructed, one for lY|1200 and another for lY|1050. The joint or simultaneous confidence level—the long-run proportion of time under repeated sampling that both CIs would contain their respective parameters—is less than 95%. While it is difficult to determine the exact simultaneous confidence level, Bonferroni’s inequality (Chapter 8, Exercise 91) established that if two 100(1 − a)% CI’s are computed, then the joint confidence level of the resulting pair of intervals is at least 100(1 – 2a)%. Thus, in Example 12.11 we are at least 90% confident (because a = .05 and 1 – 2a = 1 – .10 = .90) that the two statements 1073.54 < lY|1200 < 1124.57 and 1003.08 < lY|1050 < 1037.91 are both true.
More generally, a set of m intervals each with confidence level 100(1 − α)% is guaranteed to have simultaneous confidence of at least 100(1 − mα)%. This relationship can be reversed to achieve a desired joint confidence level: replacing α with α/m, if individual 100(1 − α/m)% CIs are constructed for each of m parameters, the resulting simultaneous confidence level is at least 100(1 − α)%. For example, if we desired intervals for both μ_{Y|1200} and μ_{Y|1050} with (at least) 95% joint confidence, then each individual CI should be calculated using the t critical value corresponding to confidence coefficient 1 − α/m = 1 − .05/2 = .975.

Tests of hypotheses about μ_{Y|x*} are based on the test statistic T obtained by replacing μ_{Y|x*} with the null value μ0 in the numerator of (12.6). For example, the assertion H0: μ_{Y|1200} = $1100 in Example 12.11 says that the mean rent for all 1200 ft² apartments in the population is $1100 per month. The test statistic value is then t = (ŷ − 1100)/s_Ŷ, and the test is upper-, lower-, or two-tailed according to the inequality in Ha.

A Prediction Interval for a Future Value of Y

Analogous to the CI (12.7) for μ_{Y|x*}, one frequently wishes to obtain an interval of plausible values for the value of Y associated with a single future observation when the explanatory variable has value x*. In Example 12.11, a CI was computed for the true mean rent of all apartments of a certain size, but an individual renter will be more interested in knowing a realistic range of rent values for a single such apartment. A CI estimates a parameter, or population characteristic, whose value is fixed but unknown to us. In contrast, a future value of Y is not a parameter but instead a random variable; for this reason we refer to an interval of plausible values for a future Y as a prediction interval (PI) rather than a confidence interval. (Section 8.2 presented a method for constructing a one-sample t prediction interval for a single future value of a variable.)

When estimating μ_{Y|x*}, the estimation error, Ŷ − μ_{Y|x*}, is the difference between a random variable and a fixed but unknown quantity. In contrast, the prediction error is Ŷ − Y = Ŷ − (β0 + β1x* + ε), a difference between two random variables. With the additional random ε term, there is more uncertainty in prediction than in estimation. As a consequence, a PI will be wider than a CI for the same x* value. Because the future value Y is independent of the observed Yi's that determine Ŷ,

V(Ŷ − Y) = variance of prediction error
         = V(Ŷ) + V(Y)                          [independence]
         = V(Ŷ) + V(ε)                          [because β0 + β1x* is a constant]
         = σ²[1/n + (x* − x̄)²/Sxx] + σ²
         = σ²[1 + 1/n + (x* − x̄)²/Sxx]

Furthermore, because E(Y) = β0 + β1x* and E(Ŷ) = β0 + β1x*, the expected value of the prediction error is E(Ŷ − Y) = 0. It can then be shown that the standardized variable

$$T = \frac{Y - \hat Y}{S_e\sqrt{1 + \dfrac{1}{n} + \dfrac{(x^* - \bar x)^2}{S_{xx}}}}$$
has a t distribution with n − 2 df. Substituting this expression for T into the probability statement P(−t_{α/2,n−2} < T < t_{α/2,n−2}) = 1 − α and manipulating to isolate Y between the two inequalities yields the following interval.

A 100(1 − α)% PI for a future Y observation to be made when x = x* has endpoints

$$\hat y \pm t_{\alpha/2,n-2}\sqrt{s_e^2 + s_{\hat Y}^2} = (\hat\beta_0 + \hat\beta_1 x^*) \pm t_{\alpha/2,n-2}\, s_e\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}} \qquad (12.8)$$
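Statistical software evaluates (12.8) directly, so the formula rarely needs to be computed by hand. A minimal R sketch, again assuming the Omaha fit of Figure 12.15 (the same interval is worked out by hand in Example 12.12 below):

    fit <- lm(Rent ~ Size, data = Omaha)
    predict(fit, newdata = data.frame(Size = 1200),
            interval = "prediction", level = 0.95)
    # Returns the point prediction and the 95% PI for a single new 1200 sq. ft. apartment.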
The interpretation of the prediction level 100(1 − α)% is identical to that of previous confidence levels—if (12.8) is used repeatedly, in the long run the resulting intervals will actually contain the observed future y values 100(1 − α)% of the time. Notice that the 1 underneath the square root symbol makes the PI (12.8) wider than the CI (12.7), although the intervals are both centered at ŷ. Also, as n → ∞ the width of the CI approaches 0, whereas the width of the PI approaches 2z_{α/2}σ (because even with perfect knowledge of β0 and β1, there will still be uncertainty in prediction).

Example 12.12 (Example 12.11 continued) Let's calculate a 95% prediction interval for the monthly rent of a single 1200 square foot, two-bedroom apartment in Omaha. Relevant quantities from Example 12.11 are

ŷ = 1099.04    s_Ŷ = 12.807    se = 75.69    t.025,75 = 1.992

The prediction interval is then

1099.04 ± 1.992√(75.69² + 12.807²) = 1099.04 ± 1.992(76.77) = (946.13, 1251.97)
Plausible values for the monthly rent of a 1200 ft² apartment are, at the 95% prediction level, between $946.13 and $1251.97. The 95% confidence interval for the mean rent of all such apartments was (1073.54, 1124.57). The prediction interval is much wider than this because of the extra 75.69² under the square root. Since apartments of the same size will vary in rent—the estimated sd of rents for apartments of any fixed size is se = $75.69—there is necessarily much greater uncertainty in the cost of a single apartment than in the average cost of all such apartments. ■

The Bonferroni technique can be employed as in the case of confidence intervals. If a PI with prediction level 100(1 − α/m)% is calculated at each of m different x values, the simultaneous or joint prediction level for all m intervals will be at least 100(1 − α)%.

Exercises: Section 12.4 (43–52)

43. Global warming is a major issue, and CO2 emissions are an important part of the discussion. The article "Effects of Atmospheric CO2 Enrichment on Biomass Accumulation and Distribution in Eldarica Pine Trees" (J. Exp. Bot. 1994: 345–349)
describes the results of growing pine trees with increasing levels of CO2 in the air. Here are the observations with x = atmospheric concentration of CO2 (parts per million) and y = mass in kilograms after 11 months of the experiment.
x | 408 | 408 | 554 | 554 | 680 | 680 | 812 | 812
y | 1.1 | 1.3 | 1.6 | 2.5 | 3.0 | 4.3 | 4.2 | 4.7
Software calculates se = .534; ŷ = 2.723 and s_Ŷ = .190 when x = 600; and ŷ = 3.992 and s_Ŷ = .256 when x = 750.
a. Explain why s_Ŷ is larger when x = 750 than when x = 600.
b. Calculate a confidence interval with a confidence level of 95% for the true average mass of all trees grown with a CO2 concentration of 600 parts per million.
c. Calculate a prediction interval with a prediction level of 95% for the mass of a tree grown with a CO2 concentration of 600 parts per million.
d. If a 95% CI is calculated for the true average mass when CO2 concentration is 750, what will be the simultaneous confidence level for both this interval and the interval calculated in part (b)?

44. Reconsider the filtration rate–moisture content data introduced in Example 12.6.
a. Compute a 90% CI for β0 + 125β1, true average moisture content when the filtration rate is 125.
b. Predict the value of moisture content for a single experimental run in which the filtration rate is 125 using a 90% prediction level. How does this interval compare to the interval of part (a)? Why is this the case?
c. How would the intervals of parts (a) and (b) compare to a CI and PI when filtration rate is 115? Answer without actually calculating these new intervals.
d. Interpret both H0: β0 + 125β1 = 80 and Ha: β0 + 125β1 < 80, and then carry out a test at significance level .01.

45. Astringency is the quality in a wine that makes the wine drinker's mouth feel slightly rough, dry, and puckery. The paper "Analysis of Tannins in Red Wine Using Multiple Methods: Correlation with Perceived Astringency" (Amer. J. Enol. Vitic. 2006: 481–485) reported on an investigation to assess the relationship between perceived astringency and tannin concentration using various analytic methods. Here is data provided by the authors on x = tannin concentration by protein precipitation and y = perceived astringency as determined by a panel of tasters.
Multiple Methods: Correlation with Perceived Astringency” (Amer. J. Enol. Vitic. 2006: 481–485) reported on an investigation to assess the relationship between perceived astringency and tannin concentration using various analytic methods. Here is data provided by the authors on x = tannin concentration by protein precipitation and y = perceived astringency as determined by a panel of tasters. x 0.718
0.808
0.924 1.000
0.667
0.529
0.514
0.559
y 0.428
0.480
0.493 0.978
0.318
0.298 −0.224
0.198
x 0.766
0.562
0.470
0.726 0.762
0.666
0.378
0.779
y 0.326 −0.336
0.765 0.190
0.066 −0.221 −0.898
0.836
x 0.674
0.858
0.406 0.927
0.311
0.687
y 0.126
0.305 −0.577 0.779 −0.707 −0.610 −0.648 −0.145
x 0.907
0.638
0.234 0.781
0.326
0.319
0.433
0.518
0.319
0.238
y 1.007 −0.090 −1.132 0.538 −1.098 −0.581 −0.862 −0.551
Relevant summary quantities are as follows:

Σxᵢ = 19.404   Σyᵢ = .549   Sxx = 1.48193150   Syy = 11.82637622   Sxy = 3.83071088
a. Fit the simple linear regression model to this data. Then determine the proportion of observed variation in astringency that can be attributed to the model relationship between astringency and tannin concentration. b. Calculate and interpret a confidence interval for the slope of the true regression line. c. Estimate true average astringency when tannin concentration is .6, and do so in a way that conveys information about reliability and precision. d. Predict astringency for a single wine sample whose tannin concentration is .6, and do so in a way that conveys information about reliability and precision. e. Is there compelling evidence for concluding that true average astringency is
positive when tannin concentration is .7? State and test the appropriate hypotheses.

46. The simple linear regression model provides a very good fit to the data on rainfall and runoff volume given in Exercise 17 of Section 12.2. The equation of the least squares line is ŷ = −1.128 + .82697x, R² = .975, and se = 5.24.
a. Use the fact that s_Ŷ = 1.44 when rainfall volume is 40 m³ to predict runoff in a way that conveys information about reliability and precision. Does the resulting interval suggest that precise information about the value of runoff for this future observation is available? Explain your reasoning.
b. Calculate a PI for runoff when rainfall is 50 using the same prediction level as in part (a). What can be said about the simultaneous prediction level for the two intervals you have calculated?

47. A simple linear regression is performed on y = salary ($1000s) and x = years of experience for actuaries. You are told that a 95% CI for the mean salary of actuaries with five years of experience, based on a sample of n = 10 observations, is (92.1, 117.7). Calculate a CI with confidence level 99% for the mean salary of actuaries with five years of experience.

48. Refer to Exercise 19 in which x = iodine value in grams and y = cetane number for a sample of 14 biofuels.
a. Software gives s_Ŷ = .802 when x = 80 and s_Ŷ = 1.074 when x = 120. Explain why one is much larger than the other.
b. Calculate a 95% CI for expected cetane number when the iodine value is 80 g.
c. Calculate a 95% PI for the cetane number of a single biofuel with iodine value 120 g.

49. The article "Optimization of HVAC Control to Improve Comfort and Energy Performance in a School" (Energy Engr. 2008: 6–22) gives an analysis of the electrical and
gas costs for a high school in Austin, Texas after a new heating and air conditioning (HVAC) system was installed. The accompanying data on x = average outside air temperature (°F) and y = electricity consumption (kWh) for a sample of n = 20 months was read from a graph in the article.

x |   48 |   53 |   56 |    58 |    58
y | 8200 | 7600 | 8000 | 10000 | 10400

x |    59 |    59 |   60 |   68 |   69
y | 10200 | 11000 | 9500 | 9800 | 8500

x |    69 |    70 |    73 |    75 |    79
y | 11100 | 11800 | 12000 | 12200 | 11100

x |    80 |    80 |    84 |    87 |    88
y | 11400 | 13600 | 10000 | 14000 | 12500

Summary quantities include x̄ = 68.65, Sxx = 2692.55, ȳ = 10645, Syy = 60,089,500, Sxy = 303,515, β̂₀ = 2906, β̂₁ = 112.7.
a. Does the simple linear regression model specify a useful relationship between outside temperature and electricity consumption? b. Estimate the true change in expected energy consumption associated with a 1 °F increase in outside temperature using a 95% confidence interval, and interpret the interval. c. Calculate a 95% CI for lY|70, the true average monthly energy consumption when temperature = 70 °F. d. Calculate a 95% PI for a single future observation on energy consumption to be made when temperature = 70 °F. e. Would the 95% CI and PI when temperature = 85 °F be wider or narrower than the corresponding intervals of parts (c) and (d)? Answer without actually computing the intervals. f.
Would you recommend calculating a 95% PI for an outside temperature of 95 °F? Explain.
g. Calculate simultaneous CIs for true average monthly energy consumption when outside temperature is 60, 70, and 80 °F, respectively. Your simultaneous confidence level should be at least 97%.

50. Consider the following four intervals based on the data of the previous exercise:
• A 95% CI for energy consumption when temp = 60
• A 95% PI for energy consumption when temp = 60
• A 95% CI for energy consumption when temp = 72
• A 95% PI for energy consumption when temp = 72
Without computing any of these intervals, what can be said about their widths relative to each other?

51. Many parts of the USA are experiencing increased erosion along waterways due to increased flow from global climate change. The report "Evaluation and Assessment of Environmentally Sensitive Stream Bank Protection Measures" (Transp. Resour. Board 2016) gives the following data on x = shear stress (lb/ft²) and y = erosion depth (ft) for six experimental test trays at Colorado State University, built to re-create real stream conditions.

x | 0.75 | 1.50 | 1.70 | 1.61 | 2.43 | 3.24
y |  .01 |  .06 |  .10 |  .03 |  .13 |  .24

a. Construct a scatterplot. Does the simple linear regression model appear to be plausible?
b. Carry out a test of model utility.
c. Estimate true average erosion depth when shear stress is 1.75 lb/ft² by giving an interval of plausible values.
d. Estimate erosion depth along a single stream where water flow creates a shear stress of 1.75 lb/ft² by giving an interval of plausible values.

52. Verify that V(β̂0 + β̂1x) is indeed given by the expression in the text. [Hint: V(Σ dᵢYᵢ) = Σ dᵢ²V(Yᵢ).]
12.5 Correlation
In many situations, the objective in studying the joint behavior of two variables is simply to see whether they are related, rather than to use one to predict the value of the other. In this section, we first develop the sample correlation coefficient r as a measure of how strongly related two variables x and y are in a sample, and then we relate r to the correlation coefficient ρ defined in Chapter 5.

The Sample Correlation Coefficient r

Given n pairs of observations (x1, y1), …, (xn, yn), it is natural to speak of x and y having a positive relationship if large x's are paired with large y's and small x's with small y's. Similarly, if large x's are paired with small y's and vice versa, then a negative relationship between the variables is implied. Consider standardizing each x value in the sample, i.e., replacing each xi with (xi − x̄)/sx. Now do the same thing with the yi's to obtain the standardized y values (yi − ȳ)/sy. Our proposed measure of the direction and strength of the relationship between the x's and y's involves the sum of the products of these standardized values.
DEFINITION   The sample correlation coefficient for the n pairs (x1, y1), …, (xn, yn) is

$$r = \frac{1}{n-1}\sum_{i=1}^n \left(\frac{x_i - \bar x}{s_x}\right)\left(\frac{y_i - \bar y}{s_y}\right) = \frac{S_{xy}}{(n-1)s_x s_y} = \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}} \qquad (12.9)$$

The denominator of Expression (12.9) is clearly positive, so the sign of r (+ or −) is determined by the numerator Sxy. If the relationship between x and y is strongly positive, an xi above the mean x̄ will tend to be paired with a yi above the mean ȳ, so that (xi − x̄)(yi − ȳ) > 0, and this same product will also be positive whenever both xi and yi are below their respective means (a negative times a negative equals a positive). Thus a positive relationship implies that Sxy will be positive. An analogous argument shows that when the relationship is negative, Sxy will be negative, since most of the products (xi − x̄)(yi − ȳ) will be negative. This is illustrated in Figure 12.19.
Figure 12.19 (a) Scatterplot with r and Sxy positive; (b) scatterplot with r and Sxy negative [+ means (xi − x̄)(yi − ȳ) > 0, and − means (xi − x̄)(yi − ȳ) < 0]
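In R the sample correlation coefficient is computed with cor(), which implements Expression (12.9). A minimal sketch using small made-up vectors (the values below are illustrative only, not data from the text):

    x <- c(1.2, 2.5, 3.1, 4.8, 5.0)                 # illustrative data, not from the text
    y <- c(2.0, 2.9, 3.7, 5.4, 5.1)
    cor(x, y)                                        # sample correlation coefficient r
    sum(scale(x) * scale(y)) / (length(x) - 1)       # same value via standardized products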
The most important properties of r are as listed below. PROPERTIES OF r
1. The value of r does not depend on which of the two variables is labeled x and which is labeled y.
2. The value of r is independent of the units in which x and y are measured. In particular, r itself is unitless.
3. The square of the sample correlation coefficient gives the value of the coefficient of determination that would result from fitting the simple linear regression model—in symbols, r² = R².
4. −1 ≤ r ≤ 1.
5. r = ±1 if and only if all (xi, yi) pairs lie on a straight line.
Proof Property 1 should be evident. Property 2 is a direct result of standardizing the two variables; Exercise 64 asks for a formal verification. To prove Property 3, recall that R² can be expressed as the ratio SSR/SST, where SSR = Σ(ŷᵢ − ȳ)² and SST = Syy = Σ(yᵢ − ȳ)². It is easily shown (see Exercise 24(b)) that ŷᵢ − ȳ = β̂1(xᵢ − x̄), and therefore

$$\mathrm{SSR} = \sum(\hat y_i - \bar y)^2 = \hat\beta_1^2 \sum(x_i - \bar x)^2 = \left(\frac{S_{xy}}{S_{xx}}\right)^2 S_{xx} = \frac{S_{xy}^2}{S_{xx}}
\;\Rightarrow\;
R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = \frac{S_{xy}^2/S_{xx}}{S_{yy}} = \left(\frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}}\right)^2 = r^2$$
Because r2 = R2 = SSR/SST = (SST – SSE)/SST, and the numerator cannot be bigger than the denominator, r must be between −1 and 1. Furthermore, because the ratio can be 1 if and only if SSE = 0, we conclude that r2 = 1 (i.e., r = ±1) if and only if all the points fall on a straight line. ■ Property 1 stands in marked contrast to what happens in regression analysis, where virtually all quantities of interest (the estimated slope, estimated y-intercept, se, etc.) depend on which of the two variables is treated as the response variable. However, Property 3 shows that the proportion of variation in the response variable explained by fitting the simple linear regression model does not depend on which variable plays this role. Property 2 is equivalent to saying that r is unchanged if each xi is replaced by cxi and if each yi is replaced by dyi (where c and d are positive, giving a change in the scale of measurement), as well as if each xi is replaced by xi − a and yi by yi − b (which changes the location of zero on the measurement axis). This implies, for example, that r is the same whether temperature is measured in °F or °C. Property 4 tells us that the maximum value of r, corresponding to the largest possible degree of positive relationship, is r = 1, whereas the most negative relationship is identified with r = −1. According to Property 5, the largest positive and largest negative correlations are achieved only when all points lie along a straight line. Any other configuration of points, even if the configuration suggests a deterministic relationship between variables, will yield an r value less than 1 in absolute magnitude. Thus, r measures the degree of linear relationship among variables. A value of r near 0 is not necessarily evidence of a lack of a strong relationship, but only the absence of a linear relation, so that such a value of r must be interpreted with caution. Figure 12.20 illustrates several configurations of points associated with different values of r.
Figure 12.20 Data plots for different values of r: r near +1; r near −1; r near 0, no apparent relationship; r near 0, nonlinear relationship
Example 12.13 The article "A Cross-National Relationship Between Sugar Consumption and Major Depression?" (Depression and Anxiety 2002: 118–120) reported the following data on x = daily sugar consumption (calories per capita) and y = annual rate of major depression (cases per 100 people) for a sample of six countries.

Country       | Sugar consumption | Depression rate
USA           | 300               | 3.0
Canada        | 390               | 5.2
France        | 350               | 4.4
Germany       | 375               | 5.0
New Zealand   | 480               | 5.7
South Korea   | 150               | 2.3
With n = 6, x̄ = 340.8, sx = 110.6, ȳ = 4.267, and sy = 1.338,

$$r = \frac{1}{6-1}\left[\left(\frac{300 - 340.8}{110.6}\right)\left(\frac{3.0 - 4.267}{1.338}\right) + \cdots + \left(\frac{150 - 340.8}{110.6}\right)\left(\frac{2.3 - 4.267}{1.338}\right)\right] = .944$$

Equivalently, Sxx = 61,120.83, Syy = 8.953, Sxy = 698.667, and r = Sxy/√(Sxx·Syy) = .944. Since the correlation coefficient is positive and close to 1, the data indicates a strong, positive relationship between sugar consumption and depression rate, at least for these six countries. A scatterplot of this data (not shown) also supports the notion of a strong, positive association. Note that if sugar consumption was converted into grams per capita (a gram of sugar has about 4 calories, so x′ᵢ = xᵢ/4), the summary values for the x data would change but r would remain .944.

Does this study show that increased sugar consumption causes depression? Would forcing people in these countries to eat less sugar reduce the depression rate? Not necessarily: the high r value establishes a strong association between the two variables, but (as discussed in earlier chapters) association does not imply causation. Other factors not explored by the investigators may explain why nations with greater sugar consumption report higher depression rates. It should also be noted that aggregating data—here, looking at data on the national rather than individual level—tends to inflate the correlation coefficient by averaging out individual variation that would weaken the apparent relationship between the two variables. ■

Correlation and the Regression Effect

The correlation coefficient can be used to obtain an alternative expression for the equation of the least squares regression line:

$$\hat y = \hat\beta_0 + \hat\beta_1 x = \bar y + r\,\frac{s_y}{s_x}(x - \bar x)$$

(Exercise 66 requests a derivation of this result.) This expression for the regression line can be interpreted as follows. Suppose r > 0. For an x that lies one standard deviation (sx units) above the mean x̄ of the xi's, the predicted y value is ȳ + r·sy, r standard deviations above the mean on the
y scale. If r is negative, the LSRL predicts that the y value when x is one sd above average will be r sd's below average. Critically, since the magnitude of r is typically strictly less than 1, our model predicts that, on a standardized scale, the response variable will be closer to its mean than the explanatory variable is to its mean.

The term regression analysis was first used by Francis Galton in the late nineteenth century in connection with his work on the relationship between father's height x and son's height y. After collecting a number of pairs (xi, yi), Galton used the principle of least squares to obtain the equation of the LSRL with the objective of using it to predict son's height from father's height. In using the derived line, Galton found that if a father was above average in height, his son was also expected to be above average in height, but not by as much as the father. Similarly, the son of a shorter-than-average father was expected to be shorter than average, but not by as much as the father. Thus the predicted height of a son was “pulled back in” toward the mean; because regression can be defined as moving backward, Galton adopted the terminology regression line. This phenomenon of being pulled back in toward the mean has been observed in many other situations (e.g., a player's batting averages from year to year in baseball) and is called the regression effect or regression to the mean. See also Section 5.5 for a discussion of this topic in the context of the bivariate normal distribution.

Because of the regression effect, care must be exercised in experiments that involve selecting individuals based on below-average scores. For example, if students are selected because of below-average performance on a test and they are then given special instruction, the regression effect predicts improvement even if the instruction is useless. A similar warning applies in studies of underperforming businesses or hospital patients.

The Population Correlation Coefficient ρ and Inferences About Correlation

The correlation coefficient r is a measure of how strongly related x and y are in the observed sample. We can think of the pairs (xi, yi) as having been drawn from a bivariate population of pairs, with (Xi, Yi) having some joint probability distribution f(x, y). In Chapter 5, we defined the correlation coefficient ρ(X, Y) by

ρ = ρ(X, Y) = E[((X − μX)/σX)((Y − μY)/σY)] = E[(X − μX)(Y − μY)]/(σX σY)

If we think of f(x, y) as describing the distribution of pairs of values within the entire population, ρ becomes a measure of how strongly related x and y are in that population. Properties of ρ analogous to those for r were given in Chapter 5. The population correlation coefficient ρ is a parameter or population characteristic, just as μX, μY, σX, and σY are, and we can use the sample correlation coefficient to make various inferences about ρ. In particular, r is a point estimate for ρ, and the corresponding estimator is

ρ̂ = R = (1/(n − 1)) Σ ((Xi − X̄)/SX)((Yi − Ȳ)/SY) = Σ(Xi − X̄)(Yi − Ȳ) / √[ Σ(Xi − X̄)² · Σ(Yi − Ȳ)² ]

Many of the intervals and test procedures presented in Chapters 8–10 were based on an assumption of population normality. To test hypotheses about ρ, we must make an analogous assumption about the distribution of pairs of (x, y) values in the population. We are now assuming that both X and Y are random, with joint distribution given by the bivariate normal pdf introduced in Section 5.5. If X = x, recall from Section 5.5 that the conditional distribution of Y is normal with mean μY|x = μ2 + (ρσ2/σ1)(x − μ1) and variance (1 − ρ²)σ2². This is exactly the model used in simple
linear regression with β0 = μ2 − ρμ1σ2/σ1, β1 = ρσ2/σ1, and σ² = (1 − ρ²)σ2², independent of x. The implication is that if the observed pairs (xi, yi) are actually drawn from a bivariate normal distribution, then the simple linear regression model is an appropriate way of studying the behavior of Y for fixed x. If ρ = 0, then μY|x = μ2, independent of x; in fact, when ρ = 0 the joint pdf f(x, y) can be factored into a part involving x only and a part involving y only, which implies that X and Y are independent random variables.

Example 12.14 As discussed in Section 5.5, contours of the bivariate normal distribution are elliptical, and this suggests that a scatterplot of observed (x, y) pairs from such a joint distribution should have a roughly elliptical shape. The article “Methods of Estimation of Visceral Fat: Advantages of Ultrasonography” (Obesity Res. 2003: 1488–1494) includes the scatterplot in Figure 12.21 for x = visceral fat (cm²) measured by ultrasound (US) versus y = visceral fat by computerized tomography (CT) for a sample of n = 100 obese women. CT is considered the most accurate technique for body fat measurement but is costly, time-consuming, and involves exposure to ionizing radiation; the US method is noninvasive and less expensive.
Figure 12.21 Scatterplot for Example 12.14 (fat by CT versus fat by US)
The pattern in the scatterplot in Figure 12.21 seems consistent with an assumption of bivariate normality. If we let ρ denote the true population correlation coefficient between CT and US measurements, then a point estimate of ρ is ρ̂ = r = .71, a value given in the article. Of course we would want fat measurements from the two methods to be very highly correlated before regarding one as an adequate substitute for the other. By that standard, r = .71 is not all that impressive, but the investigators reported that a test of H0: ρ = 0 (to be introduced shortly) gives P-value < .001. ■

Assuming that the pairs are drawn from a bivariate normal distribution allows us to test hypotheses about ρ and to construct a CI. There is no completely satisfactory way to check the plausibility of the bivariate normality assumption. A partial check involves constructing two separate normal probability plots, one for the sample xi's and another for the sample yi's, since bivariate normality implies that the marginal distributions of both X and Y are normal. If either probability plot deviates substantially from a straight-line pattern, the following inferential procedures should not be used when the sample size n is small. Also, as in Example 12.14, the scatterplot should show a roughly elliptical shape.
TESTING FOR THE ABSENCE OF CORRELATION
When H0: ρ = 0 is true, the test statistic

T = R√(n − 2) / √(1 − R²)

has a t distribution with n − 2 df (see Exercise 63).

Alternative Hypothesis     Rejection Region for Level α Test
Ha: ρ > 0                  t ≥ tα,n−2
Ha: ρ < 0                  t ≤ −tα,n−2
Ha: ρ ≠ 0                  either t ≥ tα/2,n−2 or t ≤ −tα/2,n−2
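With raw data in R vectors x and y, cor.test(x, y) carries out this t test directly; when only r and n are available, the statistic and a two-tailed P-value follow from the formula above. A minimal sketch (the numerical values here are purely illustrative):

r <- .50; n <- 20                          # illustrative summary values
t.stat <- r * sqrt(n - 2) / sqrt(1 - r^2)  # test statistic for H0: rho = 0
2 * pt(-abs(t.stat), df = n - 2)           # two-tailed P-value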
A P-value based on n − 2 df can be calculated as described previously.

Example 12.15 Neurotoxic effects of manganese are well known and are usually caused by high occupational exposure over long periods of time. In the fields of occupational hygiene and environmental hygiene, the relationship between lipid peroxidation, which is responsible for deterioration of foods and damage to live tissue, and occupational exposure had not been previously reported. The article “Lipid Peroxidation in Workers Exposed to Manganese” (Scand. J. Work Environ. Health 1996: 381–386) gave data on x = manganese concentration in blood (ppb) and y = concentration (μmol/L) of malondialdehyde, which is a stable product of lipid peroxidation, both for a sample of 22 workers exposed to manganese and for a control sample of 45 individuals. The value of r for the control sample was .29, from which

t = (.29)√(45 − 2) / √(1 − .29²) ≈ 2.0

The corresponding P-value for a two-tailed t test based on 43 df is roughly .052 (the cited article reported only that the P-value > .05). We would not want to reject the assertion that ρ = 0 at either significance level .01 or .05. For the sample of exposed workers, r = .83 and t = 6.7, clear evidence that there is a positive relationship in the entire population of exposed workers from which the sample was selected. Although in general correlation does not necessarily imply causation, it is plausible here that higher levels of manganese cause higher levels of peroxidation. ■

Because ρ measures the extent to which there is a linear relationship between the two variables in the population, the null hypothesis H0: ρ = 0 states that there is no such population relationship. In Section 12.3, we used the t ratio β̂1/sβ̂1 to test for a linear relationship between the two variables in the context of regression analysis. It turns out that the two test procedures are completely equivalent because r√(n − 2)/√(1 − r²) = β̂1/sβ̂1 (Exercise 63).

Other Inferences Concerning ρ

The procedure for testing H0: ρ = ρ0 when ρ0 ≠ 0 is not equivalent to any procedure from regression analysis. The test statistic is based on a transformation of R called the Fisher transformation.
PROPOSITION   When (X1, Y1), …, (Xn, Yn) is a sample from a bivariate normal distribution, the rv

V = (1/2) ln[(1 + R)/(1 − R)]

has approximately a normal distribution with mean and variance

μV = (1/2) ln[(1 + ρ)/(1 − ρ)]        σV² = 1/(n − 3)

The rationale for the transformation is to obtain a function of R that has a variance independent of ρ; this would not be the case with R itself. The approximation will not be valid if n is quite small.
The test statistic for testing H0: ρ = ρ0 is

Z = [ V − (1/2) ln((1 + ρ0)/(1 − ρ0)) ] / (1/√(n − 3))

Alternative Hypothesis     Rejection Region for Level α Test
Ha: ρ > ρ0                 z ≥ zα
Ha: ρ < ρ0                 z ≤ −zα
Ha: ρ ≠ ρ0                 either z ≥ zα/2 or z ≤ −zα/2
A P-value can be calculated in the same manner as for previous z tests.
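Since (1/2) ln[(1 + r)/(1 − r)] is exactly atanh(r), this z test takes only a couple of lines in R; a minimal sketch with illustrative values:

r <- .90; n <- 30; rho0 <- .80                 # illustrative values
z <- (atanh(r) - atanh(rho0)) * sqrt(n - 3)    # Fisher-transformed test statistic
1 - pnorm(z)                                   # P-value for Ha: rho > rho0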
Example 12.16 As far back as Leonardo da Vinci, it was known that height and wingspan (measured fingertip to fingertip between outstretched hands) are closely related. For these measurements (in inches) from 16 students in a statistics class, notice how close the two values are.

Student    1     2     3     4     5     6     7     8
Height     63.0  63.0  65.0  64.0  68.0  69.0  71.0  68.0
Wingspan   62.0  62.0  64.0  64.5  67.0  69.0  70.0  72.0

Student    9     10    11    12    13    14    15    16
Height     68.0  72.0  73.0  73.5  70.0  70.0  72.0  74.0
Wingspan   70.0  72.0  73.0  75.0  71.0  70.0  76.0  76.5
The scatterplot in Figure 12.22 shows an approximately linear shape, and the point cloud is roughly elliptical. Also, the normal plots for the individual variables are roughly linear, so the bivariate normal distribution can reasonably be assumed.
Figure 12.22 Wingspan plotted against height
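With the 16 measurements stored in R vectors (hypothetical names height and wingspan), the plot and the correlation reported next are one-liners:

plot(height, wingspan)    # scatterplot as in Figure 12.22
cor(height, wingspan)     # sample correlation coefficient r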
The correlation is computed to be .9422. Can it be concluded that true correlation between wingspan and height exceeds .8? To carry out a test of H0: ρ = .8 versus Ha: ρ > .8, we Fisher transform .9422 and .8:

v = (1/2) ln[(1 + .9422)/(1 − .9422)] = 1.757        μV = (1/2) ln[(1 + .8)/(1 − .8)] = 1.099

The z test statistic is z = (1.757 − 1.099)/(1/√(16 − 3)) = 2.37. Since 2.37 ≥ z.01 = 2.33, at level .01 we can reject H0: ρ = .8 in favor of Ha: ρ > .8 and conclude that wingspan is highly correlated with height. ■

To obtain a CI for ρ, we first derive an interval for μV = (1/2) ln[(1 + ρ)/(1 − ρ)]. Standardizing V, writing a probability statement, and manipulating the resulting inequalities yields

v ± zα/2·σV = (1/2) ln[(1 + r)/(1 − r)] ± zα/2/√(n − 3)        (12.10)

as the endpoints of a 100(1 − α)% interval for μV. This interval can then be manipulated to yield a CI for ρ.

A 100(1 − α)% confidence interval for ρ is

( (e^(2c1) − 1)/(e^(2c1) + 1) ,  (e^(2c2) − 1)/(e^(2c2) + 1) )

where c1 and c2 are the left and right endpoints, respectively, in Expression (12.10).
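Because the back-transformation (e^(2c) − 1)/(e^(2c) + 1) is the hyperbolic tangent, the interval is easy to compute. A minimal R sketch, assuming bivariate normality and only the summary values r and n:

rho.ci <- function(r, n, conf = .95) {
  z <- qnorm(1 - (1 - conf) / 2)
  c.lims <- atanh(r) + c(-1, 1) * z / sqrt(n - 3)  # endpoints c1, c2 from (12.10)
  tanh(c.lims)                                     # back-transform to a CI for rho
}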
Example 12.17 (Example 12.16 continued) The sample correlation coefficient between wingspan and height was r = .9422, giving v = 1.757. With n = 16, a 95% confidence interval for μV is 1.757 ± 1.96/√(16 − 3) = (1.213, 2.301) = (c1, c2).
The 95% interval for ρ is

( (e^(2(1.213)) − 1)/(e^(2(1.213)) + 1) ,  (e^(2(2.301)) − 1)/(e^(2(2.301)) + 1) ) = (.838, .980)

Notice that this interval excludes .8, and that our hypothesis test in Example 12.16 would have rejected H0: ρ = .8 in favor of the alternative Ha: ρ > .8 at the .025 level. ■

Absent the assumption of bivariate normality, a bootstrap procedure can be used to obtain a CI for ρ or test hypotheses. In Chapter 5, we cautioned that a large value of the correlation coefficient (near 1 or −1) implies only association and not causation. This applies to both ρ and r. It is easy to find strong but weird correlations in which neither variable is causally related to the other. For example, since Prohibition ended in the 1930s, beer consumption and church attendance have correlated very highly. Of course, the reason is that both variables have increased in accord with population growth.

Exercises: Section 12.5 (53–66)

53. The article “Behavioural Effects of Mobile Telephone Use During Simulated Driving” (Ergonomics 1995: 2536–2562) reported that for a sample of 20 experimental subjects, the sample correlation coefficient for x = age and y = time since the subject had acquired a driving license (yr) was .97. Why do you think the value of r is so close to 1? (The article's authors gave an explanation.)

54. The Turbine Oil Oxidation Test (TOST) and the Rotating Bomb Oxidation Test (RBOT) are two different procedures for evaluating the oxidation stability of steam turbine oils. The article “Dependence of Oxidation Stability of Steam Turbine Oil on Base Oil Composition” (J. Soc. Tribologists Lubricat. Engrs., Oct. 1997: 19–24) reported the accompanying observations on x = TOST time (hr) and y = RBOT time (min) for 12 oil specimens.
TOST   4200   3600   3750   3675   4050   2770
RBOT    370    340    375    310    350    200

TOST   4870   4500   3450   2700   3750   3300
RBOT    400    375    285    225    345    285
a. Calculate and interpret the value of the sample correlation coefficient (as did the article’s authors).
b. How would the value of r be affected if we had let x = RBOT time and y = TOST time?
c. How would the value of r be affected if RBOT time was expressed in hours?
d. Construct a scatterplot and normal probability plots and comment.
e. Carry out a test of hypotheses to decide whether RBOT time and TOST time are linearly related.

55. The authors of the paper “Objective Effects of a Six Months' Endurance and Strength Training Program in Outpatients with Congestive Heart Failure” (Med. Sci. Sports Exerc. 1999: 1102–1107) presented a correlation analysis to investigate the relationship between maximal lactate level x and muscular endurance y. The accompanying data was read from a plot in the paper.
x   400    750    770    800    850    1025   1200
y   3.80   4.00   4.90   5.20   4.00   3.50   6.30

x   1250   1300   1400   1475   1480   1505   2200
y   6.88   7.55   4.95   7.80   4.45   6.60   8.90

Sxx = 2,628,930.357,  Syy = 36.9839,  Sxy = 7377.704
A scatterplot shows a linear pattern.
a. Test to see whether there is a positive correlation between maximal lactate level and muscular endurance in the population from which this data was selected.
b. If a regression analysis was to be carried out to predict endurance from lactate level, what proportion of observed variation in endurance could be attributed to the approximate linear relationship? Answer the analogous question if regression is used to predict lactate level from endurance—and answer both questions without doing any regression calculations.

56. Torsion during hip external rotation and extension may explain why acetabular labral tears occur in professional athletes. The article “Hip Rotational Velocities During the Full Golf Swing” (J. Sport Sci. Med. 2009: 296–299) reported on an investigation in which lead hip internal peak rotational velocity (x) and trailing hip peak external rotational velocity (y) were determined for a sample of 15 golfers. Data provided by the article's authors was used to calculate the following summary quantities:

Sxx = 64,732.83,  Syy = 130,566.96,  Sxy = 44,185.87

Separate normal probability plots showed very substantial linear patterns.
a. Calculate a point estimate for the population correlation coefficient.
b. If the simple linear regression model was fit to the data, what proportion of variation in external velocity could be attributed to the model relationship? What would happen to this proportion if the roles of x and y were reversed? Explain.
c. Carry out a test at significance level .01 to decide whether there is a linear
relationship between the two velocities in the sampled population; your conclusion should be based on a P-value.
d. Would the conclusion of (c) have changed if you had tested appropriate hypotheses to decide whether there is a positive linear association in the population? What if a significance level of .05 rather than .01 had been used?

57. Hydrogen content is conjectured to be an important factor in porosity of aluminum alloy castings. The article “The Reduced Pressure Test as a Measuring Tool in the Evaluation of Porosity/Hydrogen Content in A1–7 Wt Pct Si-10 Vol Pct SiC(p) Metal Matrix Composite” (Metallurg. Trans. 1993: 1857–1868) gives the accompanying data on x = content and y = gas porosity for one particular measurement technique.

x   .18   .20   .21   .21   .21   .22   .23
y   .46   .70   .41   .45   .55   .44   .24

x   .23   .24   .24   .25   .28   .30   .37
y   .47   .22   .80   .88   .70   .72   .75
Minitab gives the following output in response to a correlation command: Correlation of Hydrcon and Porosity = 0.449 a. Test at level .05 to see whether the population correlation coefficient differs from 0. b. If a simple linear regression analysis had been carried out, what percentage of observed variation in porosity could be attributed to the model relationship? 58. The wicking properties of certain fabrics were investigated in the article “Thermal and Water Vapor Transport Properties of Selected Lofty Nonwoven Products” (Textile Res. J. 2017: 1413–1424). Use the accompanying data and a .01 significance level to determine whether there is a significant correlation between thickness x (mm) and water vapor resistance
y (m²·Pa/W). Is the result of the test surprising in light of the value of r?

x   20   20   30   30   40   40
y   60   56   65   70   96   78
59. The body armor report introduced in Example 12.1 also reported studies in which two different methods were used to measure the same body armor deformations from bullet impact. The goal of these studies was to assess the extent to which two different measurement instruments agree. Eighty-three backface deformations (mm) were measured using a digital caliper (x) and a laser arm (y), resulting in a sample correlation coefficient of r = .878.
a. Compute a 90% CI for the true correlation coefficient ρ.
b. Test H0: ρ = .8 versus Ha: ρ > .8 at level .05.
c. In a regression analysis of y on x, what proportion of variation in laser arm measurements could be explained by variation in digital caliper measurements?
d. If you decide to perform a regression analysis with digital caliper measurement as the response variable, what proportion of its variation is explainable by variation in laser arm measurement?

60. It is time-consuming and costly to have trucks stop in order to be weighed on a static scale. The Minnesota Department of Transportation considered using a scale that would weigh trucks while they were moving. Here is data for a sample of trucks that were weighed in motion and also on a static scale (1000s of lbs).
Truck       1     2     3     4     5     6     7     8     9     10
In-motion   26.0  29.9  39.5  25.1  31.6  36.2  25.1  31.0  35.6  40.2
Static      27.9  29.1  38.0  27.0  30.3  34.5  27.8  29.6  33.1  35.5
a. Determine the sample correlation coefficient r.
b. Test H0: ρ = .85 versus Ha: ρ > .85 at level .05.
c. How successful do you think the simple linear regression model would be in predicting static weight from in-motion weight? Explain.

61. A sample of n = 500 (x, y) pairs was collected and a test of H0: ρ = 0 versus Ha: ρ ≠ 0 was carried out. The resulting P-value was computed to be .00032.
a. What conclusion would be appropriate at level of significance .001?
b. Does this small P-value indicate that there is a very strong relationship between x and y (a value of ρ that differs considerably from 0)? Explain.
c. Now suppose a sample of n = 10,000 (x, y) pairs resulted in r = .022. Test H0: ρ = 0 versus Ha: ρ ≠ 0 at level .05. Is the result statistically significant? Comment on the practical significance of your analysis.

62. Let x be number of hours per week of studying and y be grade point average. Suppose we have one sample of (x, y) pairs for females and another for males. Then we might like to test the hypothesis H0: ρ1 − ρ2 = 0 against the alternative that the two population correlation coefficients are different.
a. Use properties of the transformed variable V = .5 ln[(1 + R)/(1 − R)] to propose an appropriate test statistic and rejection region (let R1 and R2 denote the two-sample correlation coefficients).
b. The paper “Relational Bonds and Customer's Trust and Commitment: A Study on the Moderating Effects of Web Site Usage” (Serv. Ind. J. 2003: 103–124) reported that n1 = 261, r1 = .59, n2 = 557, r2 = .50, where the first sample consisted of corporate website users and the second of nonusers; here r is the correlation between an assessment of the
strength of economic bonds and performance. Carry out the test for this data (as did the authors of the cited paper).

63. Verify that the t ratio for testing H0: β1 = 0 in Section 12.3 is identical to t for testing H0: ρ = 0.

64. Verify Property 2 of the correlation coefficient: the value of r is independent of the units in which x and y are measured; that is, if xi′ = axi + c and yi′ = byi + d, a > 0, b > 0, then r for the (xi′, yi′) pairs is the same as r for the (xi, yi) pairs.

65. Consider a time series—that is, a sequence of observations X1, X2, … on some response variable (e.g., concentration of a pollutant) over time—with observed values x1, x2, …, xn over n time periods. Then the lag 1 autocorrelation coefficient, which assesses the strength of relationship between series values one time unit apart, is defined as

r1 = Σ_{i=1}^{n−1} (xi − x̄)(xi+1 − x̄) / Σ_{i=1}^{n} (xi − x̄)²
Autocorrelation coefficients r2, r3, … for lags 2, 3, … are defined analogously. a. Calculate the values of r1, r2, and r3 for the temperature data from Chapter 1 Exercise 95. b. Consider the pairs (x1, x2), (x2, x3), …, (xn−1, xn). What is the difference between the formula for the sample correlation coefficient r applied to these pairs and the formula for r1? What if n, the length of the series, is large? What
about r2 compared to r for the n − 2 pairs (x1, x3), (x2, x4), …, (xn−2, xn)?
c. Analogous to the population correlation coefficient ρ, let ρi (i = 1, 2, 3, …) denote the theoretical or long-run autocorrelation coefficients at the various lags. If all these ρ's are zero, there is no (linear) relationship between observations in the series at any lag. In this case, if n is large, each Ri has approximately a normal distribution with mean 0 and standard deviation 1/√n, and different Ri's are almost independent. Therefore H0: ρi = 0 can be rejected at a significance level of approximately .05 if either ri ≥ 2/√n or ri ≤ −2/√n. If n = 100 and r1 = .16, r2 = −.09, r3 = −.15, is there evidence of theoretical autocorrelation at any of the first three lags?
d. If you are testing the null hypothesis in (c) for more than one lag, why might you want to increase the cutoff constant 2 in the rejection region? [Hint: What about the probability of committing at least one type I error?]

66. Let sx and sy denote the sample standard deviations of the observed x's and y's, respectively.
a. Show that Sxx = (n − 1)sx² and similarly for the y's.
b. Show that an alternative expression for the estimated regression line β̂0 + β̂1x is

y = ȳ + r·(sy/sx)(x − x̄)

12.6 Investigating Model Adequacy: Residual Analysis
In the last several sections we have taken for granted that our data is consistent with the simple linear regression model, which makes certain assumptions about the “true error” term e. Table 12.2 summarizes these model assumptions. Since we do not observe the true errors in practice, our assessment of the plausibility of these assumptions is based on the observed residuals e1, …, en. Certain graphs of the residuals, some of which appeared in the context of ANOVA in Chapter 11, will allow us to validate the regression assumptions. Moreover, these graphs may reveal other unusual or noteworthy features of the data.
Table 12.2 Assumptions of the simple linear regression model

Assumption          In terms of Y                                        In terms of e
Linearity           E(Y|x) is a linear function of x.                    For any fixed x, E(e) = 0.
Normality           For any fixed x, the Y distribution is normal.       For any fixed x, the rv e is normally distributed.
Constant variance   The variance of Y at any fixed x value is            V(e) = σ², independent of x.
                    independent of x.
Independence        Yi's for different observations are independent.     ei's for different observations are independent.
Residuals and Standardized Residuals

Suppose the simple linear regression model is correct, and let ŷi = β̂0 + β̂1xi be the predicted y value of the ith observation. Then the ith residual is ei = yi − ŷi = yi − (β̂0 + β̂1xi). To derive properties of the residuals, let Yi − Ŷi represent the ith residual as a random variable (i.e., before observations are actually made). Then

E(Yi − Ŷi) = E(Yi) − E(β̂0 + β̂1xi) = β0 + β1xi − (β0 + β1xi) = 0

It can also be shown (Exercise 74) that

V(Yi − Ŷi) = σ²[ 1 − 1/n − (xi − x̄)²/Sxx ]        (12.11)

Notice that the further xi is from x̄, the smaller the variance will be. This is because the least squares line is “pulled toward” observations whose x values are extreme relative to the other x values. Replacing σ by se and taking the square root of Equation (12.11) gives the estimated standard deviation of the ith residual. Let's now standardize each residual by subtracting the mean value (zero) and then dividing by the estimated standard deviation.

DEFINITION   The standardized residuals are

e*i = (ei − 0)/s_ei = (yi − ŷi) / [ se·√(1 − 1/n − (xi − x̄)²/Sxx) ]        i = 1, …, n

If, for example, a particular standardized residual is 1.5, then the residual itself is 1.5 standard deviations larger than what would be expected from fitting the correct model. Though the standard deviations of the residuals differ from one another, if n is reasonably large the bracketed term in (12.11) will be approximately 1, so some sources use e*i ≈ ei/se as the standardized residual. Computation of the e*i's can be tedious, but the most widely used statistical computer packages automatically provide these values and can construct various plots involving them.
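In R, for example, both kinds of residuals come directly from a fitted model object. A minimal sketch, assuming the data are in a data frame df with columns x and y:

fit <- lm(y ~ x, data = df)   # fit the simple linear regression
e <- residuals(fit)           # ordinary residuals e_i
e.star <- rstandard(fit)      # standardized residuals e*_i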
Example 12.18 Does stress really accelerate aging? A study described in the article “Accelerated Telomere Shortening in Response to Life Stress” (Proc. Nat. Acad. Sci. 2004: 17312–17315) investigated the relationship between x = perceived stress level (on a quantitative scale) and y = telomere length, a biological measure of cell longevity (smaller telomere lengths indicate shorter lifespan at the cellular level). Figure 12.23 shows a scatterplot of (x, y) pairs for 38 subjects; the plot suggests a negative, weak-to-moderate (r = −.32) association between stress level and telomere length.

Figure 12.23 Scatterplot of data in Example 12.18 (telomere length versus perceived stress level)
The accompanying table displays the data, residuals, and standardized residuals obtained from software. The estimated standard deviations of the residuals are slightly different, e.g., for the first two observations, s_e1 ≈ .156 while s_e2 ≈ .157.

xi   yi     ei       e*i         xi   yi     ei       e*i
14   1.30   0.068    0.439       7    1.35   0.065    0.431
17   1.32   0.111    0.710       11   1.00   −0.255   −1.649
14   1.08   −0.152   −0.971      15   1.24   0.016    0.103
27   1.02   −0.112   −0.729      5    1.25   −0.050   −0.341
22   1.24   0.070    0.446       21   1.26   0.082    0.524
12   1.18   −0.067   −0.431      24   1.50   0.345    2.218
22   1.18   0.010    0.062       21   1.24   0.062    0.396
24   1.12   −0.035   −0.224      6    1.50   0.207    1.387
25   0.94   −0.207   −1.337      20   1.30   0.114    0.729
18   1.46   0.259    1.650       22   1.22   0.050    0.318
28   1.22   0.096    0.627       26   0.84   −0.300   −1.941
21   1.30   0.122    0.779       10   1.30   0.038    0.246
19   0.84   −0.353   −2.250      18   1.12   −0.081   −0.515
23   1.18   0.017    0.112       17   1.12   −0.089   −0.564
15   1.22   −0.004   −0.025      20   1.22   0.034    0.220
15   0.92   −0.304   −1.943      13   1.10   −0.139   −0.895
27   1.12   −0.012   −0.078      33   0.94   −0.146   −0.994
17   1.40   0.191    1.220       20   1.32   0.134    0.857
6    1.32   0.027    0.182       29   1.30   0.183    1.209
■
Diagnostic Plots for Checking Assumptions

The basic plots that many statisticians recommend for an assessment of model validity are the following:

1. e*i (or ei) on the vertical axis and xi on the horizontal axis—i.e., a plot of the (xi, e*i) or (xi, ei) pairs
2. e*i (or ei) on the vertical axis and ŷi on the horizontal axis—i.e., a plot of the (ŷi, e*i) or (ŷi, ei) pairs
3. A normal probability plot of the e*i's (or ei's)

Plots 1 and 2 are called residual plots (against the explanatory variable and the fitted values, respectively). These two plots generally look quite similar, since ŷ is simply a linear function of x; the advantage of plot 2 is that we may also use it for assumption diagnostics in multiple regression, as we'll see in Section 12.7. Diagnostic plots 2 and 3 were both utilized in Chapter 11 for validating ANOVA assumptions.

A residual plot can be used to validate two assumptions for simple linear regression: linearity and constant variance. Ideally, residuals should be randomly distributed about a horizontal line passing through 0. Figure 12.24 shows two prototype scatterplots and the corresponding residual plots. Figure 12.24a shows a scatterplot for which a straight line, at least initially, may appear a good fit. However, the associated residual plot exhibits strong curvature, suggesting the relationship between x and (the mean of) y is not actually linear. Figure 12.24b shows nonconstant variance: as x increases, so does the spread of the residuals about the mean line. (It is certainly possible to have a residual plot indicating both nonlinearity and nonconstant variance.)
Figure 12.24 Scatterplots and residual plots: (a) violation of the linearity assumption; (b) violation of the constant variance assumption
The assumption of normally distributed errors is, naturally, checked by a normal probability plot of the residuals or standardized residuals. As before, approximate normality of the residuals becomes less important as n increases for most of our inferential procedures in regression. For example, the t test in Section 12.3 is still valid for large n even if the residuals are clearly nonnormal. The exception to this is a prediction interval (PI) for a future y value presented in Section 12.4.

Example 12.19 (Example 12.18 continued) Figure 12.25 presents a residual-versus-fit plot (e*i vs. ŷi) and a normal probability plot of the e*i's for the stress–telomere data. The lack of a pattern in Figure 12.25a, e.g., lack of curvature, validates the linearity assumption, while the relatively equal vertical spread throughout the graph indicates the constant variance assumption is reasonable here. The points in the Figure 12.25b plot are quite straight, suggesting that the standardized residuals—and, by extension, the true errors ei—might reasonably come from a normally distributed population.
Figure 12.25 Plots for the data from Example 12.19: (a) standardized residuals versus fitted values; (b) normal probability plot of the standardized residuals
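Plots like those in Figure 12.25 take only a few lines once a model has been fit; a minimal R sketch (continuing the hypothetical fit object from the earlier sketch):

e.star <- rstandard(fit)                  # standardized residuals
plot(fitted(fit), e.star,                 # residual-versus-fit plot (cf. Figure 12.25a)
     xlab = "Fitted value", ylab = "Standardized residual")
abline(h = 0, lty = 2)
qqnorm(e.star); qqline(e.star)            # normal probability plot (cf. Figure 12.25b)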
What about the final assumption, independence? Provided that the observations were obtained through random sampling (or, in the case of an experiment, treatments were randomly assigned to subjects), it is reasonable to treat the response values as independent. In Example 12.18, the 38 subjects were volunteers, but there is no reason to think their stress levels or telomere lengths are related. (The lack of random sampling does, however, call into question the extent to which the study's results can be generalized to a larger population of individuals.) ■

Other Diagnostic Tools

Besides violations of the inference requirements for simple linear regression, bivariate data will sometimes present other difficulties:

1. The selected model fits the data well except for a few discrepant or outlying data values, which may have greatly influenced the choice of the best-fit function.
2. When the observations (xi, yi) appear in time order (i.e., the subscript i is actually a time index), the errors exhibit dependence over time.
3. One or more relevant explanatory variables have been omitted from the model.

Figure 12.26 presents plots corresponding to these three scenarios. Some unusual observations can be detected by a residual plot, particularly those with large standardized residuals (see Figure 12.26a). However, detection of all types of unusual observations can be difficult, especially in a multiple regression setting. A more complete analysis of unusual observations for both simple and multiple regression is presented in Section 12.9.
Figure 12.26 Plots that indicate abnormality in data: (a) a discrepant observation; (b) dependence in errors; (c) an omitted explanatory variable
Figure 12.26b shows a plot of the standardized residuals against time order; this is only appropriate when the data is collected sequentially (in successive time periods) rather than randomly. The line segments connecting successive points emphasize the sequential nature of the observations.
Observe that the residuals have a perfect alternating pattern: the first residual is above the mean line, the next one is below, the next above, and so on. This is an example of autocorrelation: a time-dependent pattern in the residuals. The methods of this chapter should be applied with great caution to modeling sequential observations—known as time series data—since the independence assumption is generally not met for this type of data.

Figure 12.26c shows a plot of the e*i's against an explanatory variable other than x. The presence of a pattern suggests that this other explanatory variable should be added to the model, resulting in a multiple regression model. In Example 12.18, we might find that the residuals from the regression of y = telomere length on x1 = stress level are linearly related to the values of x2 = subject's age. If so, it makes sense to use both x1 and x2 to predict y (the topic of Section 12.7).

Remedies for Assumption Violations

We now briefly indicate what remedies are available for some of the difficulties encountered in this section. Several of these are discussed in greater detail in subsequent sections of the book. For a more comprehensive discussion, one or more of the bibliographic references on regression analysis should be consulted.

If the relationship between x and y appears nonlinear (e.g., as indicated in the residual plot of Figure 12.24a), then a model other than Y = β0 + β1x + e may be fit. This can be achieved by transformation of the x and/or y variable, or by inclusion of higher-order polynomial terms (see Section 12.8). Transformations of the y variable can also be used to remedy nonconstant variance. For instance, if the spread of the residuals grows with x as in the residual plot of Figure 12.24b, the transformation y′ = ln(y) is often applied. Another popular approach to addressing nonconstant variance is the method of weighted least squares. The basic idea of weighted least squares is to find coefficients b0 and b1 to minimize the expression

gw(b0, b1) = Σ wi[yi − (b0 + b1xi)]²

where the wi's are “weights” determined by the variance structure of the errors. For example, if the standard deviation of Y is proportional to x (for x > 0)—that is, V(Y) = kx²—then it can be shown that the weights wi = 1/xi² yield minimum-variance estimators of b0 and b1. The books by Kutner et al. and by Chatterjee and Hadi explain weighted least squares in detail (see the bibliography). Weighted least squares are used quite frequently by econometricians (economists who use statistical methods) to estimate parameters.

Generally speaking, violations of the normality assumption cannot be fixed, though such a problem may naturally be resolved while addressing linearity and constant variance issues. Again, if the sample size is reasonably large then normality is not as important, except for prediction intervals. If a small data set has the feature that only the normality assumption is violated, consult your friendly neighborhood statistician for information on computer-intensive methods (e.g., bootstrapping).

When plots or other evidence suggest that the data set contains outliers or points having large influence on the resulting fit, one possible approach is to omit these outlying points and re-compute the estimated regression equation. This would certainly be correct if it was found that the outliers resulted from errors in recording data values or experimental errors. If no assignable cause can be found for the outliers, it is still desirable to report the estimated equation both with and without outliers. Another approach is to retain possible outliers but to use an estimation principle that puts relatively less weight on outlying values than does the principle of least squares. One such principle is minimize absolute deviations (MAD), which selects b0 and b1 to minimize Σ|yi − (b0 + b1xi)|. Unlike the least squares estimates, there are no nice formulas for the MAD estimates; their values
must be found by using an iterative computational procedure. Such procedures are also used when it is suspected that the ei's have a distribution that is not normal but instead has “heavy tails” (making it much more likely than for the normal distribution that discrepant values will enter the sample); robust regression procedures are those that produce reliable estimates for a wide variety of underlying error distributions. Least squares estimators are not robust, in the same way that the sample mean X̄ is not a robust estimator for μ.

Exercises: Section 12.6 (67–76)

67. The x values and standardized residuals for the temperature-energy use data of Exercise 49 (Section 12.4) are displayed in the accompanying table. Construct a standardized residual plot and comment on its appearance.
x    48       53       56       58      58      59      59      60       68       69
e*   −0.110   −1.153   −1.077   0.486   0.836   0.560   1.258   −0.148   −0.660   −1.869

x    69      70      73      75       79      80       80      84       87      88
e*   0.356   0.858   0.724   −0.622   0.743   −0.460   1.471   −2.133   1.181   −0.302
68. Suppose the variables x = commuting distance and y = commuting time are related according to the simple linear regression model with σ = 10.
a. If n = 5 observations are made at the x values x1 = 5, x2 = 10, x3 = 15, x4 = 20, and x5 = 25, calculate the (true) standard deviations of the five corresponding residuals.
b. Repeat part (a) for x1 = 5, x2 = 10, x3 = 15, x4 = 20, and x5 = 50.
c. What do the results of parts (a) and (b) imply about the deviation of the estimated line from the observation made at the largest sampled x value?

69. Nickel-based alloys are especially difficult to machine due to characteristics including high hardness and low thermal conductivity. The article “Multi-response Optimization Using ANOVA and Desirability Function Analysis: A Case Study in End Milling of Inconel Alloy” (ARPN J. Engr.
Appl. Sci. 2014: 457–463) reports the following data on x = cutting velocity (m/min) and y = material removal rate (mm²/min) from one experiment.

x   25       25       25       50       50       50       75       75       75
y   258.48   268.80   270.18   338.58   343.86   354.24   414.36   424.80   451.80
a. The LSRL for this data is y = 182.7 + 3.29x. Calculate and plot the residuals against x and then comment on the appropriateness of the simple linear regression model. b. Use se = 11.759 to calculate the standardized residuals from a simple linear regression. Construct a standardized residual plot and comment. Also construct a normal probability plot and comment. 70. As the air temperature drops, river water becomes supercooled and ice crystals form. Such ice can significantly affect the hydraulics of a river. The article “Laboratory Study of Anchor Ice Growth” (J. Cold Regions Engr. 2001: 60–66) described an experiment in which ice thickness (mm) was studied as a function of elapsed time (hr) under specified conditions. The following data was read from a graph in the article: n = 33; x = .17, .33, .50, .67, …, 5.50; y = .50, 1.25, 1.50, 2.75, 3.50, 4.75, 5.75, 5.60, 7.00, 8.00, 8.25, 9.50, 10.50, 11.00, 10.75, 12.50, 12.25, 13.25, 15.50, 15.00, 15.25, 16.25, 17.25, 18.00, 18.25, 18.15, 20.25, 19.50, 20.00, 20.50, 20.60, 20.50, 19.80.
a. The R² value resulting from a least squares fit is .977. Given the high R², does it seem appropriate to assume an approximate linear relationship?
b. The residuals, listed in the same order as the x values, are

−1.03  −0.92  −1.35  −0.78  −0.68  −0.11   0.21  −0.59   0.13   0.45   0.06
 0.62   0.94   0.80  −0.14   0.93   0.04   0.36   1.92   0.78   0.35   0.67
 1.02   1.09   0.66  −0.09   1.33  −0.10  −0.24  −0.43  −1.01  −1.75  −3.14
Plot the residuals against x, and reconsider the question in (a). What does the plot suggest?

71. The accompanying data on x = moisture content (% d.b.) and y = true density (kg/m³) was read from a plot in the article “Physical Properties of Cumin Seed” (J. Agric. Engr. Res. 1996: 93–98).
x   7.0    9.3    13.2   16.3   19.1   22.0
y   1046   1065   1094   1117   1130   1135
The equation of the least squares line is y = 1008.14 + 6.19268x (this differs very slightly from the equation given in the article); se = 7.265 and R2 = .968. a. Carry out a test of model utility and comment. b. Compute the values of the residuals and plot the residuals against x. Does the plot suggest that a linear regression function is inappropriate? c. Compute the values of the standardized residuals and plot them against x. Are there any unusually large (positive or negative) standardized residuals? Does this plot give the same message as the plot of part (b) regarding the appropriateness of a linear regression function? 72. Continuous recording of heart rate can be used to obtain information about the level of exercise intensity or physical strain during sports participation, work, or other daily activities. The article “The Relationship Between Heart Rate and Oxygen
Uptake During Non-Steady State Exercise” (Ergonomics 2000: 1578–1592) reported on a study to investigate using heart rate response (x, as a percentage of the maximum rate) to predict oxygen uptake (y, as a percentage of maximum uptake) during exercise. The accompanying data was read from a graph in the paper.

HR    43.5   44.0   44.0   44.5   44.0   45.0   48.0   49.0
VO2   22.0   21.0   22.0   21.5   25.5   24.5   30.0   28.0

HR    49.5   51.0   54.5   57.5   57.7   61.0   63.0   72.0
VO2   32.0   29.0   38.5   30.5   57.0   40.0   58.0   72.0
Use a statistical software package to perform a simple linear regression analysis. Considering the list of potential difficulties in this section, see which of them apply to this data set. 73. Example 12.6 presented the residuals from a simple linear regression of moisture content y on filtration rate x. a. Plot the residuals against x. Does the resulting plot suggest that a straight-line regression function is a reasonable choice of model? Explain your reasoning. b. Using se = .665, compute the values of the standardized residuals. Is ei ei =se for i = 1, …, n, or are the ei ’s not close to being proportional to the ei’s? c. Plot the standardized residuals against x. Does the plot differ significantly in general appearance from the plot of part (a)? 74. Express the ith residual Yi Y^i (where P Y^i ¼ b^ þ b^ xi ) in the form cj Yj , a linear 0
1
function of the Yj’s. Then use rules of variance to verify that VðYi Y^i Þ is given by Expression (12.11).
75. Consider the following classic four (x, y) data sets; the first three have the same x values, so these values are listed only once (Frank Anscombe, “Graphs in Statistical Analysis,” Amer. Statist. 1973: 17–21):
766
12 1–3
1
2
3
x
y
y
y
10.0 8.0 13.0 9.0 11.0 14.0 6.0 4.0 12.0 7.0 5.0
8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68
9.14 8.14 8.74 8.77 9.26 8.10 6.13 3.10 9.13 7.26 4.74
7.46 6.77 12.74 7.11 7.81 8.84 6.08 5.39 8.15 6.42 5.73
4 x 8.0 8.0 8.0 8.0 8.0 8.0 8.0 19.0 8.0 8.0 8.0
SSPE ¼
4 y 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.50 5.56 7.91 6.89
For each of these four data sets, the values of x, y, Sxx, Syy, and Sxy are virtually identical, so all quantities computed from these five will be essentially identical for the four sets—the equation of the least squares line (y = 3 + .5x), SSE, se, R2, t intervals, t statistics, and so on. The summaries provide no way of distinguishing among the four data sets. Based on a scatterplot and a residual plot for each set, comment on the appropriateness or inappropriateness of fitting a straight-line model; include in your comments any specific suggestions for how a “straightline analysis” might be modified or qualified. 76. If there is at least one x value at which more than one observation has been made, the lack of fit test is a formal procedure for testing H0: lY|x = b0 + b1x for some values b0, b1 (the true regression function is linear) versus Ha: H0 is not true (the true regression function is not linear) Suppose observations are made at c levels x1, x2, …, xc. Let Yi1 ; Yi2 ; . . .; Yini denote the ni observations when x = xi (i = 1, …, c). P With n ¼ ni , SSE has n − 2 df. We break SSE into two pieces, SSPE (pure error) and SSLF (lack of fit), as follows:
Regression and Correlation
XX i
ðYij Y i Þ2
j
SSLF ¼ SSE SSPE The ni observations at xi contribute ni − 1 df to SSPE, so the number of degrees of P freedom for SSPE is i ðni 1Þ = n – c, and the degrees of freedom for SSLF is n − 2 − (n − c) = c − 2. Let MSPE = SSPE/(n − c), MSLF = SSLF/(c − 2). Then it can be shown that whereas E(MSPE) = r2 whether or not H0 is true, E(MSLF) = r2 if H0 is true and E(MSLF) > r2 if H0 is false. Test statistic: F ¼ MSLF=MSPE Rejection region: f Fa;c2;nc The following data comes from the article “Single Cell Isolation Process with Laser Induced Forward Transfer” (J. Bio. Engr. 2017), with x = laser pulse energy (lJ), w = titanium thickness (nm) and y = number of viable cells resulting from a new cell-isolation technique. x
5
5
5
5
5
5
5
5
5
w
80
40
120
80
40
120
80
40
120
y
24
21
16
25
22
14
24
22
14
x
9
9
9
9
9
9
9
9
9
w
80
40
120
80
40
120
80
40
120
y
18
35
24
16
37
25
19
37
24
x
13
13
13
13
13
13
13
13
13
w
80
40
120
80
40
120
80
40
120
y
31
47
42
29
46
38
33
38
40
a. Construct a scatterplot of y vs x. Does it appear that x and the mean of y are linearly related? b. Carry out the lack of fit test on the (x, y) data at significance level .05. c. Repeat parts (a) and (b) for y vs w.
12.7 Multiple Regression Analysis
In multiple regression, the objective is to build a probabilistic model that relates a response variable y to more than one explanatory or predictor variable. Let k represent the number of predictor variables (k ≥ 2) and denote these predictors by x1, x2, …, xk. For example, in attempting to predict the selling price of a house, we might have k = 4 with x1 = size (ft²), x2 = age (years), x3 = number of bedrooms, and x4 = number of bathrooms.

THE MULTIPLE LINEAR REGRESSION MODEL
There are parameters β0, β1, …, βk and σ such that for any fixed values of the explanatory variables x1, …, xk, the response variable Y is related to the xj's through the model equation

Y = β0 + β1x1 + ··· + βkxk + e        (12.12)
where the random variable e is assumed to follow a N(0, σ) distribution. It is also assumed that the ei's (and thus the Yi's) associated with different observations are independent of one another.

As before, e is the random error term (or random deviation) in the model, and the assumptions for statistical inference may be stated in terms of e. Equation (12.12) says that the true (or population) regression function, β0 + β1x1 + ··· + βkxk, gives the expected value of Y as a function of x1, …, xk. The βj's are the true (or population) regression coefficients.

Interpret the regression coefficients carefully! Performing multiple regression on k explanatory variables is not the same thing as creating k separate, simple linear regression models. For example, β1 (the coefficient on the predictor x1) cannot be interpreted in multiple regression without reference to the other predictor variables in the model. Here's a correct interpretation:

β1 = the change in the average value of y associated with a one-unit increase in x1, adjusting for the effects of the other explanatory variables

As an example, for the four predictors of home price mentioned above, β1 is interpreted as the change in expected selling price when size increases by 1 ft², adjusting for the effects of age, number of bedrooms, and number of bathrooms. The other coefficients are interpreted similarly. Some statisticians refer to β1 as describing the effect of x1 on y “after removing the effects of” the other predictors or “in the presence of” the other predictors; either of these interpretations is acceptable. You may hear β1 defined as the change in the mean of y associated with a one-unit increase in x1 “while holding the other variables fixed.” This is correct only if it is possible to increase the value of one predictor while the values of all others remain constant.

Estimating Parameters

The data in simple linear regression consists of n pairs (x1, y1), …, (xn, yn). Suppose that a multiple regression model contains two explanatory variables, x1 and x2. Then each observation will consist of three numbers (a triple): a value of x1, a value of x2, and a value of y. More generally, with k predictor variables, each observation will consist of k + 1 numbers (a “k + 1 tuple”). The values of the predictors in the individual observations are denoted using double-subscripting:
xij = the value of the jth predictor xj in the ith observation  (i = 1, …, n; j = 1, …, k)

Thus the first subscript is the observation number and the second subscript is the predictor number. For example, x83 is the value of the third predictor in the eighth observation (to avoid confusion, a comma can be inserted between the two subscripts, e.g. x12,3). The first observation in our data set is then (x11, x12, …, x1k, y1), the second is (x21, x22, …, x2k, y2), and so on.

Consider candidates b0, b1, …, bk for estimates of the βj's and the corresponding candidate regression function b0 + b1x1 + ··· + bkxk. Substituting the predictor values for any individual observation into this candidate function gives a prediction for the y value that would be observed, and subtracting this prediction from the actual observed y value gives the prediction error. As in Section 12.2, the principle of least squares says we should square these prediction errors, sum, and then take as the least squares estimates β̂0, β̂1, …, β̂k the values of the bj's that minimize the sum of squared prediction errors. To carry out this procedure, define the criterion function (sum of squared prediction errors) by

g(b0, b1, …, bk) = Σ_{i=1}^{n} [yi − (b0 + b1xi1 + ··· + bkxik)]²
then take the partial derivative of g(·) with respect to each bj (j = 0, 1, …, k) and equate these k + 1 partial derivatives to 0. The result is a system of k + 1 equations, the normal equations, in the k + 1 unknowns (the bj's):

nb0 + (Σxi1)b1 + (Σxi2)b2 + ··· + (Σxik)bk = Σyi
(Σxi1)b0 + (Σxi1²)b1 + (Σxi1xi2)b2 + ··· + (Σxi1xik)bk = Σxi1yi
  ⋮
(Σxik)b0 + (Σxi1xik)b1 + ··· + (Σxi,k−1xik)bk−1 + (Σxik²)bk = Σxikyi        (12.13)
Notice that the normal equations are linear in the unknowns (because the criterion function is quadratic). We will assume that the system has a unique solution, the least squares estimates β̂0, β̂1, …, β̂k. The result is an estimated regression function

y = β̂0 + β̂1x1 + ··· + β̂kxk

Section 12.9 uses matrix algebra to deal with the system of equations and develop inferential procedures for multiple regression. For the moment, though, we shall take advantage of the fact that all of the commonly used statistical software packages are programmed to solve the equations and provide the results needed for inference.

Sometimes interest in the individual regression coefficients is the main reason for a regression analysis. The article “Autoregressive Modeling of Baseball Performance and Salary Data” (Proc. of the Statistical Graphics Section, Amer. Stat. Assoc. 1988, 132–137) describes a multiple regression of runs scored as a function of singles, doubles, triples, home runs, and walks (combined with hit-by-pitcher). The estimated regression equation is
runs = −2.49 + .47 singles + .76 doubles + 1.14 triples + 1.54 home runs + .39 walks

This is very similar to the popular slugging percentage statistic, which gives weight 1 to singles, 2 to doubles, 3 to triples, and 4 to home runs. However, the slugging percentage gives no weight to walks, whereas the estimated regression function puts weight .39 on walks, more than 80% of the weight it assigns to singles. The importance of walks is well known among statisticians who follow baseball, and it is interesting that there are now some statistically savvy people in major league baseball management who are emphasizing walks in choosing players.

Example 12.20 Fuel efficiency of an automobile is determined to a large extent by various intrinsic characteristics of the vehicle. Consider the following multivariate data set consisting of n = 38 observations on x1 = weight (1000s of pounds), x2 = engine displacement (i.e., engine size, in³), x3 = number of cylinders, and y = fuel efficiency, measured in gallons per 100 miles:
x1      x2    x3   y        x1      x2    x3   y
4.360   350   8    5.92     3.830   318   8    5.49
4.054   351   8    6.45     2.585   140   4    3.77
3.605   267   8    5.21     2.910   171   6    4.57
3.940   360   8    5.41     1.975    86   4    2.93
2.155    98   4    3.33     1.915    98   4    2.85
2.560   134   4    3.64     2.670   121   4    3.65
2.300   119   4    3.68     1.990    89   4    3.17
2.230   105   4    3.24     2.135    98   4    3.39
2.830   131   5    4.93     2.670   151   4    3.52
3.140   163   6    5.88     2.595   173   6    3.47
2.795   121   4    4.63     2.700   173   6    3.73
3.410   163   6    6.17     2.556   151   4    2.99
3.380   231   6    4.85     2.200   105   4    2.92
3.070   200   6    4.81     2.020    85   4    3.14
3.620   225   6    5.38     2.130    91   4    2.68
3.410   258   6    5.52     2.190    97   4    3.28
3.840   305   8    5.88     2.815   146   6    4.55
3.725   302   8    5.68     2.600   121   4    4.65
3.955   351   8    6.06     1.925    89   4    3.13
We’ve chosen to use this representation of fuel efficiency (similar to European measurement), rather than the traditional American “miles per gallon” version, because the former is linearly related to our predictors while the latter is not. One consequence is that lower y values are better, in the sense that they indicate vehicles with better fuel efficiency. Our goal is to predict fuel efficiency (y) from the predictor variables x1, x2, and x3 (so k = 3).

Figure 12.27 shows R output from a request to fit a linear function to the fuel efficiency data. The least squares coefficients appear in the Estimate column of the Coefficients block:

β̂0 = −1.64351     β̂1 = 2.33584     β̂2 = −0.01065     β̂3 = 0.21774
Thus the estimated regression equation is

y = −1.64351 + 2.33584x1 − 0.01065x2 + 0.21774x3

Consider an automobile that weighs 3000 lbs, has a displacement of 175 in³, and is equipped with a six-cylinder engine. A prediction for the resulting fuel efficiency is obtained by substituting the values of the predictors into the fitted equation:
Call:
lm(formula = Fuel.Eff ~ Weight + Disp + Cyl, data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-0.63515 -0.30801  0.00029  0.23637  0.63957

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.64351    0.48512  -3.388  0.001795 **
Weight       2.33584    0.28810   8.108  1.87e-09 ***
Disp        -0.01065    0.00269  -3.959  0.000364 ***
Cyl          0.21774    0.11566   1.883  0.068342 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3749 on 34 degrees of freedom
Multiple R-squared: 0.9034, Adjusted R-squared: 0.8948
F-statistic: 105.9 on 3 and 34 DF, p-value: < 2.2e-16

Figure 12.27 Multiple regression output for Example 12.20
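The output in Figure 12.27 can be reproduced with a few lines of R. The sketch below assumes the 38 observations have been read into a data frame df with columns Weight, Disp, Cyl, and Fuel.Eff (the names appearing in the lm() call of the figure); the predict() line anticipates the prediction made next in the text.

# Fit the multiple regression model and display the summary shown in Figure 12.27
fit <- lm(Fuel.Eff ~ Weight + Disp + Cyl, data = df)
summary(fit)

# Predicted fuel efficiency for a 3000-lb, 175-cubic-inch, six-cylinder vehicle
predict(fit, newdata = data.frame(Weight = 3, Disp = 175, Cyl = 6))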
ŷ = −1.64351 + 2.33584(3) − 0.01065(175) + 0.21774(6) = 4.8067

Such an automobile is predicted to use 4.8 gallons over 100 miles. This is also a point estimate of the mean fuel efficiency of all vehicles with these specifications (x1 = 3, x2 = 175, x3 = 6).
The intercept b̂0 = −1.64351 has no contextual meaning here, since you can’t really have a vehicle with no weight and no engine. The coefficient b̂1 = 2.33584 on x1 indicates that, after adjusting for both engine displacement (x2) and number of cylinders (x3), a 1000-lb increase in weight is associated with an estimated increase of about 2.34 gallons per 100 miles, on average. Equivalently, a 100-lb weight increase is predicted to increase average fuel consumption by 0.234 gallons per 100 miles, accounting for engine size (i.e., displacement and number of cylinders).
Consider fitting the simple linear model to just y and x3. The resulting LSRL is y = 1.057 + 0.6067x3, suggesting that a one-cylinder increase is associated with about a 0.6-gallon increase in average fuel consumption per 100 miles. This is our estimate of the relationship between x3 and y ignoring the effects of any other explanatory variables; notice that it differs substantially from the previous estimate of 0.21774, which adjusted for both the vehicle’s weight and its engine displacement. (Since these cars have 4, 6, or 8 cylinders, it’s perhaps more appropriate to double these coefficients as an indication of the effect of a two-cylinder increase.) ■

Descriptive Measures of Fit

The predicted (or fitted) value ŷi results from substituting the values of the predictors from the ith observation into the estimated equation:

ŷi = b̂0 + b̂1xi1 + ⋯ + b̂kxik
The corresponding residual is ei = yi − ŷi. As in simple linear regression, the closer the residuals are to zero, the better the job our estimated regression function is doing in predicting the y values in our sample. For the fuel efficiency data, the values of the three predictors in the first observation are x11 = 4.360, x12 = 350, and x13 = 8, so

ŷ1 = −1.64351 + 2.33584(4.360) − 0.01065(350) + 0.21774(8) = 6.555
e1 = y1 − ŷ1 = 5.92 − 6.555 = −0.635

Residuals are sometimes important not just for judging the quality of a regression. Several enterprising students developed a multiple regression model using age, size in square feet, etc. to predict the price of four-unit apartment buildings. They found that one building had a large negative residual, meaning that the price was much lower than predicted. As it turned out, the reason was that the owner had “cash-flow” problems and needed to sell quickly.
In simple linear regression, after fitting a straight line to bivariate data and obtaining the residuals, we calculated sums of squares and used them to obtain two assessments of how well the line summarized the relationship: the residual standard deviation and the coefficient of determination. Let’s now follow the same path in multiple regression:

SSE = Σei² = Σ(yi − ŷi)²    SSR = Σ(ŷi − ȳ)²    SST = Σ(yi − ȳ)²

These are the same expressions introduced in Section 12.2, and again it can be shown that SST = SSE + SSR. The interpretation is that the total variation in the values of the response variable is the sum of explained variation (SSR) and unexplained variation (SSE).
Each sum of squares has an associated number of degrees of freedom (df). In particular,

df for SSE = n − (k + 1)

This is because the k + 1 coefficients b0, b1, …, bk must be estimated before SSE can be obtained, entailing a reduction of k + 1 df for SSE. Notice that for the case of simple linear regression, k = 1 and df for SSE = n − (1 + 1) = n − 2 as before.

DEFINITION
The standard deviation about the estimated multiple regression function (or simply the residual standard deviation) is

se = √[SSE / (n − (k + 1))]

The coefficient of (multiple) determination, denoted by R2, is given by

R2 = 1 − SSE/SST = SSR/SST

Roughly speaking, se is the size of a typical deviation from the fitted equation. The residual standard deviation is also our point estimate of the model parameter σ, i.e., σ̂ = se. R2 is the proportion of variation in observed y values that can be explained by (i.e., attributed to) the multiple regression model. Software often reports 100R2, the percent of explained variation. The closer R2 is to 1, the
larger the proportion of observed y variation that is being explained. (In fact, the positive square root of R2, called the multiple correlation coefficient, turns out to be the sample correlation coefficient between the observed yi’s and the predicted ŷi’s—another measure of the quality of the fit of the estimated regression function.)
Unfortunately, there is a potential problem with R2 in multiple regression: its value can be inflated by including predictors in the model that are relatively unimportant or even frivolous. For example, suppose we plan to obtain a sample of 20 recently sold houses in order to relate sale price to various characteristics of a house. Natural predictors include interior size, lot size, age, number of bedrooms, and distance to the nearest school. Suppose we also include in the model the diameter of the doorknob on the door of the master bedroom, the height of the toilet bowl in the master bath, and so on until we have 19 predictors. Then unless we are extremely unlucky in our choice of predictors, the value of R2 will be 1 (because 20 coefficients are perfectly estimated from 20 observations)!
Rather than seeking a model that has the highest possible R2 value, which can be achieved just by “packing” our model with predictors, what is desired is a relatively simple model based on just a few important predictors whose R2 value is high. It is therefore desirable to adjust R2 to take account of the fact that its value may be quite high just because many predictors were used relative to the amount of data. The adjusted coefficient of multiple determination or adjusted R2 is defined by

R2a = 1 − MSE/MST = 1 − [SSE/(n − (k + 1))] / [SST/(n − 1)] = 1 − [(n − 1)/(n − (k + 1))] · SSE/SST

The ratio multiplying SSE/SST in adjusted R2 exceeds 1 (the denominator is smaller than the numerator), so adjusted R2 is smaller than R2 itself, and in fact will be much smaller when k is large relative to n. A value of R2a much smaller than R2 is a warning flag that the chosen model has too many predictors relative to the amount of data.

Example 12.21 (Example 12.20 continued) Figure 12.27 shows that se = 0.3749, R2 = 90.34%, and R2a = 89.48% for the fuel efficiency data. Fuel efficiency predictions based on the estimated regression equation are typically off by about 0.375 gal/100 mi (positive or negative) from vehicles’ actual fuel efficiency. The model explains about 90% of the observed variation in fuel efficiency. ■

A Model Utility Test

In multiple regression, is there a single indicator that can be used to judge whether a particular model (equivalently, a particular set of predictors x1, …, xk) will be useful? The value of R2 certainly communicates a preliminary message, but this value is sometimes deceptive because it can be greatly inflated by using a large number of predictors (large k) relative to the sample size n (this is the rationale behind adjusting R2). The model utility test in simple linear regression involved the null hypothesis H0: b1 = 0, according to which there is no useful relation between y and the single predictor x. Here we consider the assertion that b1 = 0, b2 = 0, …, bk = 0, which says that there is no useful relationship between y and any of the k predictors. If at least one of these bj’s is not 0, the corresponding predictor(s) is (are) useful. The test is based on the F statistic derived from the regression ANOVA table (see Sections 10.5 and 11.1 for more about F tests). A prototype multiple regression ANOVA table appears in Table 12.3.
Table 12.3 ANOVA table for multiple regression

Source of variation   df            Sum of squares   Mean square                 f
Regression            k             SSR              MSR = SSR/k                 MSR/MSE
Error                 n − (k + 1)   SSE              MSE = SSE/[n − (k + 1)]
Total                 n − 1         SST
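The entries of Table 12.3 are easy to compute from any fitted model. The following R sketch builds each quantity directly from the definitions, using the fuel efficiency fit of Figure 12.27 (the objects df and fit are the same assumed names as in the earlier snippet):

# Sums of squares and mean squares for the regression ANOVA table
y   <- df$Fuel.Eff
SSE <- sum(residuals(fit)^2)            # unexplained variation
SSR <- sum((fitted(fit) - mean(y))^2)   # explained variation
SST <- SSE + SSR                        # total variation, also sum((y - mean(y))^2)
k   <- 3; n <- length(y)
MSR <- SSR / k
MSE <- SSE / (n - (k + 1))
f   <- MSR / MSE                        # model utility F statistic (about 105.9 here)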
MODEL UTILITY TEST IN MULTIPLE REGRESSION

Null hypothesis:          H0: b1 = 0, b2 = 0, …, bk = 0
Alternative hypothesis:   Ha: at least one bj ≠ 0

Test statistic value:     f = MSR/MSE = (SSR/k) / (SSE/[n − (k + 1)])        (12.14)

When H0 is true, the test statistic has an F distribution with k numerator df and n − (k + 1) denominator df.

Rejection region for a level α test:  f ≥ Fα,k,n−(k+1)
P-value: area to the right of f under the Fk,n−(k+1) curve
To understand why the test statistic value (12.14) should be compared to this particular F distribution, divide the fundamental ANOVA identity by σ²:

SST/σ² = SSE/σ² + SSR/σ²

When H0 is true, the observations Y1, …, Yn all have the same mean μ = b0 and common variance σ². It follows from a proposition in Section 6.4 that SST/σ² has a chi-squared distribution with n − 1 df. It can also be shown that (1) SSE/σ² has a chi-squared distribution with n − (k + 1) df and (2) SSE and SSR are independent. Together, (1) and (2) imply—again, see Section 6.4—that SSR/σ² has a chi-squared distribution, with df = (n − 1) − (n − (k + 1)) = k. Finally, by definition the F distribution is the ratio of two independent chi-squared rvs divided by their respective dfs. Applying this to SSR/σ² and SSE/σ² leads to the F ratio

[(SSR/σ²)/k] / [(SSE/σ²)/(n − (k + 1))] = (SSR/k) / (SSE/[n − (k + 1)]) = MSR/MSE ~ Fk,n−(k+1)

The test statistic is identical in structure to the ANOVA F statistic from Chapter 11: the numerator measures the variation explained by the proposed model, while the denominator measures the unexplained variation. The larger the value of F, the stronger the evidence that a statistically significant relationship exists between y and the predictors. In fact, the model utility test statistic value can be re-written as
f = [R2 / (1 − R2)] · [n − (k + 1)] / k
so the test statistic is proportional to R2/(1 − R2), the ratio of the explained to unexplained variation. If the proportion of explained variation is high relative to unexplained, we would naturally want to reject H0 and confirm the utility of the model. However, the factor [n − (k + 1)]/k decreases as k increases, and if k is large relative to n, it will reduce f considerably. Because the model utility test considers all of the explanatory variables simultaneously, it is sometimes called a global F test.

Example 12.22 What impacts the salary offers made to faculty in management and information science (MIS)? The accompanying table shows part of the data available on a sample of 167 MIS faculty, which includes the following variables (based on publicly available data provided by Prof. Dennis Galletta, University of Pittsburgh):

y = salary offer, in thousands of dollars
x1 = year the salary offer was made, coded as years after 2000 (so 2009 is coded as 9)
x2 = previous experience, in years
x3 = teaching load, converted into the number of three-credit semester courses per year

Observation   Salary (y)   Year (x1)   Experience (x2)   Teaching load (x3)
1             90.0         3           5                 3
2             91.5         3           12                4
3             105.0        4           7                 4
4             79.2         6           3                 5
5             95.0         9           0.5               6
⋮             ⋮            ⋮           ⋮                 ⋮
The model utility test hypotheses are

H0: b1 = b2 = b3 = 0
Ha: at least one of these three bj’s is not 0

Figure 12.28 shows output from the JMP statistical package. The values of se (Root Mean Square Error), R2, and adjusted R2 certainly suggest a useful model. The value of the model utility F ratio is

f = (SSR/k) / (SSE/[n − (k + 1)]) = (20064.099/3) / (21418.105/163) = 6688.03/131.40 = 50.8985

This value also appears in the F ratio column of the ANOVA table in Figure 12.28. Since f = 50.8985 ≥ F.01,3,163 ≈ 3.9, H0 should be rejected at significance level .01. In fact, the ANOVA table in the JMP output shows that P-value < .0001. The null hypothesis should therefore be rejected at any reasonable significance level. We conclude that there is a useful linear relationship between y and at least one of the three predictors in the model. Note this does not mean that all three predictors are necessarily useful; we will say more about this shortly.
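For readers working in R rather than JMP, the critical value and P-value quoted above can be checked directly with the F distribution functions (a quick verification sketch, not part of the original JMP analysis):

# Critical value F.01,3,163 and the P-value for f = 50.8985
qf(.99, df1 = 3, df2 = 163)                           # approximately 3.9
pf(50.8985, df1 = 3, df2 = 163, lower.tail = FALSE)   # far below .0001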
Figure 12.28 Multiple regression output from JMP for the data of Example 12.22
■
Inferences about Individual Regression Coefficients

When the assumptions for the multiple linear regression model are met, we may also construct CIs and perform hypothesis tests for the individual population regression coefficients b1, …, bk. Inferences concerning a single coefficient bj are based on the standardized variable

T = (b̂j − bj) / Sb̂j

which, assuming the model is correct, has a t distribution with n − (k + 1) df. A matrix formula for sb̂j is given in Section 12.9, and the result is part of the output from all standard regression computer packages. A CI for bj allows us to estimate with confidence the effect of the predictor xj on the response variable, while adjusting for the effects of the other explanatory variables in the model.
By far the most commonly tested hypotheses about an individual bj are H0: bj = 0 versus Ha: bj ≠ 0, in which case the test statistic value simplifies to t = b̂j/sb̂j. This null hypothesis states
that, with the other explanatory variables in the model, the variable xj does not provide any additional useful information about y. This is referred to as a variable utility test. It is sometimes the case that a predictor variable will be judged useful for predicting y under the simple linear regression model using xj alone but not in the multiple regression setting (i.e., in the presence of other predictors). This usually indicates that the other variables do a better job predicting y, and that the additional
information in xj is effectively “redundant.” Occasionally, the opposite will happen: a predictor that isn’t very useful by itself proves statistically significant in the presence of some other variables.
Next, let x* = (x1*, …, xk*) denote a particular value of x = (x1, …, xk). Then the point estimate of μY|x*, the expected value of Y when x = x*, is ŷ = b̂0 + b̂1x1* + ⋯ + b̂kxk*. The estimated standard deviation sŶ of the corresponding estimator is a complicated expression involving the sample xij’s, but a matrix formula is given in Section 12.9. The better statistical software packages will calculate it on request. Inferences about μY|x* are based on standardizing its estimator to obtain a t variable having n − (k + 1) df.

1. A 100(1 − α)% CI for bj, the coefficient of xj in the model equation (12.12), is

   b̂j ± tα/2,n−(k+1) · sb̂j

2. A test for H0: bj = bj0 uses the test statistic value t = (b̂j − bj0)/sb̂j based on n − (k + 1) df. The test is upper-, lower-, or two-tailed according to whether Ha contains the inequality >, <, or ≠.

3. A 100(1 − α)% CI for μY|x* is

   ŷ ± tα/2,n−(k+1) · sŶ

4. A 100(1 − α)% PI for a future y value is

   ŷ ± tα/2,n−(k+1) · √(se² + sŶ²)
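In R, the intervals in items 1, 3, and 4 require no hand computation. A minimal sketch, assuming a fitted model object fit as in Figure 12.27 and a data frame new.x holding the x* values of interest (the object names here are illustrative, not from the original analysis):

# Item 1: CIs for the individual regression coefficients
confint(fit, level = 0.95)

# Items 3 and 4: CI for the mean response and PI for a future y at x = x*
predict(fit, newdata = new.x, interval = "confidence", level = 0.95)
predict(fit, newdata = new.x, interval = "prediction", level = 0.95)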
Simultaneous intervals for which the joint confidence or prediction level is controlled can be obtained by applying the Bonferroni technique discussed in Section 12.4.

Example 12.23 (Example 12.22 continued) The JMP output in Figure 12.28 includes variable utility tests and 95% confidence intervals for the coefficients. The results of testing H0: b2 = 0 versus Ha: b2 ≠ 0 (x2 = years of experience) are

b̂2 = .2068,   sb̂2 = .2321,   t = .2068/.2321 = 0.89,   P-value = .3742

so H0 is not rejected here. Adjusting for the year a salary offer was made and the position’s teaching load, years of prior experience does not provide additional useful information about MIS faculty members’ salary offers. The other two variables are useful (P-value < .0001 for each).
A 95% CI for b3 is

b̂3 ± t.025,167−(3+1) · sb̂3 = −6.58 ± 1.975(.5642) = (−7.694, −5.466)

which agrees with the interval given in Figure 12.28. Thus after adjusting for offer year and prior experience, a one-course increase in teaching load is associated with a decrease in expected salary offer between $5466 and $7694. (If that seems counterintuitive, bear in mind that elite institutions can
offer both higher salaries and lighter teaching loads, while the opposite is true at a typical state university.)
The predicted salary offer in 2015 (x1 = 15) for a newly minted PhD (x2 = 0, no experience) and a five-course annual teaching load (x3 = 5) is

ŷ = 115.558 + 2.18534(15) + .2068(0) − 6.580(5) = 115.437

(that is, $115,437). The estimated standard deviation for this predicted value can be obtained from software: sŶ = 4.929. So a 95% confidence interval for the mean offer under these settings is

ŷ ± t.025,163 · sŶ = 115.437 ± 1.975(4.929) = (105.704, 125.171)

which can also be obtained from software. A 95% prediction interval for a single future salary offer under these settings is

ŷ ± t.025,163 · √(se² + sŶ²) = 115.437 ± 1.975·√(11.463² + 4.929²) = (90.798, 140.076)
Of course, the PI is much wider than the corresponding CI. Since x2 (years of prior experience) was deemed not useful, the model could also be re-fit without that variable, resulting in somewhat different estimated regression coefficients and, consequently, slightly different intervals above. ■ Assessing Model Adequacy The model assumptions of linearity, normality, constant variance, and independence are essentially the same for simple and multiple regression. Scatterplots of y versus each of the explanatory variables can give a preliminary sense of whether linearity is plausible, but the residual plots detailed in Section 12.6 are preferred. The standardized residuals in multiple regression result from dividing each residual ei by its estimated standard deviation; a matrix formula for the standard deviation of ei is given in Section 12.9. We recommend a normal probability plot of the standardized residuals as a basis for validating the normality assumption. Plots of the standardized residuals versus each predictor and/or versus ŷ should show no discernible pattern. If the linearity and/or constant variance conditions appear violated, transformation of the response variable (possibly in tandem with transforming some xj’s) may be required. The book by Kutner et al. discusses transformations as well as other diagnostic plots. Example 12.24 (Example 12.23 continued) Figure 12.29 shows a normal probability plot of the standardized residuals, as well as a plot of the standardized residuals versus the fitted values ŷi, for the MIS salary data. The probability plot is sufficiently straight that there is no reason to doubt the assumption of normally distributed errors. The residual-vs-fit plot shows no pattern, validating the linearity and constant variance assumptions. Plots of the standardized residuals against the explanatory variables x1, x2, and x3 (not shown) also exhibit no discernable pattern.
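Plots like those in Figure 12.29 are straightforward to produce with software. A minimal R sketch, again assuming a fitted lm object fit (the same approach applies to the MIS salary fit or the fuel efficiency fit):

# Standardized residuals for a fitted multiple regression model
std.res <- rstandard(fit)

# Normal probability plot and residual-versus-fitted plot
qqnorm(std.res); qqline(std.res)
plot(fitted(fit), std.res); abline(h = 0)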
Figure 12.29 Residual plots for the MIS salary data
■
Exercises: Section 12.7 (77–87)

77. Let y = weekly sales at a fast-food outlet (in dollars), x1 = number of competing outlets within a 1-mile radius, and x2 = population within a 1-mile radius (in thousands of people). Suppose that the true regression model is

Y = 10000 − 1400x1 + 2100x2 + e
a. Determine expected sales when the number of competing outlets is 2 and there are 8000 people within a 1-mile radius. b. Determine expected sales for an outlet that has three competing outlets and 5000 people within a 1-mile radius. c. Interpret b1 and b2. d. Interpret b0. In what context does this value make sense?
78. Cardiorespiratory fitness is widely recognized as a major component of overall physical well-being. Direct measurement of maximal oxygen uptake (VO2max) is the single best measure of such fitness, but direct measurement is time-consuming and expensive. It is therefore desirable to have a prediction equation for VO2max in terms of easily obtained quantities. Consider the variables

y = VO2max (L/min)
x1 = weight (kg)
x2 = age (yr)
x3 = time necessary to walk 1 mile (min)
x4 = heart rate at the end of the walk (beats/min)

Here is one possible model, for male students, consistent with the information given in the article “Validation of the Rockport Fitness Walking Test in College Males and Females” (Res. Q. Exerc. Sport 1994: 152–158):

Y = 5.0 + .01x1 − .05x2 − .13x3 − .01x4 + e,   σ = .4
a. Interpret b1 and b3. b. What is the expected value of VO2max when weight is 76 kg, age is 20 year, walk time is 12 min, and heart rate is 140 beats/min? c. What is the probability that VO2max will be between 1.00 and 2.60 for a single observation made when the values of the predictors are as stated in part (b)? 79. Athletic footwear is a multibillion-dollar industry, and manufacturers need to understand which features are most important to customers. The article “Overall Preference of Running Shoes Can Be Predicted by Suitable Perception Factors Using a Multiple Regression Model” (Human Factors 2017: 432–441) reports a survey of
100 young male runners in Beijing and Singapore. Each participant was asked to assess the Li Ning Hyper Arc (a running shoe) on five features: y = overall preference, x1 = fit, x2 = cushioning, x3 = arch support and x4 = stability. All measurements were made on a 0–15 visual analog scale, with 0 = dislike extremely and 15 = like extremely. a. The estimated regression equation reported in the article is y = –.66 + .35x1 + .34x2 + .09x3 + .32x4. Interpret the coefficient on x2. [Note: The units are simply “points.”] b. Estimate the true mean rating from runners whose ratings on fit, cushioning, arch support, and stability are 9.0, 8.7, 8.9, and 9.2, respectively. (These were the average ratings across all 100 participants.) What would be more informative than this point estimate? c. The authors report R2 = .777 for this four-variable model. Perform a model utility test at the .01 significance level. Can we conclude that all four predictors provide useful information? d. The article also reports variable utility test statistic values for each predictor; in order, they are t = 6.23, 4.92, 1.35, and 5.51. Perform all four variable utility tests at a simultaneous .01 level. Are all four predictors considered useful? 80. Roads in Egypt are notoriously haphazard, often lacking proper engineering consideration. The article “Effect of Speed Hump Characteristics on Pavement Condition” (J. Traffic Transp. Engr. 2017: 103–110) reports a study by two Egyptian civil engineering faculty of 52 speed bumps on a major road in upper Egypt. They measured each speed bump’s height (x1), width (x2),
and distance (x3) from the preceding speed bump (all in meters). They also evaluated each bump using a pavement condition index (PCI), a 0–100 scale with 100 the best condition.
a. With y = PCI, the estimated regression equation reported in the article is y = 59.05 – 243.668x1 + 11.675x2 + .012x3. Interpret the coefficient on x1. [Hint: Does a one-meter increase make sense?]
b. Estimate the pavement condition index of a speed bump 0.13 m tall, 2.5 m wide, and 666.7 m from the preceding speed bump.
c. The authors report R2 = .770 for this three-variable model. Interpret this value, and then carry out the model utility test at the .05 significance level.

81. The accompanying data on sale price (thousands of dollars), size (thousands of sq. ft.), and land-to-building ratio for 10 large industrial properties appeared in the paper “Using Multiple Regression Analysis in Real Estate Appraisal” (Appraisal J. 2001: 424–430).

Price   Size   L/B ratio   Price   Size   L/B ratio
10600   2167   2.011        8000   2867    2.279
 2625    752   3.543       10000   1698    3.123
10500   2423   3.632        6670   1046    4.771
 1850    225   4.653        5825   1109    7.569
20000   3918   1.712        4517    405   17.190
a. Use software to create an estimated multiple regression equation for predicting the sale price of a property from its size and land-to-building ratio. b. Interpret the estimated regression coefficients in this example. c. Based on the data, what is the predicted sale price for a 500,000 ft.2 industrial property with a land-to-building ratio of 4.0?
82. There has been a shift recently away from using harsh chemicals to dye textiles in favor of natural plant extracts. The article “Ecofriendly Dyeing of Silk with Extract of Yerba Mate” (Textile Res. J. 2017: 829–837) describes an experiment to study the effects of dye concentration (mg/L), temperature (°C), and pH on dye adsorption (mg of dye per gram of fabric). [The article also included pictures of the resulting color from each treatment combination; dye adsorption is a proxy for color here.]

Conc.   Temp.   pH    Adsorption
10      70      3.0   250
20      70      3.0   520
10      90      3.0   387
20      90      3.0   593
10      70      4.0   157
20      70      4.0   377
10      90      4.0   225
20      90      4.0   451
15      80      3.5   353
15      80      3.5   382
15      80      3.5   373
a. Obtain the estimated regression equation for this data. Then, interpret the coefficient on temperature. b. Calculate a point estimate for mean dye adsorption when concentration = 15 mg/L, temperature = 80 °C, and pH = 3.5 (i.e., the settings of the last three experimental runs). c. The model utility test results in a test statistic value of f = 44.02. What can be concluded at the α = .01 level? d. Calculate and interpret a 95% CI for the settings specified in part (b). e. Calculate and interpret a 95% PI for the settings specified in part (b). f. Perform variable utility tests on each of the predictors. Can each one be judged useful provided that the other two are included in the model?
83. Carbon nanotubes (CNTs) are used for everything from structural reinforcement to communication antennas. The article “Fast Mechanochemical Synthesis of Carbon Nanotube-Polyaniline Hybrid Materials” (J. Mater. Res. 2018: 1486–1495) reported the following data on y = electrical conductivity (S/cm), x1 = multi-walled CNT weight (mg), x2 = CNx weight (mg), and x3 = water volume (ml) from one experiment.

y       x1     x2     x3
1.337    0.0   20.1    2
4.439   20.2    0.0    2
4.512   40.6   20.5    2
4.153   20.3   40.4    2
1.727   20.1   20.1    7
2.415    0.0   40.4    7
3.008   20.3   20.1    7
3.869   20.3   20.3    7
2.140   40.9   40.4    7
2.025   20.4   20.2    7
3.407   40.4    0.0    7
2.545   20.4   20.1    7
4.426   20.3    0.0   12
2.863   40.4   20.3   12
2.096    0.0   20.4   12
2.006   20.8   40.6   12
a. Obtain and interpret R2 and se for the model with predictors x1, x2, and x3. b. Test for model utility using α = .05. c. Does the data support the manufacturing goal of (relatively) consistent electrical conductivity across differing values of the experimental factors? Explain.

84. Electric vehicles have greatly increased in popularity recently, but their short battery life (with a few exceptions) continues to be of concern. The article “Design of Robust Battery Capacity Model for Electric Vehicle by Incorporation of Uncertainties” (Int. J. Energy Res. 2017: 1436–1451) includes the following data on temperature (°C), discharge rate, and battery capacity (A-h, ampere-hours) for a certain type of lithium-ion battery.

Temp   Disch. rate   Capacity    Temp   Disch. rate   Capacity
 0     0.50          0.96001      0     1.25          1.06506
 0     1.75          0.85001     20     1.25          1.34459
 0     3.00          0.89001     25     1.25          1.45473
25     0.50          1.38001     30     1.25          1.32355
40     1.75          1.05396     40     1.25          1.54713
40     3.00          0.96337     50     1.25          1.47159
20     0.50          1.27100      0     1.50          0.85171
20     1.75          1.24897     20     1.50          1.20890
20     3.00          1.20751     25     1.50          1.29703
25     3.00          1.32001     30     1.50          1.16097
50     3.00          1.39139     40     1.50          1.32047
40     0.50          1.00208     50     1.50          1.36305
50     0.50          1.44001     30     1.25          1.32355
25     1.75          1.30001     40     1.25          1.54713
30     1.75          0.82697     50     1.25          1.47159
50     1.75          1.42713      0     1.50          0.85171
 0     1.00          1.08485     20     1.50          1.20890
20     1.00          1.40556     25     1.50          1.29703
25     1.00          1.47986     30     1.50          1.16097
30     1.00          0.91734     40     1.50          1.32047
40     1.00          1.18187     50     1.50          1.36305
50     1.00          1.53058
a. Determine the estimated regression equation based on this data (y = capacity). b. Calculate a point estimate for the expected capacity of a lithium-ion battery of this type when operated at 30 °C with discharge rate 1.00 (meaning the battery should drain in 1 h). c. Perform a model utility test at the .05 significance level. d. Calculate a 95% CI for expected capacity at the settings in part (b). e. Calculate a 95% PI for the capacity of a single such battery at the settings in part (b). f. Perform variable utility tests on both predictors at a simultaneous .05 significance level. Are both temperature and discharge rate useful predictors of capacity? 85. Steel microfibers are an alternative to conventional rebar reinforcement of concrete structures. The article “Measurement of Average Tensile Force for Individual Steel Fiber Using New Direct Tension Test”
(J. Test. Eval. 2016: 2403–2413) proposes a new evaluation method for concrete specimens infused with twisted steel micro rebar (TSMR) fibers, which were loaded until a crack occurred. The accompanying data on y = load until cracking (lb), x1 = diameter at break (in), x2 = number of TSMR fibers per in², and x3 = concrete compressive strength (psi) appears in the article.

y      x1      x2    x3      y      x1      x2    x3
 213   2.778   1.2   7020     688   2.735   3.4   6130
 706   2.725   3.3   7020     180   2.705   2.6   6130
 440   2.683   3.4   7020    1046   2.755   7.7   6130
 984   2.835   8.2   5680     272   2.734   3.2   6880
 155   2.779   2.3   5680    1168   2.725   7.4   7270
 251   2.712   2.3   5680     821   2.750   6.1   7270
 311   2.712   1.6   7960     418   2.740   4.4   7270
 989   2.686   6.0   7960    1102   2.708   9.2   7390
 326   2.821   4.0   7960      56   2.860   1.6   2950
 479   2.810   2.7   7090      91   3.002   1.3   2950
 324   2.845   1.9   7090      97   2.749   1.9   2950
 404   2.755   3.0   7090
a. Determine the estimated regression equation for this data. b. Perform variable utility tests on each of the three explanatory variables. Can each one be judged useful given that the other two are included in the model? c. Calculate R2 and adjusted R2 for this three-predictor model. d. Perform a multiple regression of y on just x2 and x3. Determine both R2 and adjusted R2 for this reduced model. How do they compare to the values in part (c)? Explain. 86. An investigation of a die casting process resulted in the accompanying data on x1 = furnace temperature, x2 = die close time, and y = temperature difference on the die surface (“A Multiple-Objective Decision-Making Approach for Assessing Simultaneous Improvement in Die Life and Casting Quality in a Die Casting Process,” Qual. Engr. 1994: 371–383).
x1   1250   1300   1350   1250   1300
x2      6      7      6      7      6
y      80     95    101     85     92

x1   1250   1300   1350   1350
x2      8      8      7      8
y      87     96    106    108
Minitab output from fitting the multiple regression model with predictors x1 and x2 is given here.

The regression equation is
tempdiff = −200 + 0.210 furntemp + 3.00 clostime

Predictor      Coef       St dev     t ratio      p
Constant    −199.56        11.64      −17.14    0.000
furntemp    0.210000     0.008642      24.30    0.000
clostime      3.0000       0.4321       6.94    0.000

s = 1.058   R-sq = 99.1%   R-sq(adj) = 98.8%

Analysis of variance
Source        DF        SS        MS         F       P
Regression     2    715.50    357.75    319.31   0.000
Error          6      6.72      1.12
Total          8    722.22
a. Carry out the model utility test. b. Calculate and interpret a 95% confidence interval for b2, the population regression coefficient of x2. c. When x1 = 1300 and x2 = 7, the estimated standard deviation of Ŷ is sŶ = .353. Calculate a 95% confidence interval for true average temperature difference when furnace temperature is 1300 and die close time is 7. d. Calculate a 95% prediction interval for the temperature difference resulting from a single experimental run with a furnace temperature of 1300 and a die close time of 7. e. Use appropriate diagnostic plots to see if there is any reason to question the regression model assumptions. 87. The article “Analysis of the Modeling Methodologies for Predicting the Strength of Air-Jet Spun Yarns” (Textile Res.
J. 1997: 39–44) reported on a study carried out to relate yarn tenacity (y, in g/tex) to yarn count (x1, in tex), percentage polyester (x2), first nozzle pressure (x3, in kg/cm2), and second nozzle pressure (x4, in kg/cm2). The estimate of the constant term in the corresponding multiple regression equation was 6.121. The estimated coefficients for the four predictors were −.082, .113, .256, and −.219, respectively, and the coefficient of multiple determination was .946. Assume that n = 25. a. State and test the appropriate hypotheses to decide whether the fitted model
specifies a useful linear relationship between the response variable and at least one of the four model predictors. b. Calculate the value of adjusted R2 and comment. c. Calculate a 99% confidence interval for true mean yarn tenacity when yarn count is 16.5, yarn contains 50% polyester, first nozzle pressure is 3, and second nozzle pressure is 5 if the estimated standard deviation of predicted tenacity under these circumstances is .350.
12.8 Quadratic, Interaction, and Indicator Terms
The fit of a multiple regression model can often be improved by creating new predictors from the original explanatory variables. In this section we discuss the two primary examples: quadratic terms and interaction terms. We also explain how to incorporate categorical predictor variables into the multiple regression model through the use of indicator variables.

Polynomial Regression

Let’s return for a moment to the case of bivariate data consisting of n (x, y) pairs. Suppose that a scatterplot shows a parabolic rather than linear shape. Then it is natural to specify a quadratic regression model:

Y = b0 + b1x + b2x² + e

The corresponding population regression function f(x) = b0 + b1x + b2x² gives the mean or expected value of Y for any particular x. What does this have to do with multiple regression? Re-write the quadratic model equation as follows:

Y = b0 + b1x1 + b2x2 + e   where x1 = x and x2 = x²

Now this looks exactly like a multiple regression equation with two predictors. Although we interpret this model as a quadratic function of x, the multiple linear regression model (12.12) only requires that the response be a linear function of the bj’s and e. Nothing precludes one predictor being a mathematical function of another one. So, from a modeling perspective, quadratic regression is a special case of multiple regression. Thus any software package capable of carrying out a multiple regression analysis can fit the quadratic regression model. The same is true of cubic regression and even higher-order polynomial models, although in practice very rarely are such higher-order predictors needed.
The coefficient b1 on the linear predictor x1 cannot be interpreted as the change in expected Y when x1 increases by one unit while x2 is held fixed. This is because it is impossible to increase x without also increasing x². A similar comment applies to b2. More generally, the interpretation of regression coefficients requires extra care when some predictor variables are mathematical functions of others.
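In R, for example, a quadratic model can be requested directly in the lm() formula, so the squared predictor never has to be constructed by hand. A minimal sketch, assuming bivariate data in vectors x and y (hypothetical names, not tied to a specific example):

# Quadratic regression as a multiple regression with predictors x and x^2
quad.fit <- lm(y ~ x + I(x^2))   # I() tells R to square x rather than expand the formula
summary(quad.fit)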
Example 12.25 Reconsider the solar cell data of Example 12.2. Figure 12.2 clearly shows a parabolic relationship between x = sheet resistance and y = cell efficiency. To calculate the “least squares parabola” for this data, software fits a multiple regression model with two predictors: x1 = x and x2 = x². The first few rows of data for this scenario are as follows:

y       x1 = x   x2 = x²
13.91   43.58    1898.78
13.50   50.94    2594.63
13.59   60.03    3603.60
13.86   66.82    4464.91
⋮       ⋮        ⋮

(In most software packages, it is not necessary to calculate x² for each observation; rather, the user can merely instruct the software to fit a quadratic model.) The coefficients that minimize the residual sum of squares are b̂0 = 4.008, b̂1 = .3617, and b̂2 = −.003344, so the estimated regression equation is

y = 4.008 + .3617x1 − .003344x2 = 4.008 + .3617x − .003344x²

Figure 12.30 shows this parabola superimposed on a scatterplot of the original (x, y) data. Notice that the negative coefficient on x² matches the concave-downward contour of the data.
Figure 12.30 Scatterplot for Example 12.25 with a best-fit parabola
The estimated equation can now be used to make estimates and predictions at any particular x value. For example, the predicted efficiency at x = 60 ohms is determined by substituting x1 = x = 60 and x2 = x² = 60² = 3600:

y = 4.008 + .3617(60) − .003344(60)² = 13.67 percent

Using software, a 95% CI for the mean efficiency of all 60-ohm solar panels is (13.50, 13.84), while a 95% PI for the efficiency of a single future 60-ohm panel is (12.32, 15.03). As always, the prediction interval is substantially wider than the confidence interval. ■
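Intervals like the two just quoted come directly from predict() applied to the quadratic fit. A brief sketch, continuing the hypothetical quad.fit object from the earlier snippet (the exact numbers depend on the full solar cell data set):

# CI for mean efficiency and PI for a single panel at x = 60 ohms
predict(quad.fit, newdata = data.frame(x = 60), interval = "confidence")
predict(quad.fit, newdata = data.frame(x = 60), interval = "prediction")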
Models with Interaction

Suppose that an industrial chemist is interested in the relationship between product yield (y) from a certain reaction and two explanatory variables, x1 = reaction temperature and x2 = pressure at which the reaction is carried out. The chemist initially proposes the relationship

Y = 1200 + 15x1 − 35x2 + e

for temperature values between 80 and 100 in combination with pressure values ranging from 50 to 70. The population regression function 1200 + 15x1 − 35x2 gives the mean y value for any particular values of the predictors. Consider this mean y value for three different particular temperatures:

x1 = 90:  mean y value = 1200 + 15(90) − 35x2 = 2550 − 35x2
x1 = 95:  mean y value = 2625 − 35x2
x1 = 100: mean y value = 2700 − 35x2

Graphs of these three mean y value functions are shown in Figure 12.31a. Each graph is a straight line, and the three lines are parallel, each with a slope of −35. Thus irrespective of the fixed value of temperature, the change in mean yield associated with a one-unit increase in pressure is −35.
Figure 12.31 Graphs of the mean y value for two different models: (a) 1200 + 15x1−35x2; (b) −4500 + 75x1 + 60x2 − x1x2
In reality, when pressure increases the decline in average yield should be more rapid for a high temperature than for a low temperature, so the chemist has reason to doubt the appropriateness of the proposed model. Rather than the lines being parallel, the line for a temperature of 100 should be steeper than the line for a temperature of 95, and that line in turn should be steeper than the line for x1 = 90. A model that has this property includes, in addition to x1 and x2, a third predictor variable, x3 = x1x2. One such model is

Y = −4500 + 75x1 + 60x2 − x1x2 + e

for which the population regression function is −4500 + 75x1 + 60x2 − x1x2. This gives
(mean y value when temperature is 100) = −4500 + (75)(100) + 60x2 − 100x2 = 3000 − 40x2
(mean value when temperature is 95)   = 2625 − 35x2
(mean value when temperature is 90)   = 2250 − 30x2

These are graphed in Figure 12.31b. Now each different value of x1 yields a line with a different slope, so the change in expected yield associated with a 1-unit increase in x2 depends on the value of x1. When this is the case, the two predictor variables are said to interact.
DEFINITION
If the effect on y of one explanatory variable x1 depends on the value of a second explanatory variable x2, then x1 and x2 have an interaction effect on the (mean) response. We can model this interaction by including as an additional predictor x3 = x1x2, the product of the two explanatory variables, known as an interaction term.
The general equation for a multiple regression model based on two explanatory variables x1 and x2 and also including an interaction term is

Y = b0 + b1x1 + b2x2 + b3x3 + e   where x3 = x1x2
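Statistical software fits such a model exactly as it fits any other multiple regression; the only new step is declaring the product term. In R the product can be written directly in the formula. A minimal sketch with hypothetical variable names yield, temp, and pressure:

# Two equivalent ways to include the temp-by-pressure interaction
fit.int <- lm(yield ~ temp + pressure + temp:pressure)   # explicit interaction term
fit.int <- lm(yield ~ temp * pressure)                    # shorthand: main effects plus interaction
summary(fit.int)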
When an interaction effect is present, this model will usually give a much better fit to the data than would the no-interaction model. Failure to consider a model with interaction too often leads an investigator to conclude incorrectly that the relationship between y and a set of explanatory variables is not very substantial. In applied work, quadratic predictors x1² and x2² are often also included, to model a potentially curved relationship. This leads to the complete second-order model

Y = b0 + b1x1 + b2x2 + b3x1x2 + b4x1² + b5x2² + e

This model replaces the straight lines of Figure 12.31 with parabolas (each one is the graph of the population regression function as x2 varies when x1 has a particular value).

Example 12.26 The need to devise environmentally friendly remedies for heavy-metal contaminated sites has become a global issue of serious concern. The article “Polyaspartate Extraction of Cadmium Ions from Contaminated Soil” (J. Hazard. Mater. 2018: 58–68) describes one possible cleanup method. Researchers varied five experimental factors: x1 = polyaspartate (PA) concentration (mM), x2 = PA-to-soil ratio, x3 = initial cadmium (Cd) concentration in soil, x4 = pH, and x5 = extraction time (hours). One of the response variables of interest was y = residual Cd concentration, based on a total of n = 47 experimental runs.
Consider fitting a “first-order” model using all five predictors. Results from software include

y = 38.7 − .174x1 − 1.308x2 + .242x3 + 10.26x4 + 2.328x5
se = 33.124 (df = 41),   R2 = 71.25%,   R2a = 67.74%

Variable utility tests indicate that all predictors except x1 are useful; it’s possible that x1 is redundant with x2.
At the other extreme, a complete second-order model here involves 20 predictor variables: the original xj’s (five), their squares (another five), and all (5 choose 2) = 10 possible interaction terms: x1x2, x1x3, …, x4x5. Summary quantities from fitting this enormous model include se = 24.878 (df = 26), R2 = 89.72%, and R2a = 81.80%. The reduced standard deviation and greatly increased adjusted R2 both suggest that at least some of the 15 second-order terms are useful, and so it was wise to incorporate these additional terms. By considering the relative importance of the terms based on P-values, the researchers reduced their model to “just” 12 terms: all first-order terms, two of the quadratic terms, and five of the ten interactions. Based on the resulting estimated regression equation (shown in the article, but not here) researchers were able to determine the values of x1, …, x5 that minimize residual Cd concentration. (Optimization is one of the added benefits of quadratic terms. For example, in Figure 12.30, we can see there is an x value for which solar cell efficiency is maximized. A linear model has no such local maxima or minima.)
It’s worth noting that while x1 was not considered useful in the first-order model, several second-order terms involving x1 were significant. When fitting second-order models, it is recommended to fit the complete model first and delete useless terms rather than building up from the simpler first-order model; using the latter approach, important quadratic and interaction effects can be missed. ■

One issue that arises with fitting a model with an abundance of terms, as in Example 12.26, is the potential to commit many type I errors when performing variable utility t tests on every predictor. Exercise 94 presents a method called the partial F test for determining whether a group of predictors can all be deleted while controlling the overall type I error rate.

Models with Categorical Predictors

Thus far we have explicitly considered the inclusion of only quantitative (numerical) predictor variables in a multiple regression model. Using simple numerical coding, categorical variables such as sex, type of college (private or state), or type of wood (pine, oak, or walnut) can also be incorporated into a model. Let’s first focus on the case of a dichotomous variable, one with just two possible categories—alive or dead, US or foreign manufacture, and so on. With any such variable, we associate an indicator (or dummy) variable whose possible values 0 and 1 indicate which category is relevant for any particular observation.

Example 12.27 Is it possible to predict graduation rates from freshman test scores? Based on the median SAT score of entering freshmen at a university, can we predict the percentage of those freshmen who will get a degree there within six years? To investigate, let y = six-year graduation rate, x2 = median freshman SAT score, and x1 = a variable defined to indicate private or public status:
x1 = 1 if the university is private, x1 = 0 if the university is public
The corresponding multiple regression model is

Y = b0 + b1x1 + b2x2 + e

The mean graduation rate depends on whether the university is public or private:

mean graduation rate = b0 + b2x2          when x1 = 0 (public)
mean graduation rate = b0 + b1 + b2x2     when x1 = 1 (private)
Thus there are two parallel lines with vertical separation b1, as shown in Figure 12.32a. The coefficient b1 is the difference in mean graduation rates between private and public universities, after adjusting for median SAT score. If b1 > 0 then, on average, for a given SAT, private universities will have a higher graduation rate.
Figure 12.32 Regression functions for models with one indicator variable (x1) and one quantitative variable (x2): (a) no interaction; (b) interaction
A second possibility is a model with an interaction term:

Y = b0 + b1x1 + b2x2 + b3x1x2 + e

Now the mean graduation rates for the two types of university are

mean graduation rate = b0 + b2x2                  when x1 = 0 (public)
mean graduation rate = b0 + b1 + (b2 + b3)x2      when x1 = 1 (private)
Here we have two lines, where b1 is the difference in intercepts and b3 is the difference in slopes, as shown in Figure 12.32b. Unless b3 = 0, the lines will not be parallel and there will be an interaction effect, meaning that the separation between public and private graduation rates depends on SAT. To make inferences, we obtained a random sample of 20 Master’s level universities from the 2017 data file available on www.collegeresults.org.
University                              Grad rate   Median SAT   Sector
Appalachian State University            73.4        1140         Public
Brenau University                       49.4         973         Private
Campbellsville University               34.1        1008         Private
Delta State University                  39.6        1028         Public
DeSales University                      70.1        1072         Private
Lasell College                          54.1         966         Private
Marshall University                     49.3        1010         Public
Medaille College                        43.9         906         Private
Mount Saint Joseph University           60.7        1011         Private
Mount Saint Mary College                53.8         991         Private
Muskingum University                    48.2        1009         Private
Pacific University                      64.4        1122         Private
Simpson University                      56.7         985         Private
SUNY Oneonta                            70.9        1082         Public
Texas A&M University-Texarkana          29.7        1016         Public
Truman State University                 74.9        1224         Public
University of Redlands                  77.0        1101         Private
University of Southern Indiana          39.6        1005         Public
University of Tennessee-Chattanooga     45.2        1088         Public
Western State Colorado University       41.0        1026         Public
First of all, does the interaction predictor provide useful information over and above what is contained in x1 and x2? To answer this question, we should test the hypothesis H0: b3 = 0 versus Ha: b3 ≠ 0 first. If H0 is not rejected (meaning interaction is not informative) then we can use the parallel lines model to see if there is a separation (b1) between lines. Of course, it does not make sense to estimate the difference between lines if the difference depends on x2, which is the case when there is interaction.
Figure 12.33 shows R output for these two tests. The coefficient for interaction has a P-value of roughly .42, so there is no reason to reject the null hypothesis H0: b3 = 0. Since we fail to reject the “no-interaction” hypothesis, we drop the interaction term and re-run the analysis. The estimated regression equation specified by R is

y = −124.56039 + 13.33553x1 + 0.16474x2

The t ratio values and P-values indicate that both H0: b1 = 0 and H0: b2 = 0 should be rejected at the .05 significance level. The coefficient on x1 indicates that a private university is estimated to have a graduation rate about 13 percentage points higher than a state university with the same median SAT.

Testing with interaction
Coefficients:
                           Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)              -152.37773    49.18039   -3.098  0.006904 **
SectorPrivate              70.06920    69.28401    1.011  0.326908
Median.SAT                  0.19077     0.04592    4.154  0.000746 ***
SectorPrivate:Median.SAT   -0.05457     0.06649   -0.821  0.423857

Testing without interaction
Coefficients:
                Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   -124.56039    35.29279   -3.529  0.002575 **
SectorPrivate   13.33553     4.64033    2.874  0.010530 *
Median.SAT       0.16474     0.03289    5.009  0.000108 ***

Figure 12.33 R output for an interaction model and a “parallel lines” model
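The two fits summarized in Figure 12.33 rely on R’s built-in handling of categorical predictors: when Sector is a factor, lm() creates the 0–1 indicator (labeled SectorPrivate in the output) automatically. A minimal sketch, assuming the 20 observations are in a data frame univ with columns Grad.Rate, Median.SAT, and Sector (these column names are assumptions based on the output labels):

# Interaction model, then the "parallel lines" (no-interaction) model
fit.interaction <- lm(Grad.Rate ~ Sector * Median.SAT, data = univ)
fit.parallel    <- lm(Grad.Rate ~ Sector + Median.SAT, data = univ)
summary(fit.interaction)
summary(fit.parallel)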
At the same time, the coefficient on x2 shows that, after adjusting for a university’s sector (private or public), a 100-point increase in median SAT score is associated with roughly a 16.5 percentage point increase in the school’s six-year graduation rate. ■

You might think that the way to handle a three-category variable is to define a single numerical predictor with coded values such as 0, 1, and 2 corresponding to the three categories. This is incorrect: doing so imposes an ordering on the categories that is not necessarily implied by the context, and it forces the difference in mean response between the 0 and 1 categories to equal the difference for categories 1 and 2 (because 1 − 0 = 2 − 1 and the model is linear in its predictors). The correct way to incorporate three categories is to define two different indicator variables. Suppose, for example, that y = score on a posttest taken after instruction, x1 = score on an ability pretest taken before instruction, and that there are three methods of instruction in a mathematics unit: (1) with symbols, (2) without symbols, and (3) a hybrid method. Then let

x2 = 1 for instruction method 1, x2 = 0 otherwise
x3 = 1 for instruction method 2, x3 = 0 otherwise

For an individual taught with method 1, x2 = 1 and x3 = 0, whereas for an individual taught with method 2, x2 = 0 and x3 = 1. For an individual taught with method 3, x2 = x3 = 0, and it is not possible that x2 = x3 = 1 because an individual cannot be taught simultaneously by both methods 1 and 2. The no-interaction model would have only the predictors x1, x2, and x3. The following interaction model allows the change in mean posttest score associated with a one-point increase in pretest to depend on the method of instruction:

Y = b0 + b1x1 + b2x2 + b3x3 + b4x1x2 + b5x1x3 + e

Construction of a picture like Figure 12.32 with a graph for each of the three possible (x2, x3) pairs gives three nonparallel lines (unless b4 = b5 = 0). More generally, incorporating a categorical variable with c possible categories into a multiple regression model requires the use of c − 1 indicator variables (e.g., five methods of instruction would necessitate using four indicator variables). Thus even one categorical variable can add many predictors to a model.
Indicator variables can be used for categorical variables without any other predictors in the model. For example, consider Example 11.3, which compared the maximum power of five different thermoelectric modules. Using a regression with four indicator variables (to represent the five categories) produces the exact same ANOVA table presented in Example 11.3. In particular, the “treatment sum of squares” SSTr in Chapter 11 and the “regression sum of squares” SSR of this chapter are identical, as are SSE, SST, and hence the results of the F test. In a sense, analysis of variance is a special case of multiple regression, with the only predictor variables being indicators for various categories. Analysis that involves both quantitative and categorical predictors, as in Example 12.27, is sometimes called analysis of covariance, and the quantitative variable(s) are called covariates. This terminology is typically applied when the effect of the categorical predictor is of primary interest, while the inclusion of the quantitative variables serves to reduce the amount of unexplained variation in the model.
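In R, this coding does not have to be done by hand: declaring the categorical variable as a factor makes lm() generate the c − 1 indicator columns automatically, and model.matrix() shows exactly which indicators were created. A small illustrative sketch with hypothetical variables posttest, pretest, and method (a three-level factor), analogous in spirit to the interaction model above, though R will choose its own baseline category:

# Three-level instruction method handled via automatically generated indicator variables
method  <- factor(method, levels = c("1", "2", "3"))
fit.cov <- lm(posttest ~ pretest * method)   # pretest slope allowed to differ by method
model.matrix(fit.cov)                        # displays the indicator (dummy) columns R built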
Exercises: Section 12.8 (88–98) 88. The article “Selling Prices/Sq. Ft. of Office Buildings in Downtown Chicago – How Much Is It Worth to Be an Old but Class-A Building?” (J. Real Estate Res. 2010: 1–22) considered a regression model to relate y = ln($/ft2) to 16 predictors, including age, age squared, number of stories, occupancy rate, and indicator variables for whether it is a class-A building, whether a building has a restaurant and whether it has conference rooms. The model was fit to data resulting from 203 sales. a. The coefficient of multiple determination was .711. What is the value of the adjusted coefficient of multiple determination? Does it suggest that the relatively high R2 value was the result of including too many predictors in the model relative to the amount of data available? b. Using the R2 value from (a), carry out a test of hypotheses to see whether there is a useful linear relationship between the response variable and at least one of the predictors. c. The estimated coefficient of the indicator variable for whether or not a building was class-A was .364. Interpret this estimated coefficient, first in terms of y and then in terms of $/ft2. d. The t ratio for the estimated coefficient of (c) was 5.49. What does this tell you? 89. Cerium dioxide (also called ceria) is used in many applications, including pollution control and wastewater treatment. The article “Mechanical Properties of Gelcast Cerium Dioxide from 23 to 1500 °C” (J. Engr. Mater. Technol. 2017) reports an experiment to determine the relationship between y = elastic modulus (GPa) and x = temperature for ceria specimens under certain conditions. A scatterplot in the article suggests a quadratic relationship.
a. The article reports the estimated equation y = −1.92 × 10⁻⁵x² − .0191x + 89.0. Over what temperature range does elastic modulus increase with temperature, and for what temperature range does it decrease? b. Predict the elastic modulus of a ceria specimen at 800 °C. c. The coefficient of determination is reported as R2 = .948. Use the fact that the data consisted of n = 28 observations to perform a model utility test at the .01 level. d. Information consistent with the article suggests that at x = 800, sŶ = 2.9 GPa. Use this to calculate a 95% CI for μY|800. e. The residual standard deviation for the quadratic model is roughly se = 2.37 GPa. Use this to calculate a 95% PI at x = 800, and interpret this interval.

90. Many studies have researched how traffic load affects road asphalt, but fewer have examined the effect of extreme cold weather. The article “Effects of Large Freeze-Thaw Cycles on Stiffness and Tensile Strength of Asphalt Concrete” (J. Cold Regions Engr. 2016) reports the following data on y = indirect tensile strength (MPa) and x = temperature (°C) for six asphalt specimens in one particular experiment.

x   −35    −20    −10    0      10     22
y   3.01   3.56   3.47   2.72   2.15   1.20
a. Verify that a scatterplot of the data is consistent with the choice of a quadratic regression model. b. Determine the estimated quadratic regression equation. c. Calculate a point prediction for the indirect tensile strength of this asphalt type at 0 °C.
d. What proportion of the observed variation in tensile strength can be attributed to the quadratic regression relationship? e. Obtain a 95% CI for μY|0, the true expected tensile strength of this asphalt type at 0 °C. f. Obtain and interpret a 95% PI at x = 0.

91. Ethyl vanillin is used in food, cosmetics, and pharmaceuticals for its vanilla-like scent. The article “Determination and Correlation of Ethyl Vanillin Solubility in Different Binary Solvents at Temperatures from 273.15 to 313.15 K” (J. Chem. Engr. Data 2017: 1788–1796) reported an experiment to determine y = ethyl vanillin solubility (mole fraction) as a function of x1 = initial mole fraction of the chemical propan-2-one in the solvent mixture and x2 = temperature (°K). The experiment was run at seven x1 and nine x2 values. The accompanying table shows the response, y, at each combination. [Note: 273.15 °K corresponds to 0 °C.]

x2 \ x1 =    .4     .5     .6     .7     .8     .9     1.0
273.15      10.6   13.4   16.8   18.5   19.5   19.8   19.9
278.15      12.8   16.2   19.2   21.0   21.4   21.5   21.6
283.15      13.7   18.8   20.9   23.1   23.4   23.5   23.7
288.15      17.3   20.7   23.8   26.4   26.7   26.9   27.1
293.15      20.5   25.2   26.9   29.1   29.6   29.8   30.0
298.15      23.7   27.2   29.8   31.2   32.1   32.4   33.0
303.15      27.4   31.6   33.4   35.0   36.2   36.2   36.9
308.15      34.3   36.9   38.7   40.7   41.0   41.3   41.6
313.15      38.5   42.1   43.4   45.3   45.5   45.7   45.9
a. Create scatterplots of y versus x1 and y versus x2. Does it appear the predictors are linearly related to y, or would quadratic terms be appropriate? b. Would a scatterplot of x1 versus x2 indicate whether an interaction term might be suitable? Why or why not? c. Perform a regression using the complete second-order model. Based on a residual analysis, does it appear that the model assumptions are satisfied? d. Test various hypotheses to determine which term(s) should be retained in the model.
92. In the construction industry, a “project labor agreement” (PLA) between clients and contractors stipulates that all bidders on a project will use union labor and abide by union rules. The article “Do Project Labor Agreements Raise Construction Costs?” (CS-BIGS 2007: 71–79) investigated construction costs for 126 schools in Massachusetts over an eight-year period. Among the variables considered were y = project cost, in dollars per square foot; x1 = project size, in 1000s of ft2; x2 = 1 for new construction and 0 for remodel; and x3 = 1 if a PLA was in effect and 0 otherwise. a. What would it mean in this context to say that x1 and x3 have an interaction effect? b. What would it mean in this context to say that x2 and x3 have an interaction effect? c. No second-order terms are statistically significant here, and the estimated regression equation for the first-order model is y = 138.69 – .1236x1 + 17.89x2 + 18.83x3. Interpret the coefficient on x1. Does the sign make sense? d. Interpret the coefficient on x3. e. The estimated standard error of b^3 is 4.96. Test the hypotheses H0: b3 = 0 vs. Ha: b3 > 0 at the .01 significance level. Does the data indicate that PLAs tend to raise construction costs? 93. A regression analysis carried out to relate y = repair time for a water filtration system (hr) to x1 = elapsed time since the previous service (months) and x2 = type of repair (1 if electrical and 0 if mechanical) yielded the following model based on n = 12 observations: y = .950 + .400x1 + 1.250x2. In addition, SST = 12.72, SSE = 2.09, and sb^2 ¼ :312. a. Does there appear to be a useful linear relationship between repair time and the two model predictors? Carry out a test of the appropriate hypotheses using a significance level of .05.
b. Given that elapsed time since the last service remains in the model, does type of repair provide useful information about repair time? State and test the appropriate hypotheses using a significance level of .01. c. Calculate and interpret a 95% CI for β2. d. The estimated standard deviation of a prediction for repair time when elapsed time is six months and the repair is electrical is .192. Predict repair time under these circumstances by calculating a 99% prediction interval. Does the interval suggest that the estimated model will give an accurate prediction? Why or why not? 94. Sometimes an investigator wishes to decide whether a group of m predictors (m > 1) can simultaneously be eliminated from the model. The null hypothesis says that all β's associated with these m predictors are 0, which is interpreted to mean that as long as the other k − m predictors are retained in the model, the m predictors under consideration collectively provide no useful information about y. The test is carried out by first fitting the "full" model with all k predictors to obtain SSE(full) and then fitting the "reduced" model consisting just of the k − m predictors not being considered for deletion to obtain SSE(red). The test statistic is

$$F = \frac{[\mathrm{SSE(red)} - \mathrm{SSE(full)}]/m}{\mathrm{SSE(full)}/[n - (k+1)]}$$
The test is upper-tailed and based on m numerator df and n − (k + 1) denominator df. This procedure is called the partial F test. Refer back to Example 12.26. The following are the SSEs and numbers of predictors for the first-order, complete second-order, and the researchers’ final model.
Model                     Predictors    SSE
First-order               5             44,985
Complete second-order     20            16,092
Final                     12            17,794
a. Use the partial F test to compare the first-order and complete second-order models. Is there evidence that at least some of the second-order terms should be retained? b. Use the partial F test to compare the final model to the complete second-order model. Do you agree with the researchers' decision to eliminate the "other" eight predictors? 95. Utilization of sucrose as a carbon source for the production of chemicals is uneconomical. Beet molasses is a readily available and low-priced substitute. The article "Optimization of the Production of β-Carotene from Molasses by Blakeslea trispora" (J. Chem. Tech. Biotech. 2002: 933–943) carried out a multiple regression analysis to relate the response variable y = amount of β-carotene (g/dm³) to the three predictors: amount of linoleic acid, amount of kerosene, and amount of antioxidant (all g/dm³).

Obs.   Linoleic   Kerosene   Antiox.   β-Carotene
1      30.00      30.00      10.00     0.7000
2      30.00      30.00      10.00     0.6300
3      30.00      30.00      18.41     0.0130
4      40.00      40.00      5.00      0.0490
5      30.00      30.00      10.00     0.7000
6      13.18      30.00      10.00     0.1000
7      20.00      40.00      5.00      0.0400
8      20.00      40.00      15.00     0.0065
9      40.00      20.00      5.00      0.2020
10     30.00      30.00      10.00     0.6300
11     30.00      30.00      1.59      0.0400
12     40.00      20.00      15.00     0.1320
13     40.00      40.00      15.00     0.1500
14     30.00      30.00      10.00     0.7000
15     30.00      46.82      10.00     0.3460
16     30.00      30.00      10.00     0.6300
17     30.00      13.18      10.00     0.3970
18     20.00      20.00      5.00      0.2690
19     20.00      20.00      15.00     0.0054
20     46.82      30.00      10.00     0.0640
a. Fitting the complete second-order model in the three predictors resulted in R2 = .987 and adjusted R2 = .974, whereas fitting the first-order model gave R2 = .016. What would you conclude about the two models? b. For x1 = x2 = 30, x3 = 10, a statistical software package reported that ŷ = .66573, sŶ = .01785, and se = .044 based on the complete second-order model. Predict the amount of β-carotene that would result from a single experimental run with the designated values of the explanatory variables, and do so in a way that conveys information about precision and reliability. 96. Snowpacks contain a wide spectrum of pollutants that may represent environmental hazards. The article "Atmospheric PAH Deposition: Deposition Velocities and Washout Ratios" (J. Environ. Engr. 2002: 186–195) focused on the deposition of polyaromatic hydrocarbons. The authors proposed a multiple regression model for relating deposition over a specified time period (y, in μg/m²) to two rather complicated predictors x1 (μg·s/m³) and x2 (μg/m²) defined in terms of PAH air concentrations for various species, total time, and total amount of precipitation. Here is data on the species fluoranthene and corresponding Minitab output:

x1        x2          y
92017     .0026900    278.78
51830     .0030000    124.53
17236     .0000196    22.65
15776     .0000360    28.68
33462     .0004960    32.66
243500    .0038900    604.70
67793     .0011200    27.69
23471     .0006400    14.18
13948     .0004850    20.64
8824      .0003660    20.60
7699      .0002290    16.61
15791     .0014100    15.08
10239     .0004100    18.05
43835     .0000960    99.71
49793     .0000896    58.97
40656     .0026000    172.58
50774     .0009530    44.25
The regression equation is
flth dep = −33.5 + 0.00205 x1 + 29836 x2

Predictor     Coef         SE Coef      T       P
Constant     −33.46        14.90       −2.25   0.041
x1            0.0020548     0.0002945   6.98   0.000
x2            29836         13654       2.19   0.046

S = 44.28    R-Sq = 92.3%    R-Sq(adj) = 91.2%

Analysis of Variance
Source           DF    SS        MS        F       P
Regression        2    330,989   165,495   84.39   0.000
Residual error   14     27,454     1,961
Total            16    358,443
Formulate questions and perform appropriate analyses. Construct the appropriate residual plots, including plots against the predictors. Based on these plots, justify adding a quadratic predictor, and fit the model with this additional predictor. Does this predictor provide additional useful information over and above what x1 and x2 contribute, and does it help the appearance of the diagnostic plots? Also, the data includes a clear outlier. Re-run the regression without the outlier and determine whether quadratic terms are appropriate. 97. The following data set has ratings from ratebeer.com along with values of IBU (international bittering units, a measure of bitterness) and ABV (alcohol by volume) for 25 beers. Notice which beers have the lowest ratings and which are highest.

Beer                                    IBU    ABV     Rating
Amstel Light                            18     3.5     1.93
Anchor Liberty Ale                      54     5.9     3.60
Anchor Steam                            33     4.9     3.31
Bud Light                               7      4.2     1.15
Budweiser                               11     5       1.38
Coors                                   14     5       1.63
DAB Dark                                32     5       2.73
Dogfish 60 min IPA                      60     6       3.76
Great Divide Titan IPA                  65     6.8     3.81
Great Divide Hercules Double IPA        85     9.1     4.05
Guinness Extra Stout                    60     5       3.38
Harp Lager                              21     4.3     2.85
Heineken                                23     5       2.13
Heineken Premium Light                  11     3.2     1.62
Michelob Ultra                          4      4.2     1.01
Newcastle Brown Ale                     18     4.7     3.05
Pilsner Urquell                         35     4.4     3.28
Redhook ESB                             29     5.77    3.06
Rogue Imperial Stout                    88     11.6    3.98
Samuel Adams Boston Lager               31     4.9     3.19
Shiner Light                            13     4.03    2.57
Sierra Nevada Pale Ale                  37     5.6     3.61
Sierra Nevada Porter                    40     5.6     3.60
Terrapin All-Amer. Imperial Pilsner     75     7.5     3.46
Three Floyds Alpha King                 66     6       4.04
a. Find the correlations (and the corresponding P-values) among Rating, IBU, and ABV. b. Regress rating on IBU and ABV. Notice that although both predictors have strongly significant correlations with Rating, they do not both have significant regression coefficients. How do you explain this? c. Plot the residuals from the regression of (b) to check the assumptions. Also plot rating against each of the two predictors. Which of the assumptions is clearly not satisfied? d. Regress rating on IBU and ABV with the square of IBU as a third predictor. Again check assumptions. e. How effective is the regression in (d)? Interpret the coefficients with regard to statistical significance and sign. In particular, discuss the relationship to IBU. f. Summarize your conclusions. 98. The article “Promoting Healthy Choices: Information versus Convenience” (Amer. Econ. J.: Appl. Econ. 2010: 164–178) reported on a field experiment at a fast-food sandwich chain to see whether calorie information provided to patrons would affect calorie intake. One aspect of the
study involved fitting a multiple regression model with seven predictors to data consisting of 342 observations. Predictors in the model included age and indicator variables for sex, whether or not a daily calorie recommendation was provided, and whether or not calorie information about choices was provided. The reported value of the F ratio for testing model utility was 3.64. a. At significance level .01, does the model appear to specify a useful linear relationship between calorie intake and at least one of the predictors? b. What can be said about the P-value for the model utility F test? c. What proportion of the observed variation in calorie intake can be attributed to the model relationship? Does this seem very impressive? Why is the Pvalue as small as it is? d. The estimated coefficient for the indicator variable calorie information provided was −71.73, with an estimated standard error of 25.29. Interpret the coefficient. After adjusting for the effects of other predictors, does it appear that true average calorie intake depends on whether or not calorie information is provided? Carry out a test of appropriate hypotheses.
12.9 Regression with Matrices
Throughout this chapter we have explored linear models with both one and several predictors. It should perhaps not be surprising that such models can be embedded in the language of linear algebra, i.e., in matrix form. In this section, we re-write the model equation and least squares estimates in terms of certain matrices and then derive matrix-based formulas for several of the quantities mentioned in earlier sections. (The focus here will be on multiple regression, since simple linear regression is just the special case where k = 1.) In fact, all software packages that perform regression analysis rely on these matrix representations for computation.

The Model Equation in Matrix Form

In Section 12.7 we used the following additive model equation to relate a response variable y to explanatory variables x1, …, xk:

Y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₖxₖ + ε,
where ε ~ N(0, σ) and the εᵢ's for different observations are independent of one another. Suppose there are n observations, each consisting of a y value and values of the k predictors (so each observation consists of k + 1 numbers). Then the n equations for the various observations can be expressed compactly using matrix notation:

$$Y_1 = \beta_0 + \beta_1 x_{11} + \cdots + \beta_k x_{1k} + \varepsilon_1,\ \ldots,\ Y_n = \beta_0 + \beta_1 x_{n1} + \cdots + \beta_k x_{nk} + \varepsilon_n$$

$$\Rightarrow\quad \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix} \qquad (12.15)$$
The dimensions of the four matrices in (12.15), from left to right, are n × 1, n × (k + 1), (k + 1) × 1, and n × 1. If we denote these four matrices by Y, X, β, and ε, then the multiple linear regression model is equivalent to

Y = Xβ + ε

We will use y to denote the n × 1 column vector of observed y values: y = [y₁, …, yₙ]′, where ′ denotes matrix transpose. The vector y (or Y) is called the response vector, while the matrix X is known as the design matrix. The design matrix consists of one row for each observation (n rows total) and one column for each predictor, along with a leading column of 1's to accommodate the constant term.

Parameter Estimation in Matrix Form

We now estimate β₀, β₁, …, βₖ using the principle of least squares. Let ‖u‖ denote the (Euclidean) length of a column vector u, i.e., ‖u‖² = Σuᵢ² = u′u. Then our goal is to find b₀, b₁, …, bₖ to minimize

$$g(b_0, b_1, \ldots, b_k) = \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik})]^2 = \|y - Xb\|^2,$$
where b is the column vector with entries b₀, b₁, …, bₖ. One solution method was outlined in Section 12.7: if we set the partial derivatives of g with respect to b₀, b₁, …, bₖ equal to zero, the result is the normal equations in Expression (12.13). In matrix form, (12.13) becomes

$$\begin{bmatrix} n & \sum x_{i1} & \cdots & \sum x_{ik} \\ \sum x_{i1} & \sum x_{i1}x_{i1} & \cdots & \sum x_{i1}x_{ik} \\ \vdots & \vdots & & \vdots \\ \sum x_{ik} & \sum x_{ik}x_{i1} & \cdots & \sum x_{ik}x_{ik} \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_{i1}y_i \\ \vdots \\ \sum x_{ik}y_i \end{bmatrix}$$

(all sums run from i = 1 to n).
The matrix on the left is X′X and the one on the far right is X′y. The normal equations then become X′Xb = X′y. We will assume throughout this section that X′X has an inverse, so the vector of estimated coefficients is given by
$$\hat{\beta} = b = (X'X)^{-1}X'y \qquad (12.16)$$
All statistical software packages use the matrix version of the normal equations to calculate the estimated regression coefficients. (At the end of this section, we present an alternative derivation of β̂ that relies on linear algebra rather than partial derivatives from calculus.)

Example 12.28 For illustrative purposes, suppose we wish to predict horsepower using engine size (liters) and fuel type (premium or regular) based on a sample of n = 6 cars.

Horsepower   Engine size   Fuel type
132          2.0           Regular
167          2.0           Premium
170          2.5           Regular
204          2.5           Premium
230          3.0           Regular
260          3.0           Premium
Define variables y = horsepower, x1 = engine size, and x2 = 1 for premium fuel and 0 for regular (an indicator variable). Then the response vector and design matrix here are

$$y = \begin{bmatrix} 132 \\ 167 \\ 170 \\ 204 \\ 230 \\ 260 \end{bmatrix} \quad X = \begin{bmatrix} 1 & 2.0 & 0 \\ 1 & 2.0 & 1 \\ 1 & 2.5 & 0 \\ 1 & 2.5 & 1 \\ 1 & 3.0 & 0 \\ 1 & 3.0 & 1 \end{bmatrix} \;\Rightarrow\; X'X = \begin{bmatrix} 6 & 15 & 3 \\ 15 & 38.5 & 7.5 \\ 3 & 7.5 & 3 \end{bmatrix} \;\text{and}\; X'y = \begin{bmatrix} 1163 \\ 3003 \\ 631 \end{bmatrix}$$

Notice that X′X is symmetric (and will be in all cases—do you see why?). The least squares estimates of the regression coefficients are

$$\hat{\beta} = (X'X)^{-1}X'y = \begin{bmatrix} 79/12 & -5/2 & -1/3 \\ -5/2 & 1 & 0 \\ -1/3 & 0 & 2/3 \end{bmatrix} \begin{bmatrix} 1163 \\ 3003 \\ 631 \end{bmatrix} = \begin{bmatrix} -61.417 \\ 95.5 \\ 33 \end{bmatrix}$$

Figure 12.34 shows R output from multiple regression using this (toy) data set. Notice that estimated regression coefficients exactly match our vector β̂.
Residuals:
     1      2      3      4      5      6
 2.417  4.417 -7.333 -6.333  4.917  1.917

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -61.417     17.966  -3.419 0.041887 *
x1            95.500      7.002  13.639 0.000853 ***
x2            33.000      5.717   5.772 0.010337 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.002 on 3 degrees of freedom
Multiple R-squared: 0.9865, Adjusted R-squared: 0.9775
F-statistic: 109.7 on 2 and 3 DF, p-value: 0.001567

Figure 12.34 R output for Example 12.28
■
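The matrix computations in Example 12.28 are easy to reproduce with a few lines of code. The following R sketch is only illustrative (the data and estimates are those of the example; the object names are our own): it builds X and y, solves the normal equations, and checks the answer against lm().

# Toy data from Example 12.28
hp   <- c(132, 167, 170, 204, 230, 260)   # response y
size <- c(2.0, 2.0, 2.5, 2.5, 3.0, 3.0)   # engine size x1
prem <- c(0, 1, 0, 1, 0, 1)               # fuel-type indicator x2

X <- cbind(1, size, prem)      # design matrix with a leading column of 1's
y <- hp

XtX <- t(X) %*% X              # X'X
Xty <- t(X) %*% y              # X'y
beta.hat <- solve(XtX, Xty)    # solve the normal equations (X'X)b = X'y
beta.hat                       # -61.417, 95.5, 33, as in the example

coef(lm(hp ~ size + prem))     # lm() produces the same estimates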
Residuals, ANOVA, F, and R²

The estimated regression coefficients can be used to obtain the predicted values and the residuals. Recall that the ith predicted value is ŷᵢ = β̂₀ + β̂₁xᵢ₁ + β̂₂xᵢ₂ + ⋯ + β̂ₖxᵢₖ. The vector of predicted values, ŷ, is

$$\hat{y} = \begin{bmatrix} \hat{y}_1 \\ \vdots \\ \hat{y}_n \end{bmatrix} = \begin{bmatrix} \hat\beta_0 + \hat\beta_1 x_{11} + \cdots + \hat\beta_k x_{1k} \\ \vdots \\ \hat\beta_0 + \hat\beta_1 x_{n1} + \cdots + \hat\beta_k x_{nk} \end{bmatrix} = X\hat{\beta} = X(X'X)^{-1}X'y$$

If we define a matrix H = X(X′X)⁻¹X′, this relationship can be re-written as ŷ = Hy. The matrix H is amusingly called the hat matrix because it "puts a hat" on the vector y. The residual for the ith observation is defined by eᵢ = yᵢ − ŷᵢ, and so the residual vector is

e = y − ŷ = y − Hy = (I − H)y,

where I denotes the n × n identity matrix. Now the sums of squares encountered throughout this chapter can also be written in matrix form (more precisely, as squared lengths of particular vectors). Let ȳ denote an n × 1 column vector whose every entry is the sample mean y value, ȳ. Then

$$\mathrm{SSE} = \sum e_i^2 = \|e\|^2 = \|y - \hat{y}\|^2 \qquad \mathrm{SSR} = \sum(\hat{y}_i - \bar{y})^2 = \|\hat{y} - \bar{y}\|^2 \qquad \mathrm{SST} = \sum(y_i - \bar{y})^2 = \|y - \bar{y}\|^2$$

from which, as before, s²ₑ = MSE = SSE/[n − (k + 1)] and R² = SSR/SST. The fundamental ANOVA identity SST = SSR + SSE can be obtained as follows:

$$\mathrm{SST} = \|y - \bar{y}\|^2 = (y - \bar{y})'(y - \bar{y}) = [(y - \hat{y}) + (\hat{y} - \bar{y})]'[(y - \hat{y}) + (\hat{y} - \bar{y})] = \|y - \hat{y}\|^2 + \|\hat{y} - \bar{y}\|^2 = \mathrm{SSE} + \mathrm{SSR}$$

The cross-terms in the matrix product are zero because of the normal equations (see Exercise 104). Equivalently, the middle two terms drop out because the vectors ŷ − ȳ and e = y − ŷ are orthogonal. The model utility test of H₀: β₁ = ⋯ = βₖ = 0 uses the same F ratio as before:

$$f = \frac{\mathrm{MSR}}{\mathrm{MSE}} = \frac{\mathrm{SSR}/k}{\mathrm{SSE}/[n-(k+1)]} = \frac{\|\hat{y} - \bar{y}\|^2/k}{\|e\|^2/[n-(k+1)]}$$
Example 12.29 (Example 12.28 continued) The predicted values and residuals are easily obtained:

$$\hat{y} = X\hat{\beta} = \begin{bmatrix} 1 & 2.0 & 0 \\ 1 & 2.0 & 1 \\ 1 & 2.5 & 0 \\ 1 & 2.5 & 1 \\ 1 & 3.0 & 0 \\ 1 & 3.0 & 1 \end{bmatrix} \begin{bmatrix} -61.417 \\ 95.50 \\ 33 \end{bmatrix} = \begin{bmatrix} 129.583 \\ 162.583 \\ 177.333 \\ 210.333 \\ 225.083 \\ 258.083 \end{bmatrix} \qquad e = y - \hat{y} = \begin{bmatrix} 132 \\ 167 \\ 170 \\ 204 \\ 230 \\ 260 \end{bmatrix} - \begin{bmatrix} 129.583 \\ 162.583 \\ 177.333 \\ 210.333 \\ 225.083 \\ 258.083 \end{bmatrix} = \begin{bmatrix} 2.417 \\ 4.417 \\ -7.333 \\ -6.333 \\ 4.917 \\ 1.917 \end{bmatrix}$$
From these, SSE = ‖e‖² = 2.417² + ⋯ + 1.917² = 147.083, MSE = SSE/[n − (k + 1)] = 147.083/[6 − (2 + 1)] = 49.028, and se = √49.028 = 7.002. The total sum of squares is SST = ‖y − ȳ‖² = Σ(yᵢ − 193.83)² = 10,900.83, and the regression sum of squares can be obtained most easily by subtraction: SSR = SST − SSE = 10,900.83 − 147.083 = 10,753.75. The coefficient of multiple determination is R² = SSR/SST = .9865. Finally, the model utility test statistic value is f = MSR/MSE = (10,753.75/2)/49.028 = 109.67, a massive F ratio that decisively rejects H₀ at any reasonable significance level. Notice that many of these quantities appear in the lower half of the R output in Figure 12.34. ■

Inference About Individual Parameters

In order to develop hypothesis tests and confidence intervals for the regression coefficients, the expected values and standard deviations of the estimators β̂₀, β̂₁, …, β̂ₖ are needed.
DEFINITION
Let U₁, …, Uₘ be rvs and U denote the m × 1 column vector [U₁, …, Uₘ]′. Then the mean vector of U is the m × 1 column vector μ = E(U) whose ith entry is μᵢ = E(Uᵢ). The covariance matrix of U is the m × m matrix whose (i, j)th entry is the covariance of Uᵢ and Uⱼ. That is,

$$\mathrm{Cov}(U) = \begin{bmatrix} \mathrm{Cov}(U_1,U_1) & \mathrm{Cov}(U_1,U_2) & \cdots & \mathrm{Cov}(U_1,U_m) \\ \mathrm{Cov}(U_2,U_1) & \mathrm{Cov}(U_2,U_2) & \cdots & \mathrm{Cov}(U_2,U_m) \\ \vdots & \vdots & & \vdots \\ \mathrm{Cov}(U_m,U_1) & \mathrm{Cov}(U_m,U_2) & \cdots & \mathrm{Cov}(U_m,U_m) \end{bmatrix}$$
If we define the expected value of a matrix of rvs by the element-wise expectations of its entries, then it follows from the definition Cov(Uᵢ, Uⱼ) = E[(Uᵢ − μᵢ)(Uⱼ − μⱼ)] that

$$\mathrm{Cov}(U) = E[(U - \mu)(U - \mu)'] \qquad (12.17)$$
The diagonal entries of the covariance matrix are the variances of the rvs: Cov(Uᵢ, Uᵢ) = V(Uᵢ). Also, the covariance matrix is symmetric, since Cov(Uᵢ, Uⱼ) = Cov(Uⱼ, Uᵢ). For example, suppose U₁ and U₂ are rvs with means 10 and −4, standard deviations 2.5 and 2.3, and covariance −1.1. Then the mean vector and covariance matrix of U = [U₁, U₂]′ are

$$E(U) = \begin{bmatrix} E(U_1) \\ E(U_2) \end{bmatrix} = \begin{bmatrix} 10 \\ -4 \end{bmatrix} \quad\text{and}\quad \mathrm{Cov}(U) = \begin{bmatrix} V(U_1) & \mathrm{Cov}(U_1,U_2) \\ \mathrm{Cov}(U_2,U_1) & V(U_2) \end{bmatrix} = \begin{bmatrix} 2.5^2 & -1.1 \\ -1.1 & 2.3^2 \end{bmatrix}$$
Now consider the vector of random errors ε = [ε₁, …, εₙ]′. The linear regression model assumes that the εᵢ's are independent (so covariance = 0 for each pair) with mean 0 and common variance σ². Under these assumptions, the mean vector of ε is 0 (an n × 1 vector of 0's), while the covariance matrix of ε is σ²I (an n × n matrix with σ² along the main diagonal and 0's everywhere else). It then follows from the model equation Y = Xβ + ε that the (random) response vector Y satisfies

$$E(Y) = X\beta + 0 = X\beta \quad\text{and}\quad \mathrm{Cov}(Y) = \mathrm{Cov}(\varepsilon) = \sigma^2 I$$
To determine the sampling distribution of β̂, we will require the following proposition.

PROPOSITION
Let U be a random vector. If A is a matrix with constant entries and V = AU, then E(V) = AE(U) and Cov(V) = ACov(U)A′.
Proof By the linearity of the expectation operator, E(V) = E(AU) = AE(U) = Aμ. Then, using (12.17),

$$\begin{aligned} \mathrm{Cov}(V) &= E[(V - E(V))(V - E(V))'] \\ &= E[(AU - A\mu)(AU - A\mu)'] = E[A(U - \mu)(A(U - \mu))'] \\ &= E[A(U - \mu)(U - \mu)'A'] = A\,E[(U - \mu)(U - \mu)']\,A' = A\,\mathrm{Cov}(U)\,A' \end{aligned}$$

■
Let's apply this proposition to find the mean vector and covariance matrix of β̂. As an estimator, β̂ = (X′X)⁻¹X′Y, so let A = (X′X)⁻¹X′ and U = Y. By linearity of expectation (the first part of the proposition),

$$E(\hat{\beta}) = (X'X)^{-1}X'E(Y) = (X'X)^{-1}X'X\beta = \beta$$

That is, β̂ is an unbiased estimator of β (for each j, β̂ⱼ is unbiased for estimating βⱼ). Next, the transpose of A is A′ = [(X′X)⁻¹X′]′ = X(X′X)⁻¹; this relies on the fact that X′X is symmetric, so (X′X)⁻¹ is symmetric as well. Applying the second part of the proposition and the earlier observation that Cov(Y) = σ²I,

$$\mathrm{Cov}(\hat{\beta}) = A\,\mathrm{Cov}(Y)\,A' = (X'X)^{-1}X'[\sigma^2 I]X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'X(X'X)^{-1} = \sigma^2(X'X)^{-1}$$
So, the variance of the regression coefficient β̂ⱼ is the jth diagonal entry of the matrix σ²(X′X)⁻¹. Exercise 101 asks you to demonstrate that this matrix formula matches the variance formulas presented earlier in the chapter for simple linear regression. Since σ is unknown, it must be estimated from the data, and the estimated covariance matrix of β̂ is s²ₑ(X′X)⁻¹.

Example 12.30 (Example 12.29 continued) For the engine horsepower scenario we previously found the matrix (X′X)⁻¹ and the residual standard deviation se = 7.002. The estimated covariance matrix of β̂ is

$$7.002^2 \begin{bmatrix} 79/12 & -5/2 & -1/3 \\ -5/2 & 1 & 0 \\ -1/3 & 0 & 2/3 \end{bmatrix} = \begin{bmatrix} 322.766 & -122.569 & -16.343 \\ -122.569 & 49.028 & 0.0 \\ -16.343 & 0.0 & 32.685 \end{bmatrix}$$
The estimated standard deviations of the three coefficients are s_{β̂0} = √322.766 = 17.966, s_{β̂1} = √49.028 = 7.002, and s_{β̂2} = √32.685 = 5.717. Notice that these exactly match the standard errors given in the R output of Figure 12.34. These estimated standard errors form the basis for the variable utility tests and CIs for the βⱼ's presented in Section 12.7.

The covariance matrix also indicates that Cov(β̂₀, β̂₁) ≈ −122.569, meaning the estimates for the y-intercept and the slope on engine size are negatively correlated. In other words, if for a particular sample the estimated slope β̂₁ is greater than its expectation (the true coefficient β₁), then typically the value of β̂₀ will be less than β₀. This makes sense for (x, y) values in the first quadrant: rotating from the true line y = β₀ + β₁x, if sample data results in a slope estimate that is too high (so the line is overly steep), the y-intercept estimate will naturally be too low. ■

What about the standard error of Ŷ = β̂₀ + β̂₁x₁ + ⋯ + β̂ₖxₖ, our point estimate of the mean response μ_{Y|x} at a specified set of x values? The point estimate may be written as Ŷ = xβ̂, where x is the row vector x = [1, x₁, …, xₖ]. Here, x is constant but β̂ (a vector of estimators) has sampling variability. Applying the earlier proposition,

$$V(\hat{Y}) = V(x\hat{\beta}) = x\,V(\hat{\beta})\,x' = x\,\sigma^2(X'X)^{-1}x' = \sigma^2\,x(X'X)^{-1}x' \;\Rightarrow\; s^2_{\hat{Y}} = s_e^2\,x(X'X)^{-1}x'$$

(It's easy to verify here that the expression for s²_{Ŷ} is a 1 × 1 matrix and, as such, may be treated as a scalar.) The square root of this expression gives the estimated standard error of Ŷ, which is required for confidence and prediction intervals.

The Hat Matrix, Leverage, and Outlier Detection

The foregoing proposition can also be used to find estimated standard deviations for the residuals. Recall that the n × n hat matrix is defined by H = X(X′X)⁻¹X′. With the help of the matrix rules (AB)′ = B′A′ and (A⁻¹)′ = (A′)⁻¹, we find that H is symmetric, i.e., H′ = H:

$$H' = [X(X'X)^{-1}X']' = (X')'[(X'X)^{-1}]'X' = X[(X'X)']^{-1}X' = X(X'X)^{-1}X' = H$$

Next, recall that the vector of predicted values is given by Ŷ = HY; here, we're treating the response vector Y as random, which implies that Ŷ is also a random vector. Thus

$$\mathrm{Cov}(\hat{Y}) = H\,\mathrm{Cov}(Y)\,H' = X(X'X)^{-1}X'[\sigma^2 I]X(X'X)^{-1}X' = \sigma^2 X(X'X)^{-1}X' = \sigma^2 H \qquad (12.18)$$

A similar calculation shows that the covariance matrix of the residuals is

$$\mathrm{Cov}(Y - \hat{Y}) = \sigma^2(I - H) \qquad (12.19)$$
The variances of Ŷᵢ and eᵢ are the diagonal entries of the matrices in (12.18) and (12.19), respectively. Of course, the value of σ² is generally unknown, so the estimate s²ₑ = MSE is used instead. If we let hᵢᵢ denote the ith diagonal entry of H, then (12.18) and (12.19) imply that the (estimated) standard deviations of Ŷᵢ and eᵢ are

$$s_{\hat{Y}_i} = s_e\sqrt{h_{ii}} \qquad\text{and}\qquad s_{e_i} = s_e\sqrt{1 - h_{ii}}$$
For the case of simple linear regression, it can be shown that these expressions match the standard error formulas given previously (Exercise 110).

The hat matrix is also important as a measure of the influence of individual observations. Because ŷ = Hy, ŷᵢ = hᵢ₁y₁ + ⋯ + hᵢᵢyᵢ + ⋯ + hᵢₙyₙ, and therefore the ith diagonal element of H measures the impact of the ith observation yᵢ on its own predicted value ŷᵢ. The hᵢᵢ's are sometimes called the leverages to indicate their impact on the regression. An observation with very high leverage will tend to pull the regression toward it, and its residual will tend to be small. Notice, though, that H depends only on the values of the predictors (through the design matrix X), so leverage measures only one aspect of influence.

Example 12.31 Students in a statistics class measured their height, foot length, and wingspan (measured fingertip to fingertip with hands outstretched) in inches. The accompanying table shows the measurements for 16 students; we encountered this data previously in Example 12.16. The last column has the leverages for the regression of wingspan on height and foot length.

Student   Height (x1)   Foot (x2)   Wingspan (y)   Leverage
1         63.0          9.0         62.0           0.239860
2         63.0          9.0         62.0           0.239860
3         65.0          9.0         64.0           0.228236
4         64.0          9.5         64.5           0.223625
5         68.0          9.5         67.0           0.196418
6         69.0          10.0        69.0           0.083676
7         71.0          10.0        70.0           0.262182
8         68.0          10.0        72.0           0.067207
9         68.0          10.5        70.0           0.187088
10        72.0          10.5        72.0           0.151959
11        73.0          11.0        73.0           0.143279
12        73.5          11.0        75.0           0.168719
13        70.0          11.0        71.0           0.245380
14        70.0          11.0        70.0           0.245380
15        72.0          11.0        76.0           0.128790
16        74.0          11.2        76.5           0.188340
[Figure 12.35 Plot of height (x1) versus foot length (x2), with the leverage printed next to each point]

Figure 12.35 shows a plot of x1 = height against x2 = foot length, along with the leverage for each point. Notice that the points at the extreme right and left of the plot have high leverage, and the points
near the center have low leverage. However, it is interesting that the point with highest leverage is not at the extremes of height or foot length. This is student number 7, with a 10-in. foot and height of 71 in., and the high leverage comes from the height being extreme relative to foot length. Indeed, when there are several predictors, high leverage often occurs when values of one predictor are extreme relative to the values of other predictors. For example, if height and weight are predictors, then an overweight or underweight subject would likely have high leverage. ■

Together, standardized residuals and leverages can be used to identify unusual observations in a regression setting. This is particularly helpful in multiple regression, where outliers are more difficult to detect with the naked eye (e.g., student 7 in Example 12.31). By convention, the ith observation is said to have a large residual if its standardized residual satisfies |eᵢ*| > 2 and a very large residual if |eᵢ*| > 3, since these indicate that yᵢ is more than two (resp., three) standard deviations away from the value predicted by the estimated regression function. As for the leverage values, it can be shown that

$$0 \le h_{ii} \le 1 \qquad\text{and}\qquad \sum_{i=1}^{n} h_{ii} = k + 1$$

where k is the number of predictor variables in the regression model. (In fact, it isn't difficult to show directly that Σhᵢᵢ = 2 for simple regression, i.e., k = 1.) This implies that the mean of the leverage values is (k + 1)/n, since there are n leverage values total (one for each observation). By convention, the ith observation is said to possess high leverage if hᵢᵢ > 2(k + 1)/n and very high leverage if hᵢᵢ > 3(k + 1)/n.

More sophisticated tools for outlier detection in regression are also available. If the "influence" of an observation is defined in terms of the effect on the predicted values when the observation is omitted, then an influential observation is one that has both large leverage and a large residual. A popular measure that combines leverage and residual is Cook's distance; consult the book by Kutner et al. for more information. Many statistical software packages will provide the standardized residual, leverage, and Cook's distance for all n observations upon request; some will also flag observations with unusually high values (e.g., according to the criteria above).

Another Perspective on Least Squares

We previously used multivariate calculus—in particular, the normal equations (12.13)—to determine the least squares estimates of the regression coefficients. The matrix representation of regression allows an alternative derivation of these estimates that relies instead on linear algebra. Let 1, x₁, …, xₖ denote the k + 1 columns of the design matrix X. The principle of least squares says we should determine coefficients b₀, b₁, …, bₖ that minimize

$$\sum (y_i - [b_0 + b_1 x_{i1} + \cdots + b_k x_{ik}])^2 = \|y - [b_0\mathbf{1} + b_1 x_1 + \cdots + b_k x_k]\|^2$$

The expression in brackets is a (generic) linear combination of the vectors 1, x₁, …, xₖ. Since ‖·‖ denotes Euclidean distance, we know from linear algebra that such a distance is minimized by finding the projection of y onto the vector space (i.e., the closest vector to y in the space) spanned by 1, x₁, …, xₖ. Call this projection vector p. Since p lies in span{1, x₁, …, xₖ} it must have the form p = b₀1 + b₁x₁ + ⋯ + bₖxₖ = Xb; our goal now is to find an explicit formula for the coefficients.
Use the property that if p is the projection of y onto span{1, x₁, …, xₖ}, then the vector y − p must be orthogonal to that space. That is, the vector y − p must be perpendicular to each of 1, x₁, …, xₖ, meaning that 1′(y − p) = 0 and xⱼ′(y − p) = 0 for j = 1, …, k. In matrix form, these k + 1 requirements can be written as X′(y − p) = 0. Put it all together: with p = Xb and X′(y − p) = 0,

$$X'(y - Xb) = 0 \;\Rightarrow\; X'y = X'Xb \;\Rightarrow\; b = (X'X)^{-1}X'y,$$

matching the previous formula (12.16). Incidentally, the projection vector itself is p = Xb = Hy = ŷ (the vector of fitted values), and the vector orthogonal to the space is y − p = y − ŷ = e (the vector of residuals).
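As a computational companion to this section, the following R sketch (continuing the Example 12.28 data; the object names are ours, and the threshold line is purely illustrative) forms the hat matrix, verifies ŷ = Hy, and extracts the leverages and standardized residuals on which the screening conventions above are based.

hp   <- c(132, 167, 170, 204, 230, 260)
size <- c(2.0, 2.0, 2.5, 2.5, 3.0, 3.0)
prem <- c(0, 1, 0, 1, 0, 1)

X <- cbind(1, size, prem)
y <- hp
H <- X %*% solve(t(X) %*% X) %*% t(X)     # hat matrix H = X(X'X)^{-1}X'

y.hat <- H %*% y                          # fitted values, same as X %*% beta.hat
e     <- y - y.hat                        # residuals
lev   <- diag(H)                          # leverages h_ii; they sum to k + 1 = 3

SSE <- sum(e^2)
se  <- sqrt(SSE / (length(y) - ncol(X)))  # residual standard deviation
std.res <- e / (se * sqrt(1 - lev))       # standardized residuals

# conventional "high leverage" screen: h_ii > 2(k+1)/n
# (with only n = 6 and k = 2 here, the cutoff equals 1, so nothing is flagged)
which(lev > 2 * ncol(X) / length(y))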
Exercises: Section 12.9 (99–110)

99. Consider fitting the model Y = β₀ + β₁x₁ + β₂x₂ + ε to the following data:

x1    −1    −1    1     1
x2    −1    1     −1    1
y     1     1     0     4
a. Determine X and y, and express the normal equations in terms of matrices. b. Determine the β̂ vector, which contains the estimates for the three coefficients in the model. c. Determine ŷ and e. Then calculate SSE, and use this to get the estimated variance MSE. d. Use MSE and (X′X)⁻¹ to construct a 95% confidence interval for β1. e. Carry out a t test for the hypothesis H0: β1 = 0 against a two-tailed alternative, and interpret the result. f. Form the analysis of variance table, and carry out the F test for the hypothesis H0: β1 = β2 = 0. Find R2 and interpret.

100. Consider the model Y = β₀ + β₁x₁ + ε for the following data:

x1    −.5    −.5    −.5    −.5    .5    .5    .5    .5
y     1      2      2      3      8     9     7     8
a. Determine the X and y matrices and express the normal equations in terms of matrices. b. Determine the β̂ vector, which contains the estimates for the two coefficients in the model. c. Determine ŷ and e. d. Calculate SSE (by summing the squared residuals) and then the estimated variance MSE. e. Use MSE and (X′X)⁻¹ to construct a 95% confidence interval for β1. f. Carry out a t test of H0: β1 = 0 against a two-sided alternative. g. Carry out the F test of H0: β1 = 0. How is this related to part (f)?

101. Consider the simple linear regression model Y = β₀ + β₁x + ε, so k = 1 and X consists of a column of 1's and a column of the values x1, …, xn of x. a. Determine X′X and (X′X)⁻¹ using the matrix inverse formula

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$

b. Determine X′y, then calculate the coefficient vector β̂. Compare your answers to the formulas given in Section 12.2. [Hint: Sxy = Σxᵢyᵢ − n x̄ ȳ, and similarly for Sxx.]
c. Use (X′X)⁻¹ to obtain expressions for the variances of the coefficients, and check your answers against the results given in Sections 12.3 and 12.4. [Note: β̂₀ is the predicted value corresponding to x = 0, so the variance of β̂₀ appears implicitly in Section 12.4.]

102. Suppose we have bivariate data (x1, y1), …, (xn, yn). Consider the centered model yᵢ = β₀ + β₁(xᵢ − x̄) + εᵢ for i = 1, …, n. a. Show that

$$X'X = \begin{bmatrix} n & 0 \\ 0 & S_{xx} \end{bmatrix}$$
b. Determine (X′X)⁻¹ and the coefficient vector β̂. c. Determine the estimated standard errors of the regression coefficients. d. Compare this exercise to the previous one. Why is it more efficient to have xᵢ₁ = xᵢ − x̄ rather than xᵢ₁ = xᵢ in the design matrix?

103. Consider the model Yᵢ = β₀ + εᵢ (so k = 0). Estimate β₀ from Equation (12.16). Find a simple expression for s_{β̂₀} and then the 95% confidence interval for β₀. [Note: Your result should be equivalent to the one-sample t confidence interval in Section 8.3.]

104. a. Show that the normal equations are equivalent to X′e = 0. [Hint: Use the matrix representation of the normal equations in this section and substitute the formula for b = β̂.] b. Use part (a) to prove the ANOVA identity SST = SSE + SSR by showing that (ŷ − ȳ)′e = 0. [Hint: Part (a) also implies that each row of X′ is orthogonal to e; in particular, the first column of X, a column of n 1's, satisfies 1′e = 0.]

105. Suppose that we have Y1, …, Ym ~ N(μ1, σ), Ym+1, …, Ym+n ~ N(μ2, σ), and all m + n observations are independent. These are the assumptions of the pooled
t procedure in Section 10.2. Let k = 1, x11 = .5, …, xm1 = .5, xm+1,1 = −.5, …, xm+n,1 = −.5. For convenience in inverting X′X assume m = n. a. Obtain β̂₀ and β̂₁ from Equation (12.16). [Hint: Let ȳ₁ be the mean of the first m observations and ȳ₂ be the mean of the next n observations.] b. Find simple expressions for ŷ, SSE, se, and s_{β̂₁}. c. Use parts (a) and (b) to find a simple expression for the 95% CI for β1. Show that your formula is equivalent to

$$\hat\beta_1 \pm t_{.025,\,m+n-2}\, s_e \sqrt{\frac{1}{m} + \frac{1}{n}} \;=\; \bar{y}_1 - \bar{y}_2 \pm t_{.025,\,m+n-2} \sqrt{\frac{1}{m} + \frac{1}{n}}\, \sqrt{\frac{\sum_{i=1}^{m}(y_i - \bar{y}_1)^2 + \sum_{i=m+1}^{m+n}(y_i - \bar{y}_2)^2}{m + n - 2}}$$

which is the pooled variance confidence interval discussed in Section 9.2. d. Let m = 3 and n = 3, with y1 = 117, y2 = 119, y3 = 127, y4 = 129, y5 = 138, y6 = 139. These are the prices in thousands for three houses in Brookwood and then three houses in Pleasant Hills. Apply parts (a), (b), and (c) to this data set.

106. The constant term β0 is not always needed in the regression equation. For example, many physical principles imply that the response variable should be 0 when the explanatory variables are 0, so the constant term is not needed. Then it is preferable to omit β0 and use the model Y = β₁x₁ + β₂x₂ + ⋯ + βₖxₖ + ε. Here we focus on the special case k = 1. a. Differentiate the appropriate sum of squares to derive the one normal equation for estimating β1. b. Express your normal equation in matrix form, where X consists of a single
column with the values of the predictor variable. c. Apply part (b) to the data of Example 12.28, using hp for y and just engine size in X. d. Explain why deletion of the constant term might be appropriate for the data set in part (c). e. By fitting a regression model with a constant term added to the model of part (c), test the hypothesis that the constant is not needed.

107. a. Prove that the hat matrix H satisfies H² = H. b. Prove Equation (12.19). [Hint: Look at the derivation of Equation (12.18).]

108. Use Eqs. (12.18) and (12.19) to show that each of the leverages is between 0 and 1, and therefore the variances of the predicted values and residuals are between 0 and σ².

109. The measurements here are similar to those in Example 12.31, except that here the students did the measuring at home, and the results suffered in accuracy.

Wingspan    Foot    Height
74          13.0    75
56          8.5     66
65          10.0    69
66          9.5     66
62          9.0     54
69          11.0    72
75          12.0    75
66          9.0     63
66          9.0     66
63          8.5     63
a. Regress wingspan on the other two variables. Carry out the test of model utility and the tests for the two individual regression coefficients of the predictors.
b. Obtain the diagonal elements of the hat matrix (leverages). Identify the point with the highest leverage. What is unusual about the point? Given the instructor's assertion that there were no students in the class less than five feet tall, would you say that there was an error? Give another reason that this student's measurements seem wrong. c. For the other points with high leverages, what distinguishes them from the points with ordinary leverage values? d. Examining the residuals, find another student whose data might be wrong. e. Discuss the elimination of questionable points in order to obtain valid regression results.

110. Refer back to the centered simple regression model in Exercise 102. a. Show that the leverage hii, the ith diagonal entry of the hat matrix H, is given by

$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}$$

b. Show that the sum of the leverages in simple regression is 2. c. Use part (a) and the discussion of H in this section to confirm the following formulas from Sections 12.4 and 12.6:

$$V(\hat{Y}_i) = \sigma^2\left[\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\right] \qquad V(Y_i - \hat{Y}_i) = \sigma^2\left[1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}}\right]$$
12.10 Logistic Regression
All of the regression models thus far have assumed a quantitative response variable y. (Section 12.8 discussed how to incorporate a categorical explanatory variable using one or more indicators, but y was still numerical.) In this final section, we describe procedures for modeling the relationship
between a categorical response variable and one or more predictors. For example, university administrators may wish to predict whether a student will graduate (a yes-or-no variable) as a function of high school GPA, SAT scores, and number of extracurricular activities. Medical researchers frequently construct models to determine the effect of treatment dosage and other factors (age, weight, and so on) on whether or not someone contracts a certain disease.

The Simple Logistic Regression Model

The simple linear regression model is appropriate for relating a quantitative response variable y to a quantitative predictor x. But suppose we have a dichotomous categorical response variable, whose "values" are success and failure. We can encode this with a Bernoulli rv Y, with possible values 1 and 0 corresponding to success and failure. As in previous chapters, let p = P(S) = P(Y = 1) and 1 − p = P(F) = P(Y = 0). Frequently, the value of p will depend on the value of some quantitative variable x. For example, the probability that a car needs warranty service should depend on the car's mileage, or the probability of avoiding an infection might depend on the dosage in an inoculation. Instead of using just the symbol p for the success probability, we now use p(x) to emphasize the dependence of this probability on the value of x. The simple linear regression equation Y = β₀ + β₁x + ε is no longer appropriate, for taking the mean value on each side of that equation would give

μ_{Y|x} = 1 · p(x) + 0 · [1 − p(x)] = p(x) = β₀ + β₁x

Whereas p(x) is a probability and therefore must be between 0 and 1, β₀ + β₁x need not be in this range. Instead of letting the mean value of y be a linear function of x, we now consider a model in which the mean response p(x) is a particular nonlinear function of x. A function that has been found quite useful in many applications is the logit function

$$p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \qquad (12.20)$$
It is easy to see that the logit function is bounded between 0 and 1, so 0 < p(x) < 1 as desired. Figure 12.36 shows a graph of p(x) for particular values of β0 and β1 with β1 > 0. As x increases, the probability of success increases. For β1 < 0, the success probability would be a decreasing function of x.

[Figure 12.36 A graph of a logit function: p(x) rises from near 0 toward 1 as x increases]
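To make the shape of (12.20) concrete, here is a minimal R sketch that simply evaluates the logit function on a grid. The particular values β0 = −2 and β1 = .5 are only illustrative (they happen to be the ones used in Example 12.32 below), and the function name is ours.

logit.p <- function(x, b0, b1) {
  exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))   # Equation (12.20)
}

x <- seq(0, 20, by = 0.5)
p <- logit.p(x, b0 = -2, b1 = 0.5)

plot(x, p, type = "l", ylim = c(0, 1), ylab = "p(x)")  # S-shaped curve as in Figure 12.36
logit.p(c(2, 8), b0 = -2, b1 = 0.5)                    # approximately .27 and .88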
Logistic regression means assuming that p(x) is related to x by the logit function. Straightforward algebra shows that

$$\frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x}$$

The expression on the left-hand side is called the odds. If, for example, p(60) = 3/4 = .75, then p(60)/(1 − p(60)) = .75/(1 − .75) = 3 and when x = 60 a success is three times as likely as a failure. This is described by saying that the odds are 3 to 1 because the success probability is three times the failure probability. Taking natural logs of both sides, we see that the logarithm of the odds is a linear function of the predictor:

$$\ln\!\left[\frac{p(x)}{1 - p(x)}\right] = \beta_0 + \beta_1 x$$

In particular, the slope parameter β1 is the change in the log-odds associated with a one-unit increase in x. This implies that the odds itself changes by the multiplicative factor e^{β1} when x increases by one unit. The quantity e^{β1} is called the odds ratio, because it represents the ratio of the odds of success when the predictor variable equals x + 1 to the odds of success when the predictor variable equals x.

Example 12.32 It seems reasonable that the size of a cancerous tumor should be related to the likelihood that the cancer will spread (metastasize) to another site. The article "Molecular Detection of p16 Promoter Methylation in the Serum of Patients with Esophageal Squamous Cell Carcinoma" (Cancer Res. 2001: 3135–3138) investigated the spread of esophageal cancer to the lymph nodes. With x = size of a tumor (cm) and Y = 1 if the cancer does spread, consider the logistic regression model with β1 = .5 and β0 = −2 (values suggested by data in the article). Then

$$p(x) = \frac{e^{-2 + .5x}}{1 + e^{-2 + .5x}}$$

from which p(2) = .27 and p(8) = .88 (tumor sizes for patients in the study ranged from 1.7 to 9.0 cm). Because e^{−2 + .5(6.77)} ≈ 4, the odds for a 6.77 cm tumor are 4, so that it is four times as likely as not that a tumor of this size will spread to the lymph nodes. Finally, for every 1-cm increase in tumor size, the odds of metastasis increase by a multiplicative factor of e^{.5} ≈ 1.65, or 65%. Be careful here: the probability of metastasis is not increasing by 65%, but rather the odds; under the logistic regression model, the probability of an outcome does not increase linearly with x (see Figure 12.36). ■

Fitting the Simple Logistic Regression Model

Fitting the logit model (12.20) to sample data requires that the parameters β0 and β1 be estimated. Rather than apply the principle of least squares from linear regression, the standard way to estimate logistic regression parameters is by the method of maximum likelihood. Suppose, for example, that n = 5 and that the observations made at x2, x4, and x5 are successes whereas the other two observations are failures. Then the likelihood function is

$$\begin{aligned} L(\beta_0, \beta_1) &= P(Y_1 = 0, Y_2 = 1, Y_3 = 0, Y_4 = 1, Y_5 = 1) = [1 - p(x_1)][p(x_2)][1 - p(x_3)][p(x_4)][p(x_5)] \\ &= \left[\frac{1}{1 + e^{\beta_0 + \beta_1 x_1}}\right]\left[\frac{e^{\beta_0 + \beta_1 x_2}}{1 + e^{\beta_0 + \beta_1 x_2}}\right]\left[\frac{1}{1 + e^{\beta_0 + \beta_1 x_3}}\right]\left[\frac{e^{\beta_0 + \beta_1 x_4}}{1 + e^{\beta_0 + \beta_1 x_4}}\right]\left[\frac{e^{\beta_0 + \beta_1 x_5}}{1 + e^{\beta_0 + \beta_1 x_5}}\right] \end{aligned}$$
Unfortunately it is not at all straightforward to maximize this likelihood, and there are no nice formulas for the mles β̂0 and β̂1. The maximization process must be carried out using iterative numerical methods. The details are involved, but fortunately the most popular statistical software packages will do this on request and provide both quantitative and graphical indications of how well the model fits. In particular, the mle β̂1 is typically provided along with its estimated standard deviation s_{β̂1}. For large n, the mle has approximately a normal distribution and the standardized variable (β̂1 − β1)/S_{β̂1} has approximately a standard normal distribution. This allows for calculation of a confidence interval for β1 as well as for testing H0: β1 = 0, according to which the value of x has no impact on the likelihood of success.

Example 12.33 The following data resulted from a study commissioned by a large management consulting company to investigate the relationship between amount of job experience (x, in months) for a junior consultant and the likelihood of the consultant being able to perform a certain complex task. The value y = 1 indicates the consultant completed the task (success), whereas y = 0 corresponds to failure.

x    4    5    6    6    7    8    9    10   11   11   13   13   14   15   18
y    0    0    0    0    0    1    0    0    0    0    1    0    1    0    1

x    18   19   20   20   21   21   22   23   25   26   27   28   29   30   32
y    0    0    1    0    1    1    1    0    1    1    0    1    1    1    1
Figure 12.37 shows Minitab output for a logistic regression analysis. The estimates of the parameters β0 and β1 are β̂0 = −3.21107 and β̂1 = 0.177717, respectively. The resulting estimated logistic regression function, denoted p̂(x), is

$$\hat{p}(x) = \frac{e^{\hat\beta_0 + \hat\beta_1 x}}{1 + e^{\hat\beta_0 + \hat\beta_1 x}} = \frac{e^{-3.211 + 0.1777x}}{1 + e^{-3.211 + 0.1777x}}$$
The graph of p̂(x) is the curve shown in Figure 12.38; notice that the (estimated) probability of success increases as x increases. Remember that the logit curve is modeling the mean y value for each x value; we do not anticipate that it will intersect the points in the scatterplot.

Binary Logistic Regression: Success versus Months

Logistic Regression Table
                                                           95% CI
Predictor    Coef        SE Coef      Z       P     Odds Ratio  Lower  Upper
Constant    -3.21107     1.23540    -2.60   0.009
Months       0.177717    0.0657308   2.70   0.007      1.19      1.05   1.36

Figure 12.37 Logistic regression output from Minitab
[Figure 12.38 Scatterplot with the fitted logistic regression function for Example 12.33 (P(Success) versus Months)]
We may use p̂(x) to estimate the likelihood of a junior consultant completing the complex task, based upon her/his duration of job experience. For example,

$$\hat{p}(12) = \frac{e^{-3.211 + 0.1777(12)}}{1 + e^{-3.211 + 0.1777(12)}} = .254 \qquad\text{and}\qquad \hat{p}(24) = \frac{e^{-3.211 + 0.1777(24)}}{1 + e^{-3.211 + 0.1777(24)}} = .742$$
So, it is estimated that a consultant with just one year (12 months) of experience has about a .25 chance of successfully completing the task, compared to a probability of over .74 for someone with two years' experience.

The Minitab output includes s_{β̂1} under SE Coef. For the "utility test" of H0: β1 = 0 versus Ha: β1 ≠ 0, the test statistic value and two-tailed P-value are

$$z = \frac{\hat\beta_1 - 0}{s_{\hat\beta_1}} = \frac{.177717 - 0}{.0657308} = 2.70 \qquad \text{P-value} = 2P(Z \ge 2.70) = 2[1 - \Phi(2.70)] = .007$$
The null hypothesis is rejected at the .05 or .01 level, and we conclude that months of experience is a useful predictor of a junior consultant's ability to complete the task.

The estimated odds ratio is e^{β̂1} = e^{0.1777} = 1.19. A 95% CI for β1 is given by

β̂1 ± z.025 s_{β̂1} = .177717 ± 1.96(.0657308) = (.04888, .30655)

from which a 95% CI for the true odds ratio, e^{β1}, is (e^{.04888}, e^{.30655}) = (1.05, 1.36). The estimated odds ratio and the CI all appear in the output. With a high degree of confidence, for each additional month of experience, the odds that a consultant can successfully complete the task increase by a multiplicative factor of between 1.05 and 1.36, i.e., increase by 5–36%. ■
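For readers who want to reproduce the analysis in Example 12.33 with software, here is a hedged R sketch using glm() with the binomial family, which fits the model by maximum likelihood; the data are those listed in the example, and the results should agree with the Minitab output in Figure 12.37 up to rounding.

months  <- c(4, 5, 6, 6, 7, 8, 9, 10, 11, 11, 13, 13, 14, 15, 18,
             18, 19, 20, 20, 21, 21, 22, 23, 25, 26, 27, 28, 29, 30, 32)
success <- c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,
             0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1)

fit <- glm(success ~ months, family = binomial)   # logistic regression via maximum likelihood
summary(fit)    # coefficients approximately -3.211 and 0.1777, with their standard errors

# estimated success probabilities at 12 and 24 months (about .25 and .74)
predict(fit, newdata = data.frame(months = c(12, 24)), type = "response")

exp(coef(fit)["months"])               # estimated odds ratio, about 1.19
exp(confint.default(fit)["months", ])  # Wald 95% CI for the odds ratio, about (1.05, 1.36)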
Some software packages report the value of the chi-squared statistic z² rather than z itself, along with the corresponding P-value for a two-tailed test.

Example 12.34 Here is data on launch temperature (°F) and the incidence of failure for O-rings in 23 space shuttle launches prior to the Challenger disaster of January 28, 1986.

Temperature  Failure    Temperature  Failure    Temperature  Failure
53           Y          68           N          75           N
57           Y          69           N          75           Y
58           Y          70           N          76           N
63           Y          70           N          76           N
66           N          70           Y          78           N
67           N          70           Y          79           N
67           N          72           N          81           N
67           N          73           N
Figure 12.39 shows JMP output from a logistic regression analysis. We have chosen to let p denote the probability of an O-ring failure, since this is really the event of interest. Failures tended to occur at lower temperatures and successes at higher temperatures, so the graph of p̂(x) decreases as temperature (x) increases.

Parameter Estimates
Term        Estimate      Std Error    ChiSquare   Prob>ChiSq
Intercept   15.0422911    7.378391     4.16        0.0415
temp        -0.2321537    0.1082329    4.60        0.0320

Figure 12.39 Logistic regression output from JMP (shown with a plot of failure versus temp and the fitted curve)
The estimate of β1 is β̂1 = −.2322, and the estimated standard deviation of β̂1 is s_{β̂1} = .1082. The value of z for testing H0: β1 = 0, which asserts that temperature does not affect the likelihood of O-ring failure, is z = β̂1/s_{β̂1} = −.2322/.1082 = −2.15. The P-value is 2P(Z ≤ −2.15) = 2(.0158) = .032. JMP reports the value of a chi-squared statistic computed as (−2.15)² ≈ 4.60 (there is a slight disparity due to rounding in the z value). Either way, the P-value indicates that H0 should be rejected at the .05 level and, hence, that temperature at launch time has a statistically significant effect on the likelihood of an O-ring failure. Specifically, for each 1 °F increase in launch temperature, we estimate that the odds of failure are multiplied by a factor of e^{β̂1} = e^{−.2322} ≈ .79, i.e., the odds are estimated to decrease by 21%.
The launch temperature for the Challenger mission was only 31 °F. Because this value is much smaller than any temperature in the sample, it is dangerous to extrapolate the estimated relationship. Nevertheless, it appears that for a temperature this small, O-ring failure is almost a sure thing. The logistic regression gives the estimated probability at x = 31 as

$$\hat{p}(31) = \frac{e^{\hat\beta_0 + \hat\beta_1(31)}}{1 + e^{\hat\beta_0 + \hat\beta_1(31)}} = \frac{e^{15.0423 - .23215(31)}}{1 + e^{15.0423 - .23215(31)}} = .99961$$

and the odds associated with this probability are .99961/(1 − .99961) ≈ 2563. Thus, if the logistic regression can be extrapolated down to 31 °F, the probability of failure is .99961, the probability of success is .00039, and the predicted odds are 2563 to 1 against avoiding an O-ring failure. ■

Multiple Logistic Regression

Multiple logistic regression, a natural extension of simple logistic regression, postulates a model for relating a categorical response variable to more than one explanatory variable. The explanatory variables themselves may be true quantitative predictors or indicator variables coding categorical predictors. We continue to restrict attention to a binary response, such as yes/no or happy/sad, which may be coded as 1 or 0 (with 1 indicating the event of interest, i.e., a "success"). With predictors x1, …, xk in the model, let p(x₁, …, xₖ) denote the true probability of the event of interest occurring and assume the following multiple logit function applies:
$$p(x_1, \ldots, x_k) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}} \qquad (12.21)$$
The multiple logit function (12.21) is the obvious extension of the simple logit function (12.20) to accommodate more than one explanatory variable. As in simple logistic regression, this logit function can be re-written in terms of the natural log of the odds of the event of interest:

$$\ln\!\left[\frac{p(x_1, \ldots, x_k)}{1 - p(x_1, \ldots, x_k)}\right] = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$
Written this way, the coefficient βj (j = 1, …, k) is interpreted as the change in the log-odds of the event of interest associated with a one-unit increase in xj, after adjusting for the effects of all the other predictors in the model. Equivalently, e^{βj} is the multiplicative change in odds associated with a one-unit increase in xj after accounting for the other k − 1 predictors, i.e., e^{βj} is the odds ratio associated with xj.

Inference procedures in multiple logistic regression are similar to those outlined for simple logistic regression. In particular, the point estimators β̂0, β̂1, …, β̂k for the unknown βj's are based upon the principle of maximum likelihood, and each estimator has approximately a normal sampling distribution provided the sample size n is reasonably large. Several statistical software packages will provide point estimates and estimated standard errors for the coefficients, allowing for variable utility hypothesis tests as well as confidence intervals.

Example 12.35 The authors of the article "Building Social Capital in Forest Communities: Analysis of New Mexico's Collaborative Forest Restoration Program" (Natural Resour. J., Fall 2007: 867–915) analyzed the factors that helped determine which proposals were funded by the
Collaborative Forest Restoration Program (CFRP, a federally funded grant program). Data was available on 219 proposals made in New Mexico between 2001 and 2006. The response variable of interest is

$$y = \begin{cases} 1 & \text{if the grant proposal was funded} \\ 0 & \text{if the grant proposal was not funded} \end{cases}$$

We will consider just a few of the predictor variables the authors used in their analysis: x1 = amount of funding requested by the project (in $1000), x2 = percent of county residents living below the poverty threshold, and x3 = 1 if the proposed treatment of private lands was cited by the review panel as a weakness of the project (x3 = 0 otherwise). Parameter estimates from software are β̂0 = −1.216, β̂1 = .00156, β̂2 = .0327, and β̂3 = −2.002.
Consider a proposal requesting $360,000 (x1 = 360) that was not criticized for its proposed treatment of private lands (x3 = 0) in a county with a 16.6% poverty rate (x2 = 16.6); this exactly matches one of the proposals. Then the estimated log-odds of the project being funded are

−1.216 + .00156(360) + .0327(16.6) − 2.002(0) = −.11158

and the estimated probability of being funded is

$$\hat{p}(360, 16.6, 0) = \frac{e^{-.11158}}{1 + e^{-.11158}} = .4721$$
(For the record, that particular proposal was funded!) Funding request amount had little practical impact on whether a project was funded: adjusting for poverty rate and private land use consideration, a $1000 (one-unit) increase in requested funding actually increased the estimated odds of acceptance by e^{.00156} = 1.0016, i.e., by .16%. In contrast, criticism for private land treatment was a veritable death-knell: removing the effects of the other two variables, odds of acceptance when x3 = 1 are e^{−2.002} = .1351 times the acceptance odds when x3 = 0. In other words, if a proposal was criticized in this way, the odds of acceptance were reduced by more than 86%. ■
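The arithmetic in Example 12.35 is easy to script. The following R sketch uses the coefficient values quoted above (the vector and its element names are our own choices) to convert the fitted log-odds into an estimated probability and into odds ratios:

b <- c(intercept = -1.216, funding = 0.00156, poverty = 0.0327, land.crit = -2.002)

# estimated log-odds and probability for x1 = 360, x2 = 16.6, x3 = 0
log.odds <- b["intercept"] + b["funding"] * 360 + b["poverty"] * 16.6 + b["land.crit"] * 0
p.hat <- exp(log.odds) / (1 + exp(log.odds))   # about .4721

# odds ratios for a one-unit increase in each predictor
exp(b["funding"])     # about 1.0016
exp(b["land.crit"])   # about .1351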
814
12
Regression and Correlation
Exercises: Section 12.10 (111–120) 111. A major electronics retailer sensibly believes that customers are more likely to redeem an emailed coupon if it’s worth more money. With x = coupon discount amount ($), and Y = 1 if a customer redeems the coupon, consider a logistic regression model with b0 = −3.75 and b1 = 0.1. a. Calculate and interpret both p(10) and p(50). b. Calculate the odds that a $10 coupon is redeemed, then repeat for $50. c. Interpret b1 in this context. d. According to this model, for what discount amount is there a 50–50 chance the coupon will be redeemed? 112. In Example 12.32, the probability of cancer metastasizing was given by the logistic regression model with b0 = −2 and b1 = 0.5. a. Tabulate values of x, p(x), the odds pðxÞ=½1 pðxÞ, and the log-odds for x = 2, 3, 4, …, 9. (In the cited article, tumor sizes ranged from 1.7 to 9.0 cm.) b. Explain what happens to the odds when x is increased by 1. Your explanation should involve the .5 that appears in the formula for p(x). c. Support your answer to (b) algebraically, starting from the formula for p(x). d. For what value of x are the odds 1? 5? 10? 113. Adolescents are getting less sleep that ever before, and this can have serious behavioral repercussions. The article “Dose-Dependent Associations Between Sleep Duration and Unsafe Behaviors Among US High-School Students” (JAMA Pediatr. 2018: 1187– 1189) reported a large-scale study of American teenagers. The investigators fit a simple logistic regression model with the response variable y = 1 if a teenager had driven drunk in the last 30 days (and 0 otherwise), and x = typical number of hours
of sleep per night. Information in the article suggests b^1 = –.1998 and sb^1 = .0986. a. Test whether sleep has an effect on the likelihood of driving drunk among American teenagers, at the .05 significance level. b. Calculate a 95% confidence interval for e b1 . c. Interpret the confidence interval from part (b) in terms of a one-hour decrease in sleep. 114. The pharmaceutical industry has increasingly developed “nanoformulations” for drug delivery, but quality control at such a small scale is tricky. The article “Quality by Design Approach Using Multiple Linear and Logistic Regression Modeling Enables Microemulsion Scale Up” (Molecules 2019) describes one study to determine how x = oil concentration (g/100 mL) affects whether a development run meets a certain critical quality attribute (CQA) with respect to polydispersity. Here, y = 1 if the CQA was achieved and = 0 if not. x
6.0
6.0
2.0
4.0
6.0
2.0
2.0
y
0
0
1
1
0
1
1
4.0
2.0
1
1
x
2.0
2.0
6.0
2.0
4.0
6.0
y
1
1
0
1
0
0
x
4.5
2.0
2.0
6.0
2.0
y
0
1
1
0
1
6.0
2.0
6.0
2.0
2.0
1
0
1
1
2.0
2.0
4.5
4.0
2.0
1
1
1
0
1
0
Software reports coefficients b^0 = 11.13 and b^1 = −2.68 with estimated standard errors 5.96 and 1.42, respectively. a. Write out the estimated logit function, and use it to estimate p(2), p(4), and p(6). b. Does the data provide convincing statistical evidence that oil concentration affects the chance of meeting this particular CQA? Test at the .05 significance level.
12.10
Logistic Regression
815
c. Construct a 95% CI for b1, then use this to give an interval estimate for eb1 . Interpret the latter interval. d. What does eb0 represent in this context? Does that interpretation seem appropriate here? Why or why not? 115. Kyphosis, or severe forward flexion of the spine, may persist despite corrective spinal surgery. A study carried out to determine risk factors for kyphosis reported the following ages (months) for 40 subjects at the time of the operation; the first 18 subjects did have kyphosis and the remaining 22 did not. Kyphosis
12 82 121
15 91 128
42 96 130
52 105 139
59 114 139
73 120 157
No kyphosis
1 22 97 151
1 31 112 159
2 37 118 177
8 61 127 206
11 72 131
18 81 140
a. Use software to fit a logistic regression model to this data. b. Interpret the coefficient b^1 . [Hint: It might be more sensible to work in ^
terms of eb1 .] c. Test whether age has a statistically significant impact on the presence of kyphosis. 116. Exercise 16 of Chapter 1 presented data on the noise level (dBA) for 77 individuals working at a particular office. In fact, each person was also asked whether the noise level in the office was acceptable or not. Acceptable
55.3 56.1 57.0 58.8 65.3 Unacceptable 63.8 64.7 68.7 73.1 79.3
55.3 56.1 57.0 58.8 65.3 63.8 65.1 68.7 74.6 79.3
55.3 56.1 57.8 58.8 65.3 63.8 65.1 68.7 74.6 83.0
55.9 56.1 57.8 59.8 65.3 63.9 65.1 70.4 74.6 83.0
55.9 56.1 57.8 59.8 68.7 63.9 67.4 70.4 74.6 83.0
55.9 56.8 57.9 59.8 69.0 63.9 67.4 71.2 79.3
55.9 56.8 57.9 62.2 73.0 64.7 67.4 71.2 79.3
56.1 57.0 57.9 62.2 73.0 64.7 67.4 73.1 79.3
a. Use software to fit a logistic regression model to this data.
b. Interpret the coefficient b^1 . [Hint: It might be more sensible to work in ^
terms of eb1 .] c. Construct and interpret a 95% confidence interval for eb1 . 117. The article “Consumer Attitudes Toward Genetic Modification and Other Possible Production Attributes for Chicken” (J. Food Distr. Res. 2005: 1–11) reported a survey of 498 randomly selected consumers concerning their views on genetically modified (GM) food. The researchers’ goal was to model the response variable Y = 1 if a consumer wants GM chicken products labeled (and 0 otherwise) as a function of x1 = consumer’s age (yr), x2 = income ($1000s), sex (x3 = 1 if female), and whether there are children in the consumer’s household (x4 = 1 if yes). Estimated model parameter values are b^0 ¼ :8247, b^1 ¼ :0073, b^2 ¼ :0041, b^3 ¼ :9910, and b^4 ¼ :0224. a. Estimate the likelihood that a consumer wants GM chicken products labeled if that person is a 35-year-old female with $65,000 annual income and no children. b. Repeat part (a) for a 35-year-old male (keep other features the same). c. Interpret the coefficient on age. d. Interpret the coefficient on the sex indicator variable. 118. Road trauma is the leading cause of death and injury among young people in Australia. The article “The Journey from Traffic Offender to Severe Road Trauma Victim: Destiny or Preventive Opportunity?” (PLoS ONE, April 22, 2015) reported a study to determine factors that might help predict future serious accidents. The article included estimated odds ratios and 95% CIs for true odds ratios for several variables. The response variable here has value y = 1 if a subject was in an accident leading to intensive care admission or death, and 0 otherwise.
816
12 Est. OR
x1 x2 x3 x4
= = = =
age/10 1 if male, 0 if female years with a driver’s license number of prior traffic offenses
1.02 1.18 0.99 1.10
signify? [Hint: The latter should not be surprising.] c. Estimate the probability of spotting a humpback whale during a 30-minute tour one week (i.e., seven days) after the final salmon release. d. The estimated standard errors of the coefficients are sb^1 = .253 and sb^2 = .120. Perform variable utility tests at the .1 significance level. e. Interpret both e:096 and e:210 in this context.
OR 95% CI (1.01, (0.98, (0.98, (1.08,
1.03) 1.42) 0.99) 1.11)
a. Which of these four explanatory variables were associated with a decreased likelihood of later severe road trauma? How can you tell? b. Which of these four explanatory variables were not statistically significant predictors in this model? How can you tell? c. Interpret the 95% CI provided for eb4 . 119. Whale-watching is big business in Alaska, particularly around salmon release sites where whales tend to congregate. The article “Humpback Whales Feed on Hatchery-Released Juvenile Salmon” (Roy. Soc. Open Sci. 2017) reported a study to determine what factors help predict the likelihood of spotting a humpback whale when visiting one of these sites. The following data on x1 = days after final salmon release, x2 = duration of visit (min), and whether a whale was sighted are from a recent year at Little Port Walter. x1
x2
Whale?
x1
x2
Whale?
2 1 1 2 3 3 4 5 5 6 6
15 15 15 15 15 15 15 30 15 15 15
Y N N N N N N N N N N
7 7 8 8 9 9 10 12 12 13 13
15 15 15 15 15 15 15 15 15 15 35
N N N N N N N N N N Y
(The full study investigated five sites across several years before, during, and after salmon release.) a. Use software to fit a multiple logistic regression model to this data, and confirm that the estimated log-odds function is −5.68 − .096x1 + .210x2. b. What does the negative sign for the coefficient .096 signify? What does the positive sign for the coefficient .210
Regression and Correlation
120. The article “Developing Coal Pillar Stability Chart Using Logistic Regression” (J. Rock Mech. Mining Sci. 2013: 55–60) includes the following data on x1 = height– width ratio, x2 = strength–stress ratio, and y = 1 (stable) or 0 (not stable) for 29 pillars used to stabilize current and former mines in India. x1
1.80
1.65
2.70
3.67
1.41
1.76
2.10
2.10
x2
2.40
2.54
0.84
1.68
2.41
1.93
1.77
1.50
y
1
1
1
1
1
1
1
1
x1
4.57
3.59
8.33
2.86
2.58
2.90
3.89
0.80
x2
2.43
5.55
2.58
2.00
3.68
1.13
2.49
1.37
y
1
1
1
1
1
1
1
0
x1
0.60
1.30
0.83
0.57
1.44
2.08
1.50
1.38
x2
1.27
0.87
0.97
0.94
1.00
0.78
1.03
0.82
y
0
0
0
0
0
0
0
0
x1
0.94
1.58
1.67
3.00
2.21
x2
1.30
0.83
1.05
1.19
0.86
y
0
0
0
0
0
a. Fit a multiple logistic regression model to this data, and report the estimated logit equation. b. Perform the two variable utility tests H0: b1 = 0 and H0: b2 = 0, each at the .1 significance level. c. Calculate the predicted probability of stability for pillar 8 (x1 = 2.10, x2 = 1.50). d. Calculate the predicted probability of stability for pillar 28 (x1 = 3.00, x2 = 1.19).
Supplementary Exercises
817
Supplementary Exercises: (121–138) 121. In anticipation of future floods, insurance companies must quantify the relationship between water depth and the amount of flood damage that will occur. The Federal Insurance Administration provided the following information on x = depth of flooding (in feet above first-floor level) and y = flood damage (as a percentage of structural value) for homes with no basements. Flood level (x) 0 1 2 3 4 5 6 7
Flood damage (y)
Flood level (x)
Flood damage (y)
7 10 14 26 28 29 41 43
8 9 10 11 12 13 14
44 45 46 47 48 49 50
a. Create a scatterplot of the data, and briefly describe what you see. b. Does a straight-line relationship seem appropriate for this data? Why or why not? 122. The article “Exhaust Emissions from FourStroke Lawn Mower Engines” (J. Air Water Manage. Assoc. 1997: 945–952) reported data from a study in which both a baseline gasoline mixture and a reformulated gasoline were used. Consider the following observations on age (year) and NOx emissions (g/kWh): Engine Age Baseline Reformulated
1 0 1.72 1.88
2 0 4.38 5.93
3 2 4.06 5.54
4 11 1.26 2.67
5 7 5.31 6.53
Engine Age Baseline Reformulated
6 16 .57 .74
7 9 3.37 4.94
8 0 3.44 4.89
9 12 .74 .69
10 4 1.24 1.42
a. Construct a scatterplot of baseline vs. reformulated NOx emissions. Comment on what you find.
b. Construct scatterplots of NOx emissions versus age. What appears to be the nature of the relationship between these two variables? 123. The presence of hard alloy carbides in high chromium white iron alloys results in excellent abrasion resistance, making them suitable for materials handling in the mining and materials processing industries. The accompanying data on x = retained austenite content (%) and y = abrasive wear loss (mm3) in pin wear tests with garnet as the abrasive was read from a plot in the article “Microstructure-Property Relationships in High Chromium White Iron Alloys” (Internat. Mater. Rev. 1996: 59–82). Refer to the accompanying SAS output. x 4.6 17.0 17.4 18.0 18.5 22.4 26.5 30.0 34.0 y .66
.92 1.45 1.03 .70
.73 1.20 .80
.91
x 38.8 48.2 63.5 65.8 73.9 77.2 79.8 84.0 y 1.19 1.15 1.12 1.37 1.45 1.50 1.36 1.29
a. What proportion of observed variation in wear loss can be attributed to the simple linear regression model relationship? b. What is the value of the sample correlation coefficient? c. Test the utility of the simple linear regression model using a = .01. d. Estimate the true average wear loss when content is 50% and do so in a way that conveys information about reliability and precision. e. What value of wear loss would you predict when content is 30%, and what is the value of the corresponding residual?
818
12
Regression and Correlation
Analysis of variance Source Model Error C Total
DF 1 15 16 Root MSE Dep Mean C.V.
Sum of squares 0.63690 0.61860 1.25551 0.20308 1.10765 18.33410
Mean square 0.63690 0.04124
F Value 15.444
R-square Adj R-sq
Prob > F 0.0013
0.5073 0.4744
Parameter estimates Variable INTERCEP AUSTCONT
DF 1 1
Parameter estimate 0.787218 0.007570
Standard error 0.09525879 0.00192626
124. An investigation was carried out to study the relationship between speed (ft/s) and stride rate (number of steps taken/s) among female marathon runners. Resulting sumP mary quantities included n = 11, P P (speed) = 205.4, (speed)2 = 3880.08, (rate) = P 35.16, (rate)2 = 112.681, and P (speed)(rate) = 660.130. a. Calculate the equation of the least squares line that you would use to predict stride rate from speed. [Hint: x ¼ P x =n and similarly for y; Sxy ¼ P P P i xi yi ð xi Þð yi Þ=n and similarly for Sxx and Syy.] b. Calculate the equation of the least squares line that you would use to predict speed from stride rate. c. Calculate the coefficient of determination for the regression of stride rate on speed of part (a) and for the regression of speed on stride rate of part (b). How are these related? d. How is the product of the two slope estimates related to the value calculated in (c)? 125. Suppose that x and y are positive variables and that a sample of n pairs results in r 1. If the sample correlation coefficient is computed for the (x, y2) pairs, will the resulting value also be approximately 1? Explain. 126. In Section 12.4, we presented a formula for the variance Vðb^0 þ b^1 x Þ and a CI for b0 þ b1 x . Taking x = 0 gives r2b^ and a CI 0
for b0. Use the data of Example 12.18 to calculate the estimated standard deviation
T for H0: Parameter = 0 8.264 3.930
Prob > |T| 0.0001 0.0013
of b^0 and a 95% CI for the y-intercept of the true regression line. 127. In biofiltration of wastewater, air discharged from a treatment facility is passed through a damp porous membrane that causes contaminants to dissolve in water and be transformed into harmless products. The accompanying data on x = inlet temperature (°C) and y = removal efficiency (%) was the basis for a scatterplot that appeared in the article “Treatment of Mixed Hydrogen Sulfide and Organic Vapors in a Rock Medium Biofilter” (Water Environ. Res. 2001: 426–435). Obs
Temp
Removal %
Obs
Temp
Removal %
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
7.68 6.51 6.43 5.48 6.57 10.22 15.69 16.77 17.13 17.63 16.72 15.45 12.06 11.44 10.17 9.64
98.09 98.25 97.82 97.82 97.82 97.93 98.38 98.89 98.96 98.90 98.68 98.69 98.51 98.09 98.25 98.36
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
8.55 7.57 6.94 8.32 10.50 16.02 17.83 17.03 16.18 16.26 14.44 12.78 12.25 11.69 11.34 10.97
98.27 98.00 98.09 98.25 98.41 98.51 98.71 98.79 98.87 98.76 98.58 98.73 98.45 98.37 98.36 98.45
P Calculated xi ¼ P summary quantities are 384:26, yi ¼ 3149:04, Sxx = 485.00, Sxy = 36.71, and Syy = 3.50. a. Does a scatterplot of the data suggest appropriateness of the simple linear regression model? b. Fit the simple linear regression model, obtain a point prediction of removal
Supplementary Exercises
c.
d.
e.
f.
819
efficiency when temperature = 10.50, and calculate the value of the corresponding residual. Roughly what is the size of a typical deviation of points in the scatterplot from the least squares line? What proportion of observed variation in removal efficiency can be attributed to the model relationship? Estimate the slope coefficient in a way that conveys information about reliability and precision, and interpret your estimate. Personal communication with the authors of the article revealed that one additional observation was not included in their scatterplot: (6.53, 96.55). What impact does this additional observation have on the equation of the least squares line and the values of s and R2?
128. Normal hatchery processes in aquaculture inevitably produce stress in fish, which may negatively impact growth, reproduction, flesh quality, and susceptibility to disease. Such stress manifests itself in elevated and sustained corticosteroid levels. The article “Evaluation of Simple Instruments for the Measurement of Blood Glucose and Lactate, and Plasma Protein as Stress Indicators in Fish” (J. World Aquacult. Soc. 1999: 276–284) described an experiment in which fish were subjected to a stress protocol and then removed and tested at various times after the protocol had been applied. The accompanying data on x = time (min) and y = blood glucose level (mmol/L) was read from a plot. x
2
2
5
7
12
13
17
18
23
24
26
28
y 4.0 3.6 3.7 4.0 3.8 4.0 5.1 3.9 4.4 4.3 4.3 4.4 x 29
30
34
36
40
41
44
56
56
57
60
60
y 5.8 4.3 5.5 5.6 5.1 5.7 6.1 5.1 5.9 6.8 4.9 5.7
Use the methods developed in this chapter to analyze the data, and write a brief report summarizing your conclusions (assume that
the investigators are particularly interested in glucose level 30 min after stress). 129. The article “Evaluating the BOD POD for Assessing Body Fat in Collegiate Football Players” (Med. Sci. Sports Exerc. 1999: 1350–1356) reports on a new air displacement device for measuring body fat. The customary procedure utilizes the hydrostatic weighing device, which measures the percentage of body fat by means of water displacement. Here is representative data read from a graph in the paper. a. Use various methods to decide whether it is plausible that the two techniques measure on average the same amount of fat. b. Use the data to develop a way of predicting an HW measurement from a BOD POD measurement, and investigate the effectiveness of such predictions. BOD
2.5
4.0
4.1
6.2
7.1
7.0
HW
8.0
6.2
9.2
6.4
8.6
12.2
BOD
8.3
9.2
9.3
12.0
12.2
HW
7.2
12.0
14.9
12.1
15.3
BOD
12.6
14.2
14.4
15.1
15.2
HW
14.8
14.3
16.3
17.9
19.5
BOD
16.3
17.1
17.9
17.9
HW
17.5
14.3
18.3
16.2
130. Reconsider the situation of Exercise 123, in which x = retained austenite content using a garnet abrasive and y = abrasive wear loss were related via the simple linear regression model Y = b0 + b1x + e. Suppose that for a second type of abrasive, these variables are also related via the simple linear regression model Y = c0 + c1x + e and that V(e) = r2 for both types of abrasive. If the data set consists of n1 observations on the first abrasive and n2 on the second and if SSE1 and SSE2 denote the two error sums of squares, then a pooled estimate of r2 is ^2 ¼ ðSSE1 þ SSE2 Þ=ðn1 þ n2 4Þ. r Let P 2 SSx1 and SSx2 denote ðxi xÞ for the data on the first and second abrasives, respectively. A test of H0: b1 − c1 = 0 (equal slopes) is based on the statistic
820
12
driving 40 mph (“Comparative Study of Asphalt Pavement Responses Under FWD and Moving Vehicular Loading,” J. Transp. Engr. 2016).
b^1 ^c1 ffi T ¼ rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 1 ^ þ r SSx1 SSx2 When H0 is true, T has a t distribution with n1 + n2 − 4 df. Suppose the 15 observations using the alternative abrasive give SSx2 = 7152.5578, ^c1 ¼ :006845, and SSE2 = .51350. Using this along with the data of Exercise 123, carry out a test at level .05 to see whether expected change in wear loss associated with a 1% increase in austenite content is identical for the two types of abrasive. 131. Show that the ANOVA version of the model utility test discussed in Section 12.3 (with test statistic F = MSR/MSE) is in fact a likelihood ratio test for H0: b1 = 0 versus Ha: b1 6¼ 0. [Hint: We have already pointed out that the least squares estimates of b0 and b1 are the mle’s. What is the mle of b0 when H0 is true? Now determine the mle of r2 both in X (when b1 is not necessarily 0) and in X0 (when H0 is true).] 132. Show that the t ratio version of the model utility test is equivalent to the ANOVA F statistic version of the test. Equivalent here means that rejecting H0: b1 = 0 when either t ta/2,n−2 or t −ta/2,n−2 is the same as rejecting H0 when f Fa,1,n−2. 133. When a scatterplot of bivariate data shows a pattern resembling an exponentially increasing or decreasing curve, the following multiplicative exponential model is often used: Y ¼ aebx e. a. What does this multiplicative model imply about the relationship between Y′ = ln(Y) and x? [Hint: take logs on both sides of the model equation and let b0 = ln(a), b1 = b, e′ = ln(e), and suppose that e has a lognormal distribution.] b. The accompanying data resulted from an investigation of how road pulse duration (y, in ms, a measure of structural stress) varied with asphalt depth (x, in mm) in a simulation of large trucks
Regression and Correlation
x
40
40
190
190
267
267
420
420
y
25
36
53
55
78
91
168
201
Fit the simple linear regression model to this data, and check model adequacy using the residuals. c. Is a scatterplot of the data consistent with the exponential regression model? Fit this model by first carrying out a simple linear regression analysis using ln(y) as the response variable and x as the explanatory variable. How good a fit is the simple linear regression model to the “transformed” data (i.e., the (x, ln(y)) pairs)? What are point estimates of the parameters a and b? d. Obtain a 95% prediction interval for pulse duration when asphalt thickness is 250 mm. [Hint: first obtain a PI for ln(y) based on the simple linear regression carried out in (c).] 134. No tortilla chip aficionado likes soggy chips, so it is important to identify characteristics of the production process that produce chips with an appealing texture. The following data on x = frying time (s) and y = moisture content (%) appeared in the article “Thermal and Physical Properties of Tortilla Chips as a Function of Frying Time” (J. Food Process. Preserv. 1995: 175–189). x
5
10
15
20
25
30
45
60
y
16.3
9.7
8.1
4.2
3.4
2.9
1.9
1.3
a. Construct a scatterplot of the data and comment. b. Construct a scatterplot of the pairs (ln(x), ln(y)) (i.e., transform both x and y by logs) and comment. c. Consider the multiplicative power model Y = axbe. What does this model
Supplementary Exercises
821
imply about the relationship between y′ = ln(y) and x′ = ln(x) (assuming that e has a lognormal distribution)? d. Obtain a prediction interval for moisture content when frying time is 25 s. [Hint: first carry out a simple linear regression of y′ on x′ and calculate an appropriate prediction interval.] 135. Forest growth and decline phenomena throughout the world have attracted considerable public and scientific interest. The article “Relationships Among Crown Condition, Growth, and Stand Nutrition in Seven Northern Vermont Sugarbushes” (Canad. J. Forest Res. 1995: 386–397) included a scatterplot of y = mean crown dieback (%), one indicator of growth retardation, and x = soil pH (higher pH corresponds to less acidic soil), from which the following observations were taken: x 3.3 3.4
3.4
3.5
3.6 3.6 3.7
3.7
3.8
3.8
y 7.3 10.8 13.1 10.4 5.8 9.3 12.4 14.9 11.2 8.0 x 3.9 4.0
4.1
y 6.6 10.0 9.2
4.3 4.4 4.5
5.0
5.1
12.4 2.3 4.3 3.0
4.2
1.6
1.0
a. Construct a scatterplot of the data. What model is suggested by the plot? b. Use a statistical software package to fit the model suggested in (a) and test its utility. c. Use the software package to obtain a prediction interval for crown dieback when soil pH is 4.0, and also a confidence interval for expected crown dieback in situations where the soil pH is 4.0. How do these two intervals compare to each other? Is this result consistent with what you learned in simple linear regression? Explain. d. Use the software package to obtain a PI and CI when x = 3.4. How do these intervals compare to the corresponding intervals obtained in (c)? Is this result
consistent with what you learned in simple linear regression? Explain. 136. The article “Validation of the Rockport Fitness Walking Test in College Males and Females” (Res. Q. Exerc. Sport 1994: 152– 158) recommended the following estimated regression equation for relating y = VO2max (L/min, a measure of cardiorespiratory fitness) to the predictors x1 = gender (female = 0, male = 1), x2 = weight (lb), x3 = 1-mile walk time (min), and x4 = heart rate at the end of the walk (beats/min): y ¼ 3:5959 þ :65661x1 þ :0096x2 :0996x3 :0080x4 a. How would you interpret the estimated coefficient −.0996? b. How would you interpret the estimated coefficient .6566? c. Suppose that an observation made on a male whose weight was 170 lb, walk time was 11 min, and heart rate was 140 beats/min resulted in VO2max = 3.15. What would you have predicted for VO2max in this situation, and what is the value of the corresponding residual? d. Using SSE = 30.1033 and SST = 102.3922, what proportion of observed variation in VO2max can be attributed to the model relationship? e. Assuming a sample size of n = 20, carry out a test of hypotheses to decide whether the chosen model specifies a useful relationship between VO2max and at least one of the predictors. 137. Investigators carried out a study to see how various characteristics of concrete are influenced by x1 = % limestone powder and x2 = water–cement ratio, resulting in the accompanying data (“Durability of Concrete with Addition of Limestone Powder,” Mag. Concr. Res. 1996: 131–137).
822
12
x1
x2
28-day comp str. (MPa)
Adsorbability (%)
21 21 7 7 28 0 14 14 14
.65 .55 .65 .55 .60 .60 .70 .50 .60
33.55 47.55 35.00 35.90 40.90 39.10 31.55 48.00 42.30
8.42 6.26 6.74 6.59 7.28 6.90 10.80 5.63 7.43
a. Consider first compressive strength as the dependent variable y. Fit a first-order model, and determine R2a . b. Determine the adjusted R2 value for a model including the interaction term and also for the complete second-order model. Of the three models in parts (a)–(b), which seems preferable? c. Use the “best” model from part (b) to predict compressive strength when % limestone = 14 and water–cement ratio = .60. d. Repeat parts (a)–(b) with adsorbability as the response variable. That is, fit three models: the first-order model, one with first-order terms plus an interaction, and the complete second-order model.
Regression and Correlation
138. A sample of n = 20 companies was selected, and the values of y = stock price and k = 15 predictor variables (such as quarterly dividend, previous year’s earnings, and debt ratio) were determined. When the multiple regression model using these 15 predictors was fit to the data, R2 = .90 resulted. a. Does the model appear to specify a useful relationship between y and the predictor variables? Carry out a test using significance level .05. [Hint: The F critical value for 15 numerator and 4 denominator df is 5.86.] b. Based on the result of part (a), does a high R2 value by itself imply that a model is useful? Under what circumstances might you be suspicious of a model with a high R2 value? c. With n and k as given previously, how large would R2 have to be for the model to be judged useful at the .05 level of significance?
Chi-Squared Tests
13
Introduction In the simplest type of situation considered in this chapter, each observation in a sample is classified as belonging to one of a finite number of categories—for example, blood type could be one of the four categories O, A, B, or AB. With pi denoting the probability that any particular observation belongs in category i, we wish to test a null hypothesis that completely specifies the values of all the pi’s (such as H0: p1 = .45, p2 = .35, p3 = .15, p4 = .05). Other times, the null hypothesis specifies that the pi’s depend on some smaller number of parameters without specifying the values of these parameters; the values of any unspecified parameters must then be estimated from the sample data. In either case, the test statistic will be a measure of the discrepancy between the observed numbers in the categories and the expected numbers when H0 is true. This method, called a chi-squared test and presented in Section 13.1, can also be applied to test the null hypothesis that the sample comes from a particular probability distribution. Chi-squared tests for two different situations are presented in Section 13.2. In the first, the null hypothesis states that the pi’s are the same for several different populations. The second type of situation involves taking a sample from a single population and cross-classifying each individual with respect to two different categorical factors (such as religious preference and political party registration). The null hypothesis in this situation is that the two factors are independent within the population.
13.1
Goodness-of-Fit Tests
Recall that a binomial experiment consists of n independent trials in which each trial can result in one of two possible outcomes, S (for success) and F (for failure). The probability of success is assumed to be constant from trial to trial, and n is fixed at the outset of the experiment. A multinomial experiment generalizes the binomial experiment by allowing each trial to result in one of k possible outcomes, where k 2. For example, suppose a store accepts three different types of credit cards: Visa, MasterCard, and American Express. A multinomial experiment would result from observing the type of credit card used—Visa, MC, or Amex—by each of 50 randomly selected customers who pay with a credit card.
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 J. L. Devore et al., Modern Mathematical Statistics with Applications, Springer Texts in Statistics, https://doi.org/10.1007/978-3-030-55156-8_13
823
824
13
DEFINITION
Chi-Squared Tests
A multinomial experiment satisfies the following conditions: 1. The experiment consists of a sequence of n trials, where n is fixed in advance of the experiment. 2. Each trial can result in one of the same k possible outcomes (also called categories). 3. The trials are independent, so that the outcome on any particular trial does not influence the outcome on any other trial. 4. The probability that a trial results in category i is pi, which remains constant from trial to trial. P The parameters p1, …, pk must of course satisfy pi 0 and pi ¼ 1.
If the experiment consists of selecting n individuals or objects from a population and categorizing each one, then pi is interpreted as the proportion of the population falling in the ith category; such an experiment will be approximately multinomial provided that n is much smaller than the population size. In the aforementioned example, k = 3 (number of categories = number of credit cards accepted), n = 50 (number of trials = number of customers), and pi denotes the proportion of all credit card purchases made with type i (1 = Visa, 2 = MC, 3 = Amex). The null hypothesis of interest at this point will specify the value of each pi. For example, suppose the store manager believes 50% of all credit card customers use Visa, 30% MasterCard, and the remaining 20% American Express. This belief can be expressed as the assertion H0 : p1 ¼ :5; p2 ¼ :3; p3 ¼ :2 The alternative hypothesis will state that H0 is not true—i.e., that at least one of the pi’s has a value different from that asserted by H0 (in which case at least two must be different, since they sum to 1). The symbol pi0 will represent the value of pi claimed by the null hypothesis. In the example just given, p10 = .5, p20 = .3, and p30 = .2. (The symbol p10 is read “p one naught” and not “p ten.”) Before the multinomial experiment is performed, the number of trials that will result in the ith category (i = 1, 2, …, or k) is a random variable—just as the number of successes and the number of failures in a binomial experiment are random variables. This random variable will be denoted by Ni P and its observed value by ni. Since each trial results in exactly one of the k categories, Ni ¼ n, and the same is true of the ni’s. As an example, an experiment with n = 50 and k = 3 might yield N1 = 22, N2 = 13, and N3 = 15. The Ni’s (or ni’s) are called the observed counts. When the null hypothesis H0 : p1 ¼ p10 ; . . .; pk ¼ pk0 is true, the expected number of trials resulting in category i is EðNi Þ ¼ ðtotal number of trialsÞ ðhypothesized probability of category iÞ ¼ npi0 These are the expected counts under H0. For the case H0: p1 = .5, p2 = .3, p3 = .2 and n = 50, E(N1) = 25, E(N2) = 15, and E(N3) = 10 when H0 is true. The expected counts, like the observed counts, sum to n. It is customary to display both sets of counts, with the expected counts under H0 in parentheses below the observed counts. The counts in the credit card situation under discussion would be displayed in tabular format as Credit card
Visa
MC
Amex
Observed count Expected count
22 (25)
13 (15)
15 (10)
13.1
Goodness-of-Fit Tests
825
A test procedure requires assessing the discrepancy between the observed and expected counts, with H0 being rejected when the discrepancy is sufficiently large. The test statistic, originally proposed by Karl Pearson around 1900, is X
k ðobserved count expected countÞ2 X ðNi npi0 Þ2 ¼ expected count npi0 i¼1 all categories
ð13:1Þ
The numerator of each term in the sum is the squared difference between observed and expected counts. The more these differ within any particular category, the larger will be the contribution to the overall sum. The reason for including the denominator term will be explained shortly. Since the observed counts (the Ni’s) are random variables, their values depend on the specific sample collected, and the test statistic (13.1) will vary in value from sample to sample. Larger values of the test statistic indicate a greater discrepancy between the observed and expected counts, making us more apt to reject H0. The approximate sampling distribution of (13.1) is given in the following theorem. PEARSON’S CHI-SQUARED THEOREM
When H0: p1 ¼ p10 ; . . .; pk ¼ pk0 is true, the statistic v2 ¼
k X ðNi npi0 Þ2 i¼1
npi0
has approximately a chi-squared distribution with k – 1 df. This approximation is reasonable provided that npi0 5 for every i (i = 1, 2, …, k). The chi-squared distribution was introduced in Chapter 6 and used in Chapter 8 to obtain a confidence interval for the variance of a normal population. Recall that the chi-squared distribution has a single parameter m, called the number of degrees of freedom (df) of the distribution. Analogous to the critical value ta;m for the t distribution, v2a; m is the value such that a of the area under the v2 curve with m df lies to the right of v2a; m (see Figure 13.1). Selected values of v2a; m are given in Appendix Table A.5. Notice that, unlike a z or t curve, the chi-squared distribution is positively skewed and only takes on nonnegative values.
2 v curve
Shaded area
0
2
,
Figure 13.1 A critical value for a chi-squared distribution
826
13
Chi-Squared Tests
P The fact that df = k – 1 in the preceding theorem is a consequence of the restriction Ni ¼ n: although there are k observed counts, once any k – 1 are known, the remaining one is uniquely determined. That is, there are only k – 1 “freely determined” cell counts, and thus k – 1 df.
CHI-SQUARED GOODNESS-OF-FIT TEST
Null hypothesis: Alternative hypothesis: Test statistic value:
H0: p1 = p10, . . ., pk = pk0 Ha: at least one pi does not equal pi0 k ðn np Þ2 P i i0 v2 ¼ npi0 i¼1
Rejection Region for Level a Test P-value Calculation v2 v2a;k1
area under v2k1 curve to the right of the calculated v2
The term “goodness-of-fit” refers to the idea that we wish to see how well the observed counts of a categorical variable “fit” a set of hypothesized population proportions. Appendix Table A.5 provides upper-tail critical values at five a levels for each different m. Because this is not sufficiently granular for accurate P-value information, we have also included Appendix Table A.10, analogous to Table A.7, that facilitates making more precise P-value statements. Example 13.1 If we focus on two different characteristics of an organism, each controlled by a single gene, and cross a pure strain having genotype AABB with a pure strain having genotype aabb (capital letters denoting dominant alleles and small letters recessive alleles), the resulting genotype will be AaBb. If these first-generation organisms are then crossed among themselves (a dihybrid cross), there will be four phenotypes depending on whether a dominant allele of either type is present. Mendel’s laws of inheritance imply that these four phenotypes should have probabilities 9/16, 3/16, 3/16, and 1/16 of arising in any given dihybrid cross. The article “Inheritance of Fruit Attributes in Chili Pepper” (Indian J. Hort. 2019: 86–93) reports the phenotype counts resulting from a dihybrid cross of two chili pepper varietals popular in India (WBC-Sel-5 and GVC-101). There are k = 4 categories corresponding to the four possible fruitbearing phenotypes, with the null hypothesis being H 0 : p1 ¼
9 ; 16
p2 ¼
3 ; 16
p3 ¼
3 ; 16
p4 ¼
1 16
Since the total sample size was n = 63, the expected cell counts are 63(9/16) = 35.44, 63(3/16) = 11.81, 11.81, and 63(1/16) = 3.94. Observed and expected counts are given in Table 13.1. (Although one expected count is slightly less than 5, Pearson’s chi-squared theorem should still apply reasonably well to this scenario.) Table 13.1 Observed and expected cell counts for Example 13.1 i=1 Single, drooping
i=2 Single, erect
i=3 Cluster, drooping
i=4 Cluster, erect
32 (35.44)
8 (11.81)
16 (11.81)
7 (3.94)
The contribution to the v2 test statistic value from the first cell is
13.1
Goodness-of-Fit Tests
827
ðn1 np10 Þ2 ð32 35:44Þ2 ¼ :333 ¼ np10 35:44 Cells 2, 3, and 4 contribute 1.230, 1.484, and 2.382, respectively, so v2 = .333 + 1.230 + 1.484 + 2.382 = 5.43. The expected value of v2 under H0 is roughly m ¼ k 1 ¼ 3 and the standard pffiffiffiffiffi deviation is approximately 2m ¼ 2:5. So our test statistic value is only about one standard deviation larger than what we’d expect if the null hypothesis was true, seemingly not highly contradictory to H 0. More formally, a test with significance level .10 at 3 df requires v2:10;3 , the number in the 3 df row and .10 column of Appendix Table A.5. This critical value is 6.251. Since 5.43 < 6.251, H0 cannot be rejected even at this rather large level of significance. (The m = 3 column of Appendix Table A.10 confirms that P-value > .10; software provides a P-value of .143.) The data is reasonably consistent with Mendel’s laws. ■ P Why not simply use ðNi npi0 Þ2 as the test statistic, rather than the more complicated statistic (13.1)? Suppose, for example, that np10 = 100 and np20 = 10. Then if n1 = 95 and n2 = 5, the two P categories contribute the same squared deviations to (ni – npi0)2. Yet n1 is only 5% less than what would be expected when H0 is true, whereas n2 is 50% less. To take relative magnitudes of the deviations into account, we divide each squared deviation by the corresponding expected count and then combine. v2 for Completely Specified Probability Distributions Frequently researchers wish to determine whether observed data is consistent with a particular probability distribution. When the distribution and all of its parameters are completely specified, Pearson’s chi-squared test can be applied to this scenario. Later in this section, we examine the case when the parameters must be estimated from the available data. Example 13.2 In a famous genetics article (“The Progeny in Generations F12 to F17 of a Cross Between a Yellow-Wrinkled and a Green-Round Seeded Pea,” J. Genet. 1923: 255–331), the early statistician G. U. Yule analyzed data resulting from crossing garden peas. The dominant alleles in the experiment were Y = yellow color and R = round shape, resulting in the double dominant YR. Yule examined 269 four-seed pods resulting from a dihybrid cross and counted the number of YR seeds in each pod. Let X denote the number of YR’s in a randomly selected peapod, so possible X values are 0, 1, 2, 3, 4. Based on the discussion in Example 13.1, the Mendelian laws are operative and genotypes of 9 individual seeds within a pod are independent of one another. Thus X has a Binð4; 16 Þ distribution. If for i = 1, 2, 3, 4, 5 we define pi = P(X = i – 1), then we wish to test H0: p1 = p10, …, p5 = p50, where ›
pi0 ¼ Pði 1 YR s among 4 seeds when H0 is trueÞ i1 4 9 9 4ði1Þ ¼ 1 i ¼ 1; 2; 3; 4; 5 16 16 i1 Substituting into this binomial pmf gives hypothesized probabilities .0366, .1884, .3634, .3115, and .1001. Yule’s data and the expected cell counts npi0 = 269pi0 are in Table 13.2.
828
13
Chi-Squared Tests
Table 13.2 Observed and expected cell counts for Example 13.2 i=1 X=0 16 (9.86)
i=2 X=1 45 (50.68)
i=3 X=2 100 (97.75)
i=4 X=3 82 (83.78)
i=5 X=4 26 (26.93)
The test statistic value is v2 ¼
ð16 9:86Þ2 ð26 26:93Þ2 þ þ ¼ 3:823 þ þ :032 ¼ 4:582 9:86 26:93
Since 4:582\v2:01;k1 ¼ v2:01;4 ¼ 13:277, H0 is not rejected at level .01. In fact, software provides a ■ P-value of .333, so H0 should not be rejected at any reasonable significance level. The v2 test can also be used to test whether a sample comes from a specific underlying continuous distribution. Let X denote the variable being sampled and suppose the hypothesized pdf of X is f0(x). As in the construction of a frequency distribution in Chapter 1, subdivide the measurement scale of X into k disjoint intervals (–1, a1), [a1, a2), …, [ak–1, 1). The cell probabilities for i = 2, …, k – 1 specified by H0 are then Zai pi0 ¼ Pðai1 X\ ai Þ ¼
f0 ðxÞdx ai1
and similarly for the two extreme intervals. The intervals should be chosen so that npi0 5 for i = 1, …, k; often they are selected so that the pi0’s are equal. Once the pi0’s are calculated, the underlying distribution is, in a sense, irrelevant—the chi-squared test will determine whether data is consistent with any probability distribution that places probability pi0 on the ith specified interval. Example 13.3 To see whether time of birth is uniformly distributed throughout a 24-hour day, we can divide a day into one-hour periods starting at midnight (k = 24 intervals). The null hypothesis states that f(x) is the uniform pdf on the interval [0, 24], so that pi0 = 1/24 for all i. A random sample of 1000 births from the CDC’s 2018 Natality Public Use File resulted in cell counts of 34 (midnight to 12:59 a.m.), 28, 37, 29, 31, 28, 32, 38, 73, 50, 43, 52, 58, 58, 46, 43, 51, 35, 46, 32, 53, 31, 35, and 37 (11:00 p.m. to 11:59 p.m.). Each expected “cell count” is 1000 1/24 = 41.67, and the resulting value of v2 is 74.432. Since v2:01;23 ¼ 41:637, the computed value is highly significant, and the null hypothesis is resoundingly rejected. In particular, babies were far more likely to be born in the 8:00 a.m.–8:59 a.m. window (observed count = 73) that in any other hour of the day. ■ Example 13.4 The developers of a new online tax program want it to satisfy three criteria: (1) actual time to complete a tax return is normally distributed; (2) the mean completion time is 90 min; (3) 90% of all users will finish their tax returns within 120 min (2 h). A pilot test of the program will utilize 120 volunteers, resulting in 120 completion time observations. This data will be used to test whether the performance criteria are met, using a chi-squared test with k = 8 intervals. Calculating normal probabilities requires both l and r. The target value l = 90 min is given; the 90th percentile of a normal distribution is l + 1.28r, and the criterion l + 1.28r = 120 min implies r = 23.44 min. To divide the standard normal scale into eight equally likely intervals, we look for
13.1
Goodness-of-Fit Tests
829
the .125 quantile, .25 quantile, etc., in the z table. From Table A.3 these values, which form the boundary points of our intervals, have z-scores equal to 1:15
:675 :32
0 :32
:675 1:15
For l = 90 and r = 23.44, these boundary points become 63:04
74:18 82:50
90:00
97:50
105:82
116:96
(Completion times obviously cannot be negative. The area to the left of x = 0 under this curve is negligible, so this issue is not of concern here.) If we define pi = the probability a randomly selected completion time falls in the ith interval defined by the above boundary points, then the goal is to test H0: p1 ¼ :125; . . .; p8 ¼ :125. Suppose the observed counts are as shown in the accompanying table; the expected count for each interval is npi0 = (120)(.125) = 15. Lower endpoint of interval Observed count Expected count
0
63.04
74.18
82.50
90.00
97.50
105.82
116.96
21 (15)
17 (15)
12 (15)
16 (15)
10 (15)
15 (15)
19 (15)
10 (15)
The resulting test statistic value is v2 ¼
ð21 15Þ2 ð10 15Þ2 þ þ ¼ 7:73 15 15
The corresponding P-value (using Table A.10 at df = 8 – 1 = 7) exceeds .100; statistical software gives P-value = .357. Thus we have no reason to reject H0; the 120 observations are consistent with a N(90, 23.44) population distribution, as desired. ■ Goodness-of-Fit Tests for Composite Hypotheses The goodness-of-fit test based on Pearson’s chi-squared theorem involves a simple null hypothesis, in the sense that each pi0 is a specified number, so that the expected cell counts when H0 is true are completely determined. But in some situations, H0 states only that the pi’s are functions of other parameters h1, …, hm without specifying the values of these hj’s. For example, a population may be in equilibrium with respect to proportions of the three genotypes AA, Aa, and aa. With p1, p2, and p3 denoting these proportions (probabilities), one may wish to test H0 : p1 ¼ h2 ; p2 ¼ 2hð1 hÞ; p3 ¼ ð1 hÞ2 , where h represents the proportion of gene A in the population. This hypothesis is composite because knowing that H0 is true does not uniquely determine the cell probabilities and expected cell counts, but only their general form. More generally, the null hypothesis now states that each pi is a function of a small number of parameters h = (h1, …, hm) with the hj’s otherwise unspecified: H0 : p1 ¼ p1 ðhÞ; . . .; pk ¼ pk ðhÞ Ha : the hypothesis H0 is not true
ð13:2Þ
830
13
Chi-Squared Tests
In the genotype example, m = 1 (there is only one h), p1(h) = h2, p2(h) = 2h(1 – h), and p3(h) = (1 – h)2. To carry out a v2 test, the unknown h’s must first be estimated. In the case k = 2, there is really only a single rv, N1 (since N1 + N2 = n), which has a binomial distribution. The joint probability that N1 = n1 and N2 = n2 is then n n1 n2 PðN1 ¼ n1 ; N2 ¼ n2 Þ ¼ p p / pn11 pn22 n1 1 2 where p1 + p2 = 1 and n1 + n2 = n. For general k, the joint distribution of N1, …, Nk is the multinomial distribution (Section 5.1) with PðN1 ¼ n1 ; . . .; Nk ¼ nk Þ / pn11 pn22 pnk k which, when H0 is true, becomes PðN1 ¼ n1 ; . . .; Nk ¼ nk Þ / ½p1 ðhÞn1 ½pk ðhÞnk
METHOD OF MULTINOMIAL ESTIMATION
ð13:3Þ
Let n1, n2, …, nk denote the observed values of N1, …, Nk. Then ^h1 ; . . .; ^hm are those values of the hj’s that maximize Expression (13.3), that is, the maximum likelihood estimates with respect to the multinomial model.
Example 13.5 In humans there is a blood group, the MN group, that is composed of individuals having one of three blood types: M, MN, or N. Type is determined by two alleles, and there is no dominance, so the three possible genotypes give rise to three phenotypes. A population consisting of individuals in the MN group is in equilibrium if PðMÞ ¼ p1 ¼ h2
PðMNÞ ¼ p2 ¼ 2hð1 hÞ PðNÞ ¼ p3 ¼ ð1 hÞ2
for some h. Suppose a sample from such a population yielded the results shown in Table 13.3. Table 13.3 Observed counts for Example 13.5 Type Observed count
M 125
MN 225
M 150
n = 500
Then Expression (13.3) becomes ½p1 ðhÞn1 ½p2 ðhÞn2 ½p3 ðhÞn3 ¼ ½h2 n1 ½2hð1 hÞn2 ½ð1 hÞ2 n3 ¼ 2n2 h2n1 þ n2 ð1 hÞn2 þ 2n3 Maximizing this with respect to h (or, equivalently, maximizing the natural logarithm of this quantity, which is easier to differentiate) yields
13.1
Goodness-of-Fit Tests
831
^h ¼
2n1 þ n2 2n1 þ n2 ¼ ½ð2n1 þ n2 Þ þ ðn2 þ 2n3 Þ 2n
With n1 = 125 and n2 = 225, ^h ¼ 475=1000 ¼ :475.
■
Once h = (h1, …, hm) has been estimated by ^h ¼ ð^ h1 ; . . .; ^ hm Þ, the estimated expected cell count ^ for the ith category is n^pi ¼ npi ðhÞ. These are now used in place of the npi0’s in Expression (13.1) to specify a v2 statistic. The following theorem was proved by R. A. Fisher in 1924 as a generalization of Pearson’s chi-squared test. FISHER’S CHI-SQUARED THEOREM
Under general “regularity” conditions on h1, …, hm and the pi(h)’s, if h1, …, hm are estimated by maximizing the multinomial expression (13.3), the rv v2 ¼
k k X ^ i Þ2 X ðNi nP ½Ni npi ð^ hÞ2 ¼ ^i nP npi ð^ hÞ i¼1
i¼1
has approximately a chi-squared distribution with k – 1 – m df when H0 of (13.2) is true. An approximately level a test of H0 versus Ha is then to reject H0 if v2 v2a;k1m . ^ 5 for every i. In practice, the test can be used if npi ðhÞ Notice that the number of degrees of freedom is reduced by the number of hj’s estimated. Example 13.6 (Example 13.5 continued) With ^h ¼ :475 and n = 500, the estimated expected cell counts are n^p1 ¼ np1 ð^hÞ ¼ n ^h2 ¼ 500 ð:475Þ2 ¼ 112:81 n^p2 ¼ np2 ð^hÞ ¼ n 2^hð1 ^hÞ ¼ 500 2ð:475Þð:525Þ ¼ 249:38 n^p3 ¼ np3 ð^hÞ ¼ n ð1 ^hÞ2 ¼ 500 ð:525Þ2 ¼ 137:81 Notice that these estimated expected counts sum to n = 500. Then v2 ¼
ð125 112:81Þ2 ð225 249:38Þ2 ð150 137:81Þ2 þ þ ¼ 4:78 112:81 249:38 137:81
Since v2:05;k1m ¼ v2:05;311 ¼ v2:05;1 ¼ 3:843 and 4.78 3.843, H0 is rejected at the .05 significance level (software provides P-value = .029). Therefore, the data does not conform to the proposed equilibrium model. Even using the h estimate that “fits” the data best, the expected counts under the null model are too discordant with the observed counts. ■ Example 13.7 Consider a series of games between two teams, I and II, that terminates as soon as one team has won four games (with no possibility of a tie)—this is the “best of seven” format used for many professional league play-offs. A simple probability model for such a series assumes that
832
13
Chi-Squared Tests
outcomes of successive games are independent and that the probability of team I winning any particular game is a constant h. We arbitrarily designate team I the better team, so that h .5. Any particular series can terminate after 4, 5, 6, or 7 games. Let p1(h), p2(h), p3(h), p4(h) denote the probability of termination in 4, 5, 6, and 7 games, respectively. Then p1 ðhÞ ¼ PðI wins in 4 gamesÞ þ PðII wins in 4 gamesÞ ¼ h4 þ ð 1 h Þ 4 p2 ðhÞ ¼ PðI wins 3 of the first 4 and the fifthÞ þ PðI loses 3 of the first 4 and the fifthÞ 4 4 3 hð1 hÞ3 ð1 hÞ h ð1 hÞ h þ ¼ 1 3 h i 3 ¼ 4hð1 hÞ h þ ð1 hÞ3 p3 ðhÞ ¼ 10h2 ð1 hÞ2 ½h2 þ ð1 hÞ2 p4 ðhÞ ¼ 20h3 ð1 hÞ3 The Mathematics Magazine article “Seven-Game Series in Sports” by Groeneveld and Meeden tested the fit of this model to results of National Hockey League playoffs during the period 1943–1967, when league membership was stable. The data appears in Table 13.4. Table 13.4 Observed and expected counts for the simple model i=1 4 games 15 (16.351)
i=2 5 games 26 (24.153)
i=3 6 games 24 (23.240)
i=4 7 games 18 (19.256)
n = 83
The estimated expected cell counts are 83pi ð^hÞ, where ^ h is the value of h that maximizes the multinomial expression n
h4 þ ð1 hÞ4
o15 n h io26 n h io24 n o18 4hð1 hÞ h3 þ ð1 hÞ3 10h2 ð1 hÞ2 h2 þ ð1 hÞ2 20h3 ð1 hÞ3
ð13:4Þ ^ so it must be Standard calculus methods fail to yield a nice formula for the maximizing value h, ^ ^ computed using numerical methods. The result is h ¼ :654, from which pi ðhÞ and the estimated expected cell counts in Table 13.4 were computed. The resulting test statistic value is v2 = .360, much lower than the critical value v2:10;k1m ¼ v2:10;411 ¼ v2:10;2 ¼ 4:605. There is thus no reason to reject the simple model as applied to NHL playoff series, at least for that early era. The cited article also considered World Series data for the period 1903–1973. For the preceding model, v2 = 5.97 4.605, so the model does not seem appropriate. The suggested reason for this is that for this simple model it can be shown that Pðseries lasts exactly six games j series lasts at least six gamesÞ :5;
ð13:5Þ
whereas of the 38 best-of-seven series that actually lasted at least six games, only 13 lasted exactly six. The following alternative model is then introduced:
13.1
Goodness-of-Fit Tests
833
p1 ðh1 ; h2 Þ ¼ h41 þ ð1 h1 Þ4 p2 ðh1 ; h2 Þ ¼ 4h1 ð1 h1 Þ½h31 þ ð1 h1 Þ3
p3 ðh1 ; h2 Þ ¼ 10h21 ð1 h1 Þ2 h2 p4 ðh1; h2 Þ ¼ 10h21 ð1 h1 Þ2 ð1 h2 Þ
The first two pi’s are identical to the simple model, while h2 is the conditional probability of (13.5), h2 that maximize the which can now be any number between zero and one. The values of ^ h1 and ^ h2 ¼ :342. multinomial expression analogous to (13.4) are determined numerically as ^ h1 ¼ :614 and ^ A summary appears in Table 13.5, and v2 = .384. Two parameters are estimated, so df = k – 1 – m = 4 – 1 – 2 = 1 and :384\v2:10;1 ¼ 2:706, indicating a good fit of the data to this new model. Table 13.5 Observed and expected counts for the more complex model 4 games
5 games
6 games
7 games
12 (10.85)
16 (18.08)
13 (12.68)
25 (24.39)
■ One of the regularity conditions on the hj’s in Fisher’s theorem is that they be functionally independent of one another. That is, no single hj can be determined from the values of other hj’s, so that m is the number of functionally independent parameters estimated. A general rule for degrees of freedom in a chi-squared test is the following. GENERAL v2 DF RULE
v df ¼ 2
number of freely determined cell counts
number of independent parameters estimated
This rule will be used in connection with several different chi-squared tests in the next section. v2 for Probability Distributions with Parameter Values Unspecified In Examples 13.2–13.4, we considered goodness-of-fit tests to assess whether quantitative data was 9 consistent with a particular distribution, such as Binð4; 16 Þ or N(90, 23.44). Pearson’s chi-squared theorem could be applied because all model parameters were completely specified. But quite often researchers wish to determine whether their data conforms to any member of a particular family—any Poisson distribution, any Weibull distribution, etc. To use the v2 test to see whether the distribution is Poisson, for example, the parameter l must be estimated. In addition, because there are actually an infinite number of possible values of a Poisson variable, these values must be grouped so that there are a finite number of cells. Example 13.8 Table 13.6 presents count data on X = the number of egg pouches produced by B. alexandrina snails that were subjected to both parasitic infection and drought stress (meant to simulate the effects of climate change), as reported in the article “One Stimulus, Two Responses: Host and Parasite Life-History Variation in Response to Environmental Stress” (Evolution 2016: 2640–2646).
Table 13.6 Observed counts for Example 13.8 Cell No. of egg pouches Observed count
i=1 0
i=2 1
i=3 2
i=4 3
i=5 4
44
2
5
1
9
834
13
Chi-Squared Tests
Denoting the sample values by x1, …, x61, 44 of the xi’s were 0, two were 1, and so on. The nine observed counts in the last cell were 4, 5, 6, 6, 7, 11, 13, 15, and 17, but these have been collapsed into a single “ 4” category in order to ensure that all expected counts will be at least 5. The authors considered fitting a Poisson distribution to the data; let l denote the Poisson parameter. The estimate of l required for Fisher’s v2 procedure is obtained by maximizing the multinomial expression (13.3). The cell probabilities are pi ðlÞ ¼ PðX ¼ i 1Þ ¼ p5 ðlÞ ¼ PðX ¼ 4Þ ¼ 1
el li1 i ¼ 1; 2; 3; 4 ði 1Þ! 3 X el lx x¼0
x!
so the right-hand side of (13.3) becomes
el l0 0!
44
el l1 1!
2
el l2 2!
5
el l3 3!
1 " 1
3 X el lx x¼0
#9
x!
ð13:6Þ
There is no nice formula for the maximizing value of l in Expression (13.6), so it must be obtained numerically. ■ While maximizing Expression (13.6) with respect to l is challenging, there is an alternative way to estimate l: apply the method of maximum likelihood from Chapter 7 to the full sample X1 ; . . .; Xn . Because parameter estimates are usually much more difficult to compute from the multinomial likelihood function (13.3) than from the full-sample likelihood, they are often computed using this latter method. Using Fisher’s critical value v2a;k1m then results in an approximate level a test. Example 13.9 (Example 13.8 continued) The likelihood of the observed sample x1, …, x61 under a Poisson(l) model is LðlÞ ¼ pðx1 ; lÞ pðx61 ; lÞ ¼
el lx1 el lx61 e61l lRxi e61l l99 ¼ ¼ x1 ! x61 ! x1 ! x61 ! x1 ! x61 !
The value of l for which this is maximized—i.e., the maximum likelihood estimate of l—is P ^¼ ^ ¼ 1:623, the estimated expected cell counts are computed l xi =n ¼ 99=61 ¼ 1:623. Using l from npi ð^ lÞ, where n = 61. For example, np1 ð^ lÞ ¼ 61
e1:623 ð1:623Þ0 ¼ ð61Þð:1973Þ ¼ 12:036 0!
Similarly, n·p2(μ̂) = 19.534, n·p3(μ̂) = 15.852, and n·p4(μ̂) = 8.576, from which the last count is n·p5(μ̂) = 61 − [12.036 + ⋯ + 8.576] = 5.002. Notice that, as planned, all of the estimated expected cell counts are at least 5, as required for the accuracy of chi-squared tests. Then
χ² = (44 − 12.036)²/12.036 + ⋯ + (9 − 5.002)²/5.002 = 117.938
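This entire calculation is easy to reproduce with statistical software; here is a minimal R sketch (object names are ours; the counts and the raw "≥4" values come from Table 13.6 and the accompanying text):

    # Poisson goodness-of-fit computation of Example 13.9 (sketch)
    obs <- c(44, 2, 5, 1, 9)                    # observed counts for 0, 1, 2, 3, >=4 egg pouches
    xs  <- c(rep(0, 44), rep(1, 2), rep(2, 5), 3, 4, 5, 6, 6, 7, 11, 13, 15, 17)
    n   <- length(xs)                           # 61 snails
    muhat <- mean(xs)                           # full-sample MLE of mu: 99/61 = 1.623
    phat  <- c(dpois(0:3, muhat), 1 - ppois(3, muhat))   # estimated cell probabilities
    expected <- n * phat                        # 12.04, 19.53, 15.85, 8.58, 5.00
    chi2 <- sum((obs - expected)^2 / expected)  # about 117.9
    c(chi2, qchisq(.95, df = 5 - 1 - 1))        # compare with the critical value 7.815

Note that a built-in call such as chisq.test(obs, p = phat) would use k − 1 = 4 degrees of freedom rather than k − 1 − m = 3, which is why the statistic and critical value are assembled by hand in this sketch.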
Since m = 1 and k = 5, at level .05 we need χ²_{.05, 5−1−1} = χ²_{.05, 3} = 7.815. Because 117.938 > 7.815, we strongly reject H0 at the .05 significance level (in fact, with such a ridiculously large test statistic value, H0 is rejected at any reasonable α). The largest contributor to the χ² statistic here is the number of 0's in the sample: 44 were observed, but under a Poisson model only 12.036 are expected. This excess of zeros often occurs with "count" data, and statisticians have developed zero-inflated versions of the Poisson and other distributions to accommodate this reality. ■

When model parameters are estimated using full-sample maximum likelihood, it is known that the true level α critical value falls between χ²_{α, k−1−m} and χ²_{α, k−1}. So, applying Fisher's chi-squared method to situations such as Example 13.9 will occasionally lead to incorrectly rejecting H0, though in practice this is uncommon. Sometimes even the maximum likelihood estimates based on the full sample are quite difficult to compute. This is the case, for example, for the two-parameter generalized negative binomial distribution (Exercise 17). In such situations, method-of-moments estimates are often used and the resulting χ² value compared to χ²_{α, k−1−m}, although it is not known to what extent the use of moments estimators affects the true critical value.

In theory, the chi-squared test can also be used to test whether a sample comes from a specified family of continuous distributions, such as the exponential family or the normal family. However, when the parameter values are not specified, the goodness-of-fit test is rarely used for this purpose. Instead, practitioners use one or more of the test procedures mentioned in connection with probability plots in Section 4.6, such as the Shapiro–Wilk or Anderson–Darling test. For example, the Shapiro–Wilk procedure tests the hypotheses

H0: the population from which the sample was drawn is normal
Ha: H0 is not true

A P-value is calculated based on a test statistic similar to the correlation coefficient associated with the points in a normal probability plot. These procedures are generally considered superior to the chi-squared tests of this section for continuous families; in particular, they do not rely on creating arbitrary class intervals.

More on the Goodness-of-Fit Test

When one or more expected counts are less than 5, the chi-squared distribution does not necessarily accurately approximate the sampling distribution of the test statistic (13.1). This can occur either because the sample size n is small or because one of the hypothesized proportions pi0 is small. If a larger sample is not available, a common practice is to sensibly "merge" some of the categories with small expected counts, so that the expected count for the merged category is larger. This might occur, for example, if we examined the political party affiliation of a sample of graduate students, categorized as Republican, Democrat, Libertarian, Green, Peace and Freedom, and Independent. The counts for the nonmajor parties might be quite small, in which case we could combine them to form, say, three categories: Republican, Democrat, and Other. The downside of this technique is that information has been discarded—two students belonging to different political parties (e.g., Libertarian and Green) are no longer distinguishable.
Although the chi-squared test was developed to handle situations in which k > 2, it can also be used when k = 2. The null hypothesis in this case can be stated as H0: p1 = p10, since the relations p2 = 1 − p1 and p20 = 1 − p10 make the inclusion of p2 = p20 in H0 redundant. The alternative hypothesis is Ha: p1 ≠ p10. These hypotheses can also be tested using a two-tailed one-proportion z test with test statistic

Z = (p̂1 − p10)/√(p10 p20/n) = (N1/n − p10)/√(p10(1 − p10)/n)

In fact, the two test procedures are completely equivalent. This is because it can be shown that Z² = χ² (see Exercise 12) and z²_{α/2} = χ²_{α,1}, so that χ² ≥ χ²_{α,1} if and only if |Z| ≥ z_{α/2}.¹ In other words, the two-tailed z test from Chapter 9 rejects H0 if and only if the chi-squared goodness-of-fit test does. However, if the alternative hypothesis is either Ha: p1 > p10 or Ha: p1 < p10, the chi-squared test cannot be used. One must then revert to an upper- or lower-tailed z test.

As is the case with all test procedures, one must be careful not to confuse statistical significance with practical significance. A computed χ² that leads to the rejection of H0 may be a result of a very large sample size rather than any practical differences between the hypothesized pi0's and true pi's. Thus if p10 = p20 = p30 = 1/3, but the true pi's have values .330, .340, and .330, a large value of χ² is sure to arise with a sufficiently large n. Before rejecting H0, the p̂i's should be examined to see whether they suggest a model different from that of H0 from a practical point of view.

Exercises: Section 13.1 (1–18)

1. What conclusion would be appropriate for an upper-tailed chi-squared test in each of the following situations?
a. α = .05, df = 4, χ² = 12.25
b. α = .01, df = 3, χ² = 8.54
c. α = .10, df = 2, χ² = 4.36
d. α = .01, k = 6, χ² = 10.20

2. Say as much as you can about the P-value for an upper-tailed chi-squared test in each of the following situations:
a. χ² = 7.5, df = 2
b. χ² = 13.0, df = 6
c. χ² = 18.0, df = 9
d. χ² = 21.3, k = 5
e. χ² = 5.0, k = 4

3. A statistics department at a large university maintains a tutoring center for students in its introductory service courses. The center has been staffed with the expectation that 40% of its clients would be from the business statistics
course, 30% from engineering statistics, 20% from the statistics course for social science students, and the other 10% from the course for agriculture students. A random sample of n = 120 clients revealed 52, 38, 21, and 9 from the four courses. Does this data suggest that the percentages on which staffing was based are not correct? State and test the relevant hypotheses using α = .05.

4. It is hypothesized that when homing pigeons are disoriented in a certain manner, they will exhibit no preference for any direction of flight after takeoff (so that the direction X should be uniformly distributed on the interval from 0° to 360°). To test this, 120 pigeons were disoriented, let loose, and the direction of flight of each was recorded; the resulting data follows. Use the chi-squared test at level .10 to see whether the data supports the hypothesis.
¹The fact that z²_{α/2} = χ²_{α,1} is a consequence of the relationship between the standard normal distribution and the chi-squared distribution with 1 df: if Z ~ N(0, 1), then by definition Z² has a chi-squared distribution with ν = 1. See Section 6.3.
Direction Frequency
If we define p = P(Xi > Yi) = P(Di > 0), then we may test H0: p = .5 versus Ha: p > .5 as at the end of Example 14.2.

Example 14.3 Do technological improvements "slow down" as a product spends more time on the market? The 2017 MIT paper "Exploring the Relationship Between Technological Improvement and Innovation Diffusion: An Empirical Test" attempts to answer this question by quantifying the improvement rate (%/year) of 18 items, from washing machines to laptops, during both the early stage and late stage of each item's market presence. The data is summarized below.

Early stage    12.96    5.50    5.50    3.90    3.52    3.90   10.53   38.18   31.79
Late stage     12.65    3.52    5.00    3.10    2.93    3.10   14.41   31.79   36.33
Difference     −0.31   −1.98   −0.50   −0.80   −0.59   −0.80   +3.88   −6.39   +4.54

Early stage    15.49   34.30   32.15   16.62   32.37   24.25    3.87  180.84   26.80
Late stage     18.80   25.73   32.80   15.84   36.33   27.15    3.92   84.51   47.52
Difference     +3.31   −8.57   +0.65   −0.78   +3.96   +2.90   +0.05  −96.33  +20.72
With D = difference in improvement rate (late stage minus early stage), the authors tested the hypotheses H0: μ̃D = 0 versus Ha: μ̃D < 0; the alternative hypothesis aligns with the prevailing theory of a late-stage slowdown. Eight of the 18 differences are positive, and the lower-tailed P-value is the chance of observing eight or fewer positive differences if the true median difference is zero (so positive and negative are each equally likely):

P-value = P(K ≤ 8 when K ~ Bin(18, .5)) = B(8; 18, .50) = .408

The very large P-value, consistent with the close split (8 vs. 10) between negative and positive differences, suggests that H0 should not be rejected. The data does not lend credence to the slowdown theory of technological improvement. ■

Exercises: Section 14.1 (1–10)

1. Example 1.4 presented data on the starting salaries of n = 38 civil engineering graduates. Use this data to construct a 95% confidence interval for the population lower quartile (25th percentile) of civil engineering starting salaries.

2. The following Houston, TX house prices ($1000's) were extracted from zillow.com in August 2019:

162  165  167  188  189  194  200  233  236  247  248  257  258  286  290  307  330  345  377  389  459  460  513  569  1399
Treating these 25 houses as a random sample of all available homes in Houston, calculate a 90% upper confidence bound for the true median home price in that city.

3. Based on a random sample of 40 observations from any continuous population, construct a confidence interval formula for the population median that has confidence level (at least) 95%.

4. Let η.3 denote the 30th percentile of a population. Find the smallest sample size n for which P(Y1 ≤ η.3 ≤ Yn) is at least .99.
In other words, determine the smallest sample size for which the span of the sample, min to max, is a 99% CI for η.3.

5. Refer back to Exercise 2. The median home value in the state of Texas in August 2019 was $197,000. Use the data in Exercise 2 to test the hypothesis that the true median home price in Houston exceeds this value, at the .05 significance level.

6. The following data on grip strength (N) for 42 individuals was read from a graph in the article "Investigation of Grip Force, Normal Force, Contact Area, Hand Size, and Handle Size for Cylindrical Handles" (Human Factors 2008: 734–744):

 16   18   20   26   33   41   54   56   66
 68   87   91   95   98  106  109  111  118
127  131  135  145  147  149  151  168  172
183  189  190  200  210  220  229  230
233  238  244  259  294  329  403
a. What does the data suggest about the population distribution of grip strength? Why might the median be a more appropriate measure of "typical" grip strength than the mean?
b. Test the hypothesis that the population median grip strength is less than 170 N at the .05 significance level.

7. Child development specialists widely believe that diverse recreational activities can improve the social and emotional conduct of children. The article "Influence of Physical Activity on the Social and Emotional Behavior of Children Aged 2–5 Years" (Cuban J. of Gen. Integr. Med. 2016) reported a study of 25 young children diagnosed with social and emotional behavior problems. The children participated in a physical activity regimen for one year, and each child was measured for negative social behavior indicators (tantrums, crying, etc.) both before and after the regimen. Lower scores indicate improvement; the children's changes in score (post minus pre) are summarized below.

Score change < 0: 17     Score change = 0: 1     Score change > 0: 7
Use the sign test to determine whether the data indicates a statistically significant improvement in scores at the α = .05 level. [Hint: Delete the one 0 observation and work with the other 24 differences; this is a common way to address "ties" in pre- and post-intervention scores.]

8. Consider the following scene: an actor watches a man in a gorilla suit hide behind haystack A. The actor leaves the area, and the "gorilla" moves to behind haystack B, after which the actor re-enters. A video of this scene was shown to 22 (real) apes, and eye-tracking software was used to track which haystack they stared at more after the actor re-entered. "False belief" theory says that the viewer will look at haystack A more, matching the actor's mistaken belief that the gorilla is still there, despite the viewer knowing better. In the study, 17 of the apes spent more time looking at haystack A ("Great Apes Anticipate That Other Individuals Will Act According to False Beliefs," Science, 7 October 2016).
a. Use a sign test to determine whether the true median looking-time difference (A − B) supports the false belief theory in primates.
b. Since the data is paired (time looking at each of two haystacks for the 22 apes), the paired t procedure of Chapter 10 might also be applicable. What information would that test require, and what assumptions must be met?

9. The article "Hitting a High Note on Math Tests: Remembered Success Influences Test Preferences" (J. of Exptl. Psych. 2016: 17–38) reported a study in which 130 participants were administered two math tests (in random order): a shorter, more difficult exam and a longer, easier one. Participants were then asked to estimate how much time they had spent on each exam. Let μ̃D denote
the true median difference in time estimates (short test minus long test). Test the hypotheses H0: μ̃D = 0 versus Ha: μ̃D > 0, with Ha supporting the psychological contention that people perceive easier tasks to be quicker, using a sign test based on the fact that 109 participants gave positive differences and 21 gave negative differences. (In fact, test-takers required 3.2 min longer, on average, to complete the lengthier exam.)
10. Consider the following data on resting energy expenditure (REE, in calories per day) for eight subjects both while on an intermittent fasting regimen and while on a standard diet ("Intermittent Fasting Does Not Affect Whole-Body Glucose, Lipid, or Protein Metabolism," Amer. J. of Clinical Nutr. 2009: 1244–1251):
Subject       1        2        3        4        5        6        7        8
I.F.       1753.7   1604.4   1576.5   1279.7   1754.2   1695.5   1700.1   1717.0
Standard   1755.0   1691.1   1697.1   1477.7   1785.2   1669.7   1901.3   1735.3

Let μ̃D = true median difference in REE (IF minus standard diet). Test the hypotheses H0: μ̃D = 0 versus Ha: μ̃D < 0 at the .05 significance level using the sign test.
14.2 One-Sample Rank-Based Inference
The previous section introduced the sign test for assessing the plausibility of H0: μ̃ = μ̃0, where μ̃ denotes a population median. The basis of the test was to consider the quantities X1 − μ̃0, …, Xn − μ̃0 and count how many of those differences are positive. Thus the original sample is reduced to a collection of n "signs" (+ or −). If H0 is true, there should be roughly equal numbers of +'s and −'s, and the test statistic measures the degree of discrepancy from that 50–50 balance. The sign test is applicable to any continuous population distribution. Here, we consider a test procedure that is more powerful than the sign test but requires an additional distributional assumption.

Suppose a research chemist replicated a particular experiment a total of 10 times and obtained the following values of reaction temperature (°C), ordered from smallest to largest:

−.76   −.19   −.05   .57   1.30   2.02   2.17   2.46   2.68   3.02
The distribution of reaction temperature is of course continuous. Suppose the investigator is willing to assume that this distribution is symmetric, in which case the two halves of the distribution on either side of μ̃ are mirror images of each other. (Provided that the mean μ exists for this symmetric distribution, μ̃ = μ and they are both the point of symmetry.) The assumption of symmetry may at first seem quite bold, but remember that we have frequently assumed a normal distribution for inference procedures. Since a normal distribution is symmetric, the assumption of symmetry without any additional distributional specification is actually a weaker assumption than normality.

Let's now consider testing the specific null hypothesis that μ̃ = 0. Symmetry implies that a temperature of any particular magnitude, say 1.50, is no more likely to be positive (+1.50) than to be negative (−1.50). A glance at the data above casts doubt on this hypothesis; for example, the sample median is 1.66, which is far larger in magnitude than any of the three negative observations. Figure 14.1 shows graphs of two symmetric pdfs, one for which H0 is true and the other for which the median of the distribution considerably exceeds 0. In the first case we expect the magnitudes of the negative observations in the sample to be comparable to those of the positive sample observations. However, in the second case observations of large absolute magnitude will tend to be positive rather than negative.
[Figure 14.1 Distributions for which (a) μ̃ = 0; (b) μ̃ > 0]
A Rank-Based Test Statistic

For the sample of ten reaction temperatures, let's for the moment disregard the signs of the observations and rank their magnitudes (i.e., absolute values) from 1 to 10, with the smallest getting rank 1, the second smallest rank 2, and so on. Then apply the original sign of each observation to the corresponding rank, so some signed ranks will be negative (e.g., −3) whereas others will be positive (e.g., +8).

Absolute value   .05   .19   .57   .76   1.30   2.02   2.17   2.46   2.68   3.02
Rank               1     2     3     4      5      6      7      8      9     10
Signed rank       −1    −2    +3    −4     +5     +6     +7     +8     +9    +10
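The signed ranks in the bottom row are easy to generate with software; for instance, a quick R sketch using the ten temperatures (entered in their original order):

    temp <- c(-.76, -.19, -.05, .57, 1.30, 2.02, 2.17, 2.46, 2.68, 3.02)
    sign(temp) * rank(abs(temp))   # -4 -2 -1  3  5  6  7  8  9 10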
The test statistic for the procedure developed in this section will be S+ = the sum of the positively signed ranks. For the given data, the observed value of S+ is

s+ = sum of the positive ranks = 3 + 5 + 6 + 7 + 8 + 9 + 10 = 48

When the median of the distribution is much greater than 0, most of the observations with large absolute magnitudes should be positive, resulting in positively signed ranks and a large value of s+. On the other hand, if the median is 0, magnitudes of positively signed observations should be intermingled with those of negatively signed observations, in which case s+ will not be very large. (As noted before, this characterization depends on the underlying distribution being symmetric.) Thus we should reject H0: μ̃ = 0 in favor of Ha: μ̃ > 0 when s+ is "quite large"—the rejection region should have the form s+ ≥ c.

The critical value c should be chosen so that the test has a desired significance level (type I error probability), such as .05 or .01. This necessitates finding the distribution of the test statistic S+ when the null hypothesis is true. Let's consider n = 5, in which case there are 2^5 = 32 ways of applying signs to the five ranks 1, 2, 3, 4, and 5 (each rank could have a − sign or a + sign). The key is that when H0 is true, any collection of five signed ranks has the same chance as does any other collection. That is, the smallest observation in absolute magnitude is equally likely to be positive or negative, the same is true of the second smallest observation in absolute magnitude, and so on. Thus the collection −1, 2, 3, −4, 5 of signed ranks is just as likely as the collection 1, 2, 3, 4, −5, and just as likely as any one of the other 30 possibilities.

Table 14.1 lists the 32 possible signed-rank sequences when n = 5 along with the value s+ for each sequence. This immediately gives the "null distribution" of S+. For example, Table 14.1 shows that three of the 32 possible sequences have s+ = 8, so P(S+ = 8 when H0 is true) = 3/32. This null distribution appears in Table 14.2 and Figure 14.2. Notice that it is symmetric about 7.5; more generally, S+ is symmetrically distributed over the possible values 0, 1, 2, …, n(n + 1)/2 when H0 is true. This symmetry will be important in relating the rejection region of lower-tailed and two-tailed tests to that of an upper-tailed test.
Table 14.1 Possible signed-rank sequences for n = 5

Sequence            s+      Sequence            s+
−1 −2 −3 −4 −5       0      −1 +2 −3 −4 +5       7
+1 −2 −3 −4 −5       1      −1 −2 +3 −4 +5       8
−1 +2 −3 −4 −5       2      +1 +2 −3 −4 +5       8
−1 −2 +3 −4 −5       3      +1 −2 +3 −4 +5       9
+1 +2 −3 −4 −5       3      −1 +2 +3 −4 +5      10
+1 −2 +3 −4 −5       4      +1 +2 +3 −4 +5      11
−1 −2 −3 +4 −5       4      −1 +2 +3 +4 −5       9
+1 −2 −3 +4 −5       5      +1 +2 +3 +4 −5      10
−1 +2 −3 +4 −5       6      −1 −2 −3 +4 +5       9
−1 −2 +3 +4 −5       7      +1 −2 −3 +4 +5      10
+1 +2 −3 +4 −5       7      −1 +2 −3 +4 +5      11
+1 −2 +3 +4 −5       8      −1 −2 +3 +4 +5      12
−1 +2 +3 −4 −5       5      +1 +2 −3 +4 +5      12
+1 +2 +3 −4 −5       6      +1 −2 +3 +4 +5      13
−1 −2 −3 −4 +5       5      −1 +2 +3 +4 +5      14
+1 −2 −3 −4 +5       6      +1 +2 +3 +4 +5      15
Table 14.2 Null distribution of S+ when n = 5

s+        0      1      2      3      4      5      6      7
p(s+)   1/32   1/32   1/32   2/32   2/32   3/32   3/32   3/32

s+        8      9     10     11     12     13     14     15
p(s+)   3/32   3/32   3/32   2/32   2/32   1/32   1/32   1/32
[Figure 14.2 Null distribution of S+ when n = 5]
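Tables 14.1 and 14.2 can also be produced by brute force, since under H0 each of the 2^5 sign assignments is equally likely. A minimal R sketch (our own object names):

    n <- 5
    signs <- as.matrix(expand.grid(rep(list(0:1), n)))   # 32 rows; 1 means a "+" attached to that rank
    splus <- signs %*% (1:n)                              # S+ for each of the 32 sequences
    table(splus) / 2^n                                    # matches Table 14.2
    dsignrank(0:15, n)                                    # built-in exact null pmf of S+ agrees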
For n = 10 there are 2^10 = 1024 possible signed-rank sequences, so a listing would involve much effort. Each sequence, though, would have probability 1/1024 when H0 is true, from which the distribution of S+ when H0 is true can be obtained.

We are now in a position to determine a rejection region for testing H0: μ̃ = 0 versus Ha: μ̃ > 0 that has a suitably small significance level α. For the case n = 5, consider the rejection region R = {s+ : s+ ≥ 13} = {13, 14, 15}. Then
α = P(reject H0 when H0 is true) = P(S+ = 13, 14, or 15 when H0 is true) = 1/32 + 1/32 + 1/32 = 3/32 = .094

so that R = {13, 14, 15} specifies a test with approximate level .1. For the rejection region {14, 15}, α = 2/32 = .063. For the sample x1 = .58, x2 = 2.50, x3 = −.21, x4 = 1.23, x5 = .97, the signed-rank sequence is −1, +2, +3, +4, +5, so s+ = 14 and at level .063 (or anything higher) H0 would be rejected.

The Wilcoxon Signed-Rank Test

Because the underlying distribution is assumed symmetric, μ = μ̃, so we will state the hypotheses of interest in terms of μ rather than μ̃.¹ When the hypothesized value of μ is μ0, the absolute differences |x1 − μ0|, …, |xn − μ0| must be ranked from smallest to largest.

WILCOXON SIGNED-RANK TEST
Null hypothesis: H0: μ = μ0
Test statistic value: s+ = the sum of the ranks associated with positive (xi − μ0)'s

Alternative Hypothesis        Rejection Region for Level α Test
Ha: μ > μ0                    s+ ≥ c1
Ha: μ < μ0                    s+ ≤ n(n + 1)/2 − c1
Ha: μ ≠ μ0                    either s+ ≥ c or s+ ≤ n(n + 1)/2 − c

where the critical values c1 and c obtained from Appendix Table A.11 satisfy P(S+ ≥ c1) ≈ α and P(S+ ≥ c) ≈ α/2 when H0 is true.

Example 14.4 A producer of breakfast cereals wants to verify that a filler machine is operating correctly. The machine is supposed to fill one-pound boxes with 460 g, on average. This is a little above the 453.6 g needed for one pound. When the contents are weighed, it is found that 15 boxes yield the following measurements:

454.4   470.8   447.5   453.2   462.6   445.0   455.9   458.2
461.6   457.3   452.0   464.3   459.2   453.5   465.8
Does the data provide convincing statistical evidence that the true mean weight differs from 460 g? Let's use the seven-step hypothesis testing procedure outlined earlier in the book.
1. Parameter: μ = the true average weight of all such cereal boxes
2. Hypotheses: H0: μ = 460 versus Ha: μ ≠ 460
3. It is believed that deviations of any magnitude from 460 g are just as likely to be positive as negative (in accord with the symmetry assumption), but the distribution may not be normal. Therefore, the Wilcoxon signed-rank test will be used to see if the filler machine is calibrated correctly.
¹If the tails of the distribution are "too heavy," as is the case with the Cauchy distribution, then μ will not exist. In such cases, the Wilcoxon test will still be valid for tests concerning μ̃.
4. The test statistic will be s+ = the sum of the ranks associated with positive (xi − 460)'s.
5. From Appendix Table A.11, P(S+ ≥ 95) = P(S+ ≤ 25) = .024 when H0 is true, so the two-tailed test with approximate level .05 rejects H0 when either s+ ≥ 95 or s+ ≤ 25 [the exact α is 2(.024) = .048].
6. Subtracting 460 from each measurement gives

−5.6   −12.5   10.8   −6.8   −15.0    2.6   −4.1   −1.8
 1.6    −8.0   −2.7    4.3    −6.5    −.8    5.8
The ranks are obtained by ordering these from smallest to largest without regard to sign.

Absolute magnitude   .8   1.6   1.8   2.6   2.7   4.1   4.3   5.6   5.8   6.5   6.8   8.0   10.8   12.5   15.0
Rank                  1     2     3     4     5     6     7     8     9    10    11    12     13     14     15
Sign                  −     +     −     +     −     −     +     −     +     −     −     −      +      −      −
Thus s+ = 2 + 4 + 7 + 9 + 13 = 35.
7. Since s+ = 35 is not in the rejection region, it cannot be concluded at level .05 that μ differs from 460. Even at level .094 (approximately .1), H0 cannot be rejected, since P(S+ ≤ 30) = P(S+ ≥ 90) = .047 implies that s+ values between 30 and 90 are not significant at that level. The P-value for this test thus exceeds .1. ■

Although a theoretical implication of the continuity of the underlying distribution is that ties will not occur, in practice they often do because of the discreteness of measuring instruments. If there are several data values with the same absolute magnitude, then they are typically assigned the average of the ranks they would receive if they differed very slightly from one another. For example, if in Example 14.4 x8 = 458.2 were instead 458.4, then two different values of (xi − 460) would have absolute magnitude 1.6. The ranks to be averaged would be 2 and 3, so each would be assigned rank 2.5.

Large-Sample Distribution of S+

Figure 14.2 displays the null distribution of S+ for n = 5, a symmetric distribution centered at 7.5. It is straightforward to show (see Exercise 18) that when H0 is true,

E(S+) = n(n + 1)/4        V(S+) = n(n + 1)(2n + 1)/24
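These moment formulas can be checked against the exact null distribution; for example, a quick R verification for n = 5 (a sketch, not part of the original example):

    n <- 5
    s <- 0:(n * (n + 1) / 2)
    p <- dsignrank(s, n)             # exact null pmf of S+
    sum(s * p)                       # 7.5  = n(n+1)/4
    sum(s^2 * p) - sum(s * p)^2      # 13.75 = n(n+1)(2n+1)/24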
Moreover, when n is not small (say, n > 20), Lyapunov's central limit theorem (Chapter 6, Exercise 68) implies that S+ has an approximately normal distribution. Appendix Table A.11 only presents critical values for the Wilcoxon signed-rank test for n ≤ 20; beyond that, the test may be performed using the "large-sample" test statistic

Z = [S+ − n(n + 1)/4] / √(n(n + 1)(2n + 1)/24)

which has approximately a standard normal distribution when H0 is true.

The Wilcoxon Test for Paired Data

When the data consisted of pairs (X1, Y1), …, (Xn, Yn) and the differences D1 = X1 − Y1, …, Dn = Xn − Yn were normally distributed, in Chapter 10 we used a paired t test for hypotheses about the expected difference μD. If normality is not assumed, hypotheses about μD can be tested by using the Wilcoxon signed-rank test on the Di's provided that the distribution of the differences is continuous
and symmetric. The null hypothesis is H0: μD = Δ0 for some null value Δ0 (most frequently 0), and the test statistic S+ is the sum of the ranks associated with the positive (Di − Δ0)'s.

Example 14.5 Poor sleep including insomnia is common among veterans, particularly those with PTSD. The article "Cognitive Behavioral Therapy for Insomnia and Imagery Rehearsal in Combat Veterans with Comorbid Posttraumatic Stress: A Case Series" (Mil. Behav. Health 2016: 58–64) reports on a pilot study involving 11 combat veterans diagnosed with both insomnia and PTSD. Each participant attended eight weekly individual therapy sessions with lessons including sleep education, relaxation training, and nightmare re-scripting. Total nightly sleep time (min) was recorded for each veteran both before the 8-week intervention and after. In the accompanying table, differences represent (sleep time after therapy) minus (sleep time before therapy).

Subject        1     2     3     4     5     6     7     8     9    10    11
Before       255   261   257   275   191   528   298   247   314   340   315
After        330   323   312   308   251   559   261   296   387   386   387
Difference    75    62    55    33    60    31   −37    49    73    46    72
Signed rank   11     8     6     2     7     1    −3     5    10     4     9
The relevant hypotheses are H0: μD = 0 versus Ha: μD > 0. Appendix Table A.11 shows that for a test with significance level approximately .01, the null hypothesis should be rejected if s+ ≥ 59. The test statistic value is 11 + 8 + ⋯ + 9 = 63, the sum of every rank except 3, which falls in the rejection region. We therefore reject H0 at significance level .01 in favor of the conclusion that the therapy regimen increases sleep time, on average, for this population. Figure 14.3 shows R output for this test, including the test statistic value (as V) and also the corresponding P-value, which is P(S+ ≥ 63 when H0 is true).
    Wilcoxon signed rank test
    data:  After and Before
    V = 63, p-value = 0.002441
    alternative hypothesis: true location shift is greater than 0

Figure 14.3 R output for Example 14.5                                        ■
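The output in Figure 14.3 is produced by R's wilcox.test function; a call along the following lines (with the Before and After vectors entered by us from the table above) yields it:

    After  <- c(330, 323, 312, 308, 251, 559, 261, 296, 387, 386, 387)
    Before <- c(255, 261, 257, 275, 191, 528, 298, 247, 314, 340, 315)
    wilcox.test(After, Before, paired = TRUE, alternative = "greater")
    # V = 63 is the sum of the positively signed ranks of the differences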
Efficiency of the Sign Test and Signed-Rank Test

When the underlying distribution being sampled is normal, any one of three procedures—the t test, the signed-rank test, or the sign test—can be used to test a hypothesis about μ (the point of symmetry). The t test is the best test in this case because among all level α tests it is the one having the greatest power (smallest type II error probabilities). On the other hand, neither the t test nor the signed-rank test should be applied to data from a clearly skewed distribution, for two reasons. First, lack of normality (resp., symmetry) violates the requirements for the validity of the t test (resp., signed-rank test). Second, both of these latter test procedures concern the mean μ, and for a heavily skewed population the mean is arguably of less interest than the median μ̃.

Test procedure       Population assumption   Parameter of interest   Power (assuming normality)
Sign test            Continuous              μ̃                       Least powerful
Signed-rank test     Symmetric               μ = μ̃
One-sample t test    Normal                  μ = μ̃                   Most powerful
Let us now specifically compare Wilcoxon's signed-rank test to the t test. Two questions will be addressed:
1. When the underlying distribution is normal, the "home ground" of the t test, how much is lost by using the signed-rank test?
2. When the underlying distribution is not normal, how much improvement can be achieved by using the signed-rank test?
Unfortunately, there are no simple answers to the two questions. The difficulty is that power for the Wilcoxon test is very difficult to determine for every possible distribution, and the same can be said for the t test when the distribution is not normal. Even if power were easily obtained, any measure of efficiency would clearly depend on which underlying distribution was assumed.

A number of different efficiency measures have been proposed by statisticians; one that many statisticians regard as credible is called asymptotic relative efficiency (ARE). The ARE of one test with respect to another is essentially the limiting ratio of sample sizes necessary to obtain identical error probabilities for the two tests. Thus if the ARE of one test with respect to a second equals .5, then when sample sizes are large, twice as large a sample size will be required of the first test to perform as well as the second test. Although the ARE does not characterize test performance for small sample sizes, the following results can be shown to hold:
1. When the underlying distribution is normal, the ARE of the Wilcoxon test with respect to the t test is approximately .95.
2. For any distribution, the ARE will be at least .86 and for many distributions will be much greater than 1.
We can summarize these results by saying that, in large-sample situations, the Wilcoxon test is never very much less efficient than the t test and may be much more efficient if the underlying distribution is far from normal. Although the issue is far from resolved in the case of sample sizes obtained in most practical problems, studies have shown that the Wilcoxon test performs reasonably and is thus a viable alternative to the t test. In contrast, the sign test has ARE less than .64 with respect to the t test when the underlying distribution is normal. (But, again, the sign test is arguably the only appropriate test for heavily skewed populations.)

The Wilcoxon Signed-Rank Interval

In Section 9.6, we discussed the "duality principle" that links hypothesis tests and confidence intervals. Suppose we have a level α test procedure for testing H0: θ = θ0 versus Ha: θ ≠ θ0 based on sample data x1, …, xn. If we let A denote the set of all θ0 values for which H0 is not rejected, then A is a 100(1 − α)% CI for θ.² The two-tailed Wilcoxon signed-rank test rejects H0 if s+ is either ≥ c or ≤ n(n + 1)/2 − c, where c is obtained from Appendix Table A.11 once the desired significance level α is specified. For fixed x1, …, xn, the 100(1 − α)% signed-rank interval will consist of all μ0 for which H0: μ = μ0 is not rejected at level α. To identify this interval, it is convenient to express the test statistic S+ in another form.
²There are pathological examples in which the set A is not an interval of θ values, but instead the complement of an interval or something even stranger. To be more precise, we should really replace the notion of a CI with that of a confidence set. In the cases of interest here, the set A does turn out to be an interval.
PROPOSITION
S+ = the number of pairwise averages (Xi + Xj)/2 with i ≤ j that are ≥ μ0
(These pairwise averages are known as Walsh averages.)
That is, if we average each xj in the list with each xi to its left, including (xj + xj)/2 = xj, and count the number of these averages that are ≥ μ0, s+ results. In moving from left to right in the list of sample values, we are simply averaging every pair of observations in the sample—again, including (xj + xj)/2—exactly once, so the order in which the observations are listed before averaging is not important. The equivalence of the two methods for computing s+ is not difficult to verify. The number of pairwise averages is
(n choose 2) + n = n(n + 1)/2. If either too many or too few of these pairwise averages are ≥ μ0,
H0 is rejected.

Example 14.6 The following observations are values of cerebral metabolic rate for rhesus monkeys: x1 = 4.51, x2 = 4.59, x3 = 4.90, x4 = 4.93, x5 = 6.80, x6 = 5.08, x7 = 5.67. The 28 pairwise averages are, in increasing order,

4.51    4.55    4.59    4.705   4.72    4.745   4.76    4.795   4.835   4.90
4.915   4.93    4.99    5.005   5.08    5.09    5.13    5.285   5.30    5.375
5.655   5.67    5.695   5.85    5.865   5.94    6.235   6.80
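Listing all of the Walsh averages by hand is tedious, but they are trivial to generate with software; a short R sketch for these data:

    x <- c(4.51, 4.59, 4.90, 4.93, 6.80, 5.08, 5.67)
    avgs  <- outer(x, x, "+") / 2                       # all (xi + xj)/2
    walsh <- sort(avgs[lower.tri(avgs, diag = TRUE)])   # keep i <= j: n(n+1)/2 = 28 averages
    round(walsh, 3)                                     # reproduces the ordered list above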
The first few and the last few of these are pictured on a measurement axis in Figure 14.4.

[Figure 14.4 Plot of the data for Example 14.6; the annotation indicates that at level .0469, H0 is not rejected for μ0 in the indicated interval]
Because of the discreteness of the distribution of S+, α = .05 cannot be obtained exactly. The rejection region {0, 1, 2, 26, 27, 28} has α = .046, which is as close as possible to .05, so the level is approximately .05. Thus if the number of pairwise averages ≥ μ0 is between 3 and 25, inclusive, H0 is not rejected. As displayed in Figure 14.4, the approximate 95% CI for μ is (4.59, 5.94); the endpoints are the 3rd-lowest and 3rd-highest (3rd and 26th ordered) Walsh averages. ■

In general, once the pairwise averages are ordered from smallest to largest, the endpoints of the Wilcoxon interval are two of the "extreme" averages. To express this precisely, let the smallest pairwise average be denoted by x̄(1), the next smallest by x̄(2), …, and the largest by x̄(n(n+1)/2).
PROPOSITION
If the level α Wilcoxon signed-rank test for H0: μ = μ0 versus Ha: μ ≠ μ0 is to reject H0 if either s+ ≥ c or s+ ≤ n(n + 1)/2 − c, then a 100(1 − α)% CI for μ is

(x̄(n(n+1)/2−c+1), x̄(c))        (14.2)
In words, the interval extends from the kth smallest pairwise average to the kth largest average, where k = n(n + 1)/2 − c + 1. Appendix Table A.12 gives the values of c that correspond to the usual confidence levels for n = 5, 6, …, 25. In R, the wilcox.test function applied to a vector x containing the sample data will return this signed-rank interval if the user includes the option conf.int = T.

Example 14.7 (Example 14.6 continued) For n = 7, an 89.1% interval (approximately 90%) is obtained by using c = 24, since the rejection region {0, 1, 2, 3, 4, 24, 25, 26, 27, 28} has α = .109. The interval is (x̄(28−24+1), x̄(24)) = (x̄(5), x̄(24)) = (4.72, 5.85), which extends from the fifth smallest to the fifth largest pairwise average. ■

The derivation of the signed-rank interval depended on having a single sample from a continuous symmetric distribution with mean (median) μ. When the data is paired, the interval constructed from the Walsh averages of the differences d1, d2, …, dn is a CI for the mean (median) difference μD.

For n > 20, the large-sample approximation to the Wilcoxon test based on standardizing S+ gives an approximation to c in (14.2). The result for a 100(1 − α)% interval is

c ≈ n(n + 1)/4 + z_{α/2} √(n(n + 1)(2n + 1)/24)

The efficiency of the Wilcoxon interval relative to the t interval is roughly the same as that for the Wilcoxon test relative to the t test. In particular, for large samples when the underlying population is normal, the Wilcoxon interval will tend to be slightly longer than the t interval, but if the population is quite nonnormal (e.g., symmetric but with heavy tails), then the Wilcoxon interval will tend to be much shorter than the t interval.

Exercises: Section 14.2 (11–24)

11. Reconsider the situation described in Exercise 34(a) of Section 9.2, and use the Wilcoxon test with α = .05 to test the specified hypotheses.

12. Use the Wilcoxon test to analyze the data given in Example 9.12.

13. The following pH measurements at a proposed water intake site appear in the 2011 report "Sacramento River Water Quality Assessment for the Davis-Woodland Water Supply Project":

7.20   7.24   7.31   7.38   7.45   7.60   7.86
Use the Wilcoxon signed-rank test to determine whether the true mean pH level at this site exceeds 7.3 with significance level .05.
14. A random sample of 15 automobile mechanics certified to work on a certain type of car was selected, and the time (in minutes) necessary for each one to diagnose a particular problem was determined, resulting in the following data:

30.6   30.1   15.6   26.7   27.1   25.4   35.0   30.8
31.9   53.2   12.5   23.2    8.8   24.9   30.2
Use the Wilcoxon test at significance level .10 to decide whether the data suggests that true average diagnostic time is less than 30 min.

15. Both a gravimetric and a spectrophotometric method are under consideration for determining phosphate content of a particular material. Twelve samples of the material are obtained, each is split in half, and a determination is made on each half using one of the two methods, resulting in the following data:

Sample    1      2      3      4      5      6      7      8      9     10     11     12
Grav.   54.7   58.5   66.8   46.1   52.3   74.3   92.5   40.2   87.3   74.8   63.2   68.5
Spec.   55.0   55.7   62.9   45.5   51.1   75.4   89.6   38.4   86.8   72.5   62.3   66.0

Use the Wilcoxon test to decide whether one technique gives on average a different value than the other technique for this type of material.

16. Fifty-three participants performed a series of tests in which a small "zap" was delivered to one compass point, selected at random, on a joystick in their hand. In one setting, subjects were told to move the joystick in the same direction of the zap; in another setting, they were told to move the joystick in the direction opposite to the zap. A series of trials was performed under each setting, and the number of correct moves under both settings was recorded. ("An Experimental Setup to Test Dual-Joystick Directional Responses to Vibrotactile Stimuli," IEEE Trans. on Haptics 2018.)
a. The authors performed a Wilcoxon signed-rank test on the paired differences (number correct in same direction minus number correct in opposite direction, which is discrete but won't greatly impact the analysis). The resulting test statistic value was s+ = 695. Test H0: μD = 0 versus Ha: μD ≠ 0 at the .10 significance level using the large-sample version of the test.
b. The same article also explored whether participants would make more correct moves if the "zaps" were instead delivered by a glove they wore while grasping the joystick. Again for n = 53 subjects, the difference (number correct with joystick minus number correct with glove) was computed, and the signed-rank test statistic value was s+ = 136. Perform a two-sided large-sample test, and interpret your findings.
17. The article "Punishers Benefit from Third-Party Punishment in Fish" (Science, 8 Jan 2010: 171) describes an experiment meant to simulate behavior of cleaner fish, so named because they eat parasites off of "client" fish but will sometimes take a bite of the client's mucus instead. (Cleaner fish prefer the mucus over the parasites.) Eight female cleaner fish were provided bits of prawn (preferred food) and fish-flake (less preferred), then a male cleaner fish chased them away. One minute later, the process was repeated.
a. The following data on the amount of prawn eaten by each female in the two rounds is consistent with information in the article. Use Wilcoxon's signed-rank test to determine whether female fish eat less of their preferred food, on average, after having been chased by a male.

Female       1      2      3      4      5      6      7      8
1st trial  .207   .215   .103   .182   .282   .228   .152   .293
2nd trial  .164   .033   .092   .003   .115   .250   .056   .247
b. The researchers recorded the same information on the male cleaner fish (the chasers), resulting in a signed-rank test statistic value of s+ = 28. Does this provide evidence that males increase their average preferred food consumption the second time around?

18. The signed-rank statistic can be represented as S+ = 1·U1 + 2·U2 + ⋯ + n·Un, where Ui = 1 if the sign of the (xi − μ0) with the ith largest absolute magnitude is positive (in which case i is included in S+) and Ui = 0 if this value is negative (i = 1, 2, 3, …, n). Furthermore, when H0 is true, the Ui's are independent Bernoulli rvs with p = .5.
a. Use this representation to obtain the mean and variance of S+ when H0 is true. [Hint: The sum of the first n positive integers is n(n + 1)/2, and the sum
of the squares of the first n positive integers is n(n + 1)(2n + 1)/6.]
b. A particular type of steel beam has been designed to have a compressive strength (lb/in²) of at least 50,000. An experimenter obtained a random sample of 25 beams and determined the strength of each one, resulting in the following data (expressed as deviations from 50,000):

 −10    −27     36    −55     73    −77    −81
  90    −95    −99    113   −127   −129    136
−150   −155   −159    165   −178   −183   −192
−199   −212   −217   −229

Carry out a test using a significance level of approximately .01 to see if there is strong evidence that the design condition has been violated.

19. Reconsider the calorie-burning data in Exercise 10 from Section 14.1.
a. Utilize the Wilcoxon signed-rank procedure to test H0: μD = 0 versus Ha: μD < 0 for the population of REE differences (IF minus standard diet). What assumption is required here that was not necessary for the sign test?
b. Now apply the paired t test to the hypotheses in part (a). What extra assumptions are required?
c. Compare the results of the three tests (sign, signed-rank, and paired t) and discuss what you find.

20. Suppose that observations X1, X2, …, Xn are made on a process at times 1, 2, …, n. On the basis of this data, we wish to test
H0: the Xi's constitute an independent and identically distributed sequence
versus
Ha: Xi+1 tends to be larger than Xi for i = 1, …, n (an increasing trend)
Suppose the Xi's are ranked from 1 to n. Then when Ha is true, larger ranks tend to occur later in the sequence, whereas if H0 is true, large and small ranks tend to be mixed together. Let Ri be the rank of Xi and consider the test statistic D = Σ_{i=1}^{n} (Ri − i)². Then small values of D give support to Ha (e.g., the smallest value is 0 for R1 = 1, R2 = 2, …, Rn = n), so H0 should be rejected in favor of Ha if d ≤ c. When H0 is true, any sequence of ranks has probability 1/n!. Use this to find c for which the test has a level as close to .10 as possible in the case n = 4. [Hint: List the 4! rank sequences, compute d for each one, and then obtain the null distribution of D. See the Lehmann book in the bibliography for more information.]

21. Obtain the 99% signed-rank interval for true average pH using the data in Exercise 13.

22. Obtain a 95% signed-rank interval for true average diagnostic time using the data in Exercise 14. [Hint: Try to compute only those pairwise averages having relatively small or large values, rather than all 120 averages.]

23. Obtain a CI for μD of Exercise 17 using the data given there; your confidence level should be roughly 95%.

24. The following observations are copper contents (%) for a sample of Bidri artifacts (a type of ancient Indian metal handicraft) at the Victoria and Albert Museum in London ("Enigmas of Bidri," Surface Engr. 2005: 333–339): 2.4, 2.7, 5.3, and 10.1. What confidence levels are achievable for this sample size using the signed-rank interval? Select an appropriate confidence level and compute the interval.

14.3 Two-Sample Rank-Based Inference
When at least one of the sample sizes in a two-sample problem is small, the t test requires the assumption of normality (at least approximately). There are situations, though, in which an investigator would want to use a test that is valid even if the underlying distributions are quite nonnormal. We now describe such a test, called the Wilcoxon rank-sum test. An alternative name for the procedure is the Mann–Whitney test, although the Mann–Whitney test statistic is sometimes
expressed slightly differently from that of the Wilcoxon test. The Wilcoxon test procedure is "distribution-free" because it will have the desired level of significance for a very large collection of underlying distributions rather than just the normal distribution.

ASSUMPTIONS
X1, …, Xm and Y1, …, Yn are two independent random samples from continuous distributions with means μ1 and μ2, respectively. The X and Y distributions have the same shape and spread, the only possible difference between the two being in the values of μ1 and μ2.
When H0: μ1 − μ2 = Δ0 is true, the X distribution is shifted by the amount Δ0 to the right of the Y distribution; i.e., fX(x) = fY(x − Δ0). When H0 is false, the shift is by an amount other than Δ0; note, though, that we still assume the two distributions only differ by a shift in means. This assumption can be difficult to verify in practice, but the Wilcoxon rank-sum test is nonetheless a popular approach to comparisons based on small samples.

A Rank-Based Test Statistic

Let's first test H0: μ1 − μ2 = 0; then the X's and Y's are identically distributed when H0 is true. Consider the case n1 = 3, n2 = 4. Denote the observations by x1, x2, and x3 (the first sample) and y1, y2, y3, and y4 (the second sample). If μ1 is actually much larger than μ2, then most of the observed x's will be larger than the observed y's. However, if H0 is true, then the values from the two samples should be intermingled. The test statistic will quantify how much intermingling there is in the two samples.

To begin, pool the x's and y's into a single combined sample of size m + n = 7 and rank these observations from smallest to largest, with the smallest receiving rank 1 and the largest, rank 7. If most of the large ranks (or most of the small ranks) were associated with x observations, we would begin to doubt H0. This suggests the test statistic

W = the sum of the ranks in the combined sample associated with X observations        (14.3)
For the values of m and n under consideration, the smallest possible value of W is 1 + 2 + 3 = 6 (if all three x's are smaller than all four y's), and the largest possible value is 5 + 6 + 7 = 18 (if all three x's are larger than all four y's). As an example, suppose x1 = −3.10, x2 = 1.67, x3 = 2.01, y1 = 5.27, y2 = 1.89, y3 = 3.86, and y4 = .19. Then the pooled ordered sample is −3.10, .19, 1.67, 1.89, 2.01, 3.86, and 5.27. The X ranks for this sample are 1 (for −3.10), 3 (for 1.67), and 5 (for 2.01), giving w = 1 + 3 + 5 = 9.

The test procedure based on the statistic (14.3) requires knowledge of the null distribution of W. When H0 is true, all seven observations come from the same population. This means that under H0, any possible triple of ranks associated with the three x's—such as (1, 4, 5), (3, 5, 6), or (5, 6, 7)—has the same probability as any other possible rank triple. Since there are (7 choose 3) = 35 possible rank triples, under H0 each rank triple has probability 1/35. From a list of all 35 rank triples and the W value associated with each, the null distribution of W can immediately be determined. For example, there are four rank triples for which W = 11—(1, 3, 7), (1, 4, 6), (2, 3, 6), and (2, 4, 5)—so P(W = 11) = 4/35. The complete sampling distribution appears in Table 14.3 and Figure 14.5.
Table 14.3 Probability distribution of W when H0 is true (n1 = 3, n2 = 4)

w        6      7      8      9     10     11     12     13     14     15     16     17     18
p(w)   1/35   1/35   2/35   3/35   4/35   4/35   5/35   4/35   4/35   3/35   2/35   1/35   1/35
[Figure 14.5 Sampling distribution of W when H0 is true (n1 = 3, n2 = 4)]
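The distribution in Table 14.3 can also be generated by enumerating the 35 equally likely rank triples; a quick R sketch:

    triples <- combn(7, 3)              # each column is one possible set of X ranks
    w <- colSums(triples)               # W = sum of the X ranks
    table(w) / choose(7, 3)             # matches Table 14.3
    # base R tabulates the same distribution in Mann-Whitney form, U = W - m(m+1)/2:
    dwilcox(0:12, m = 3, n = 4)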
Suppose we wished to test H0: μ1 − μ2 = 0 against Ha: μ1 − μ2 < 0 based on the example data given previously, for which the observed value of W was 9. Then the P-value associated with the test is the chance of observing a W-value of 9 or lower, assuming H0 is true. Using Table 14.3,

P-value = P(W ≤ 9 when H0 is true) = P(W = 6, 7, 8, or 9) = 7/35 = .2
We would thus not reject H0 at any reasonable significance level.

Constructing the null sampling distribution of W manually can be tedious, since there are generally (m + n choose n) possible arrangements of ranks to consider. Software will provide w and the associated P-value quickly, though various packages perform the P-value calculation slightly differently.

The Wilcoxon Rank-Sum Test

The null hypothesis H0: μ1 − μ2 = Δ0 is handled by subtracting Δ0 from each Xi and using the (Xi − Δ0)'s as the Xi's were previously used. The smallest possible value of the statistic W is 1 + 2 + ⋯ + m = m(m + 1)/2, which occurs when the (Xi − Δ0)'s are all to the left of the Y sample. The largest possible value of W occurs when the (Xi − Δ0)'s lie entirely to the right of the Y's; in this case, W = (n + 1) + ⋯ + (m + n) = (sum of first m + n integers) − (sum of first n integers), which gives m(m + 2n + 1)/2. As with the special case m = 3, n = 4, the null distribution of W is symmetric about the value that is halfway between the smallest and largest values; this middle value is m(m + n + 1)/2. Because of this symmetry, probabilities involving lower-tail critical values can be obtained from corresponding upper-tail values.
WILCOXON RANK-SUM TEST
Null hypothesis: H0: μ1 − μ2 = Δ0
Test statistic value: w = Σ_{i=1}^{m} ri, where ri = rank of (xi − Δ0) in the combined sample of m + n (x − Δ0)'s and y's

Alternative Hypothesis        Rejection Region for Level α Test
Ha: μ1 − μ2 > Δ0              w ≥ c1
Ha: μ1 − μ2 < Δ0              w ≤ m(m + n + 1) − c1
Ha: μ1 − μ2 ≠ Δ0              either w ≥ c or w ≤ m(m + n + 1) − c

where P(W ≥ c1 when H0 is true) ≈ α and P(W ≥ c when H0 is true) ≈ α/2
Because W has a discrete probability distribution, there will not usually exist a critical value corresponding exactly to one of the usual significance levels. Appendix Table A.13 gives upper-tail critical values for probabilities closest to .05, .025, .01, and .005, from which level .05 or .01 one- and two-tailed tests can be obtained. The table gives information only for 3 ≤ m ≤ n ≤ 8. To use the table, the X and Y samples should be labeled so that m ≤ n. Ties are handled as suggested for the signed-rank test in the previous section.

Example 14.8 The urinary fluoride concentration (parts per million) was measured both for a sample of livestock grazing in an area previously exposed to fluoride pollution and for a similar sample grazing in an unpolluted region:

Polluted      21.3 [11]   18.7 [7]   23.0 [12]   17.1 [3]   16.8 [2]   20.9 [10]   19.7 [8]
Unpolluted    14.2 [1]    18.3 [5]   17.2 [4]    18.4 [6]   20.0 [9]
The values in brackets indicate the rank of each observation in the combined sample of 12 values. Does the data indicate strongly that the true average fluoride concentration for livestock grazing in the polluted region is larger than for the unpolluted region? Let's use the Wilcoxon rank-sum test at level α = .01.
1. The sample sizes here are 7 and 5. To obtain m ≤ n, label the unpolluted observations as the x's (x1 = 14.2, …, x5 = 20.0) and the polluted observations as the y's. Thus the parameters are
μ1 = the true average fluoride concentration without pollution
μ2 = the true average concentration with pollution
2. The hypotheses are
H0: μ1 − μ2 = 0
Ha: μ1 − μ2 < 0 (pollution is associated with an increase in concentration)
3. In order to perform the Wilcoxon rank-sum test, we will assume that the fluoride concentration distributions for these two livestock populations have the same shape and spread, but possibly differ in mean.
4. The test statistic value is w = Σ_{i=1}^{5} ri, where ri = the rank of xi among all 12 observations.
5. From Appendix Table A.13 with m = 5 and n = 7, P(W ≥ 47 when H0 is true) ≈ .01. The critical value for the lower-tailed test is therefore m(m + n + 1) − 47 = 5(13) − 47 = 18; H0 will now be rejected if w ≤ 18.
6. The computed W is w = r1 + r2 + ⋯ + r5 = 1 + 5 + 4 + 6 + 9 = 25.
7. Since 25 is not ≤ 18, H0 is not rejected at (approximately) level .01. The data does not provide convincing statistical evidence at the .01 significance level that average fluoride concentration is higher among livestock grazing in the polluted region. ■

Alternative Versions of the Rank-Sum Test

Appendix Table A.13 allows us to perform the Wilcoxon rank-sum test provided that m and n are both ≤ 8. For larger sample sizes, a central limit theorem for nonindependent variables can be used to show that W has an approximately normal distribution. (The genesis of a bell-shaped curve can even be seen in Figure 14.5 where m = 3 and n = 4.) When H0 is true, the mean and variance of W (see Exercise 32) are

E(W) = m(m + n + 1)/2        V(W) = mn(m + n + 1)/12
These suggest that when m > 8 and n > 8 the rank-sum test may be performed using the test statistic

Z = [W − m(m + n + 1)/2] / √(mn(m + n + 1)/12)

which has approximately a standard normal distribution when H0 is true.

Some statistical software packages, including R, use an alternative formulation called the Mann–Whitney U test. Consider all possible pairs (Xi, Yj), of which there are mn. Define a test statistic U by

U = the number of (Xi, Yj) pairs for which Xi − Yj > Δ0        (14.4)
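For the small illustrative samples used earlier (where w = 9 with m = 3 and n = 4), the two forms of the statistic can be compared directly; a quick R check with our own object names:

    x <- c(-3.10, 1.67, 2.01)
    y <- c(5.27, 1.89, 3.86, 0.19)
    w <- sum(rank(c(x, y))[1:3])    # W = sum of the X ranks in the combined sample: 9
    u <- sum(outer(x, y, ">"))      # U = number of (Xi, Yj) pairs with Xi - Yj > 0: 3
    c(w, u, w - 3 * (3 + 1) / 2)    # u equals w - m(m+1)/2

R's wilcox.test(x, y) reports this U form of the statistic, although it labels it W in its output.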
It can be shown (Exercise 34) that the test statistics U and W are related by U = W − m(m + 1)/2. The mean and variance of U can thus be obtained from the corresponding expressions for W, and the normal approximation to W applies equally to U. Finally, when using the normal approximation, some slightly tedious algebra can be used to rearrange the standardized version of W so that it looks similar to the two-sample z test statistic:

Z = [W − E(W)] / √V(W) = (R̄1 − R̄2) / √(σ²/m + σ²/n)

where R̄1 and R̄2 denote the average ranks for the two samples, and σ² = (m + n)(m + n + 1)/12.

Efficiency of the Wilcoxon Rank-Sum Test

When the distributions being sampled are both normal with σ1 = σ2 and therefore have the same shapes and spreads, either the pooled t test or the Wilcoxon test can be used. (The two-sample t test assumes normality but not equal standard deviations, so assumptions underlying its use are more restrictive in one sense and less in another than those for Wilcoxon's test.) In this situation, the pooled t test is best among all possible tests in the sense of maximizing power for any fixed α. However, an investigator can never be absolutely certain that underlying assumptions are satisfied. It is therefore relevant to ask (1) how much is lost by using Wilcoxon's test rather than the pooled t test when the distributions are normal with equal variances and (2) how W compares to T in nonnormal situations.

The notion of asymptotic relative efficiency was discussed in the previous section in connection with the one-sample t test and Wilcoxon signed-rank test. The results for the two-sample tests are the same as those for the one-sample tests. When normality and equal variances both hold, the rank-sum
test is approximately 95% as efficient as the pooled t test in large samples. That is, the t test will give the same error probabilities as the Wilcoxon test using slightly smaller sample sizes. On the other hand, the Wilcoxon test will always be at least 86% as efficient as the pooled t test and may be much more efficient if the underlying distributions are very nonnormal, especially with heavy tails. The comparison of the Wilcoxon test with the two-sample (unpooled) t test is less clear-cut. The t test is not known to be the best test in any sense, so it seems safe to conclude that as long as the population distributions have similar shapes and spreads, the behavior of the Wilcoxon test should compare quite favorably to the two-sample t test.
Lastly, we note that power calculations for the Wilcoxon test are quite difficult. This is because the distribution of W when H0 is false depends not only on μ1 − μ2 but also on the shape of the two distributions. For most underlying distributions, the nonnull distribution of W is virtually intractable. This is why statisticians developed asymptotic relative efficiency as a means of comparing tests. With the capabilities of modern-day computer software, another approach to power calculations is to carry out a simulation experiment.

The Wilcoxon Rank-Sum Interval
Similar to the signed-rank interval of Section 14.2, a CI for μ1 − μ2 based on the Wilcoxon rank-sum test is obtained by determining, for fixed xi's and yj's, the set of all Δ0 values for which H0: μ1 − μ2 = Δ0 is not rejected. This is easiest to do if we use the Mann–Whitney U statistic (14.4), according to which H0 should be rejected if the number of (xi − yj)'s exceeding Δ0 is either too small or too large. This, in turn, suggests that we compute xi − yj for each i and j and order these mn differences from smallest to largest. Then if the null value Δ0 is neither smaller than most of the differences nor larger than most, H0: μ1 − μ2 = Δ0 is not rejected. Varying Δ0 now shows that a CI for μ1 − μ2 will have as its lower endpoint one of the ordered (xi − yj)'s, and similarly for the upper endpoint.

PROPOSITION
Let x1, …, xm and y1, …, yn be the observed values in two independent samples from continuous distributions that differ only in location (and not in shape or spread). With dij = xi − yj and the ordered differences denoted by dij(1), dij(2), …, dij(mn), the general form of a 100(1 − α)% CI for μ1 − μ2 is

$$\left(d_{ij(mn-c+1)},\; d_{ij(c)}\right) \qquad (14.5)$$
where c is the critical value for the two-tailed level α Wilcoxon rank-sum test. Notice that the form of the Wilcoxon rank-sum interval (14.5) is very similar to the Wilcoxon signed-rank interval (14.2); (14.2) uses pairwise averages from a single sample, whereas (14.5) uses pairwise differences from two samples. Appendix Table A.14 gives values of c for selected values of m and n. In R, the wilcox.test function applied to vectors x and y containing the sample data will return this rank-sum interval if the user includes the option conf.int = T.
Example 14.9 The article "Some Mechanical Properties of Impregnated Bark Board" (Forest Products J.) reports the following data on maximum crushing strength (psi) for a sample of epoxy-impregnated bark board and for a sample of bark board impregnated with another polymer:

Epoxy (x's)   10,860   11,120   11,340   12,130   14,380   13,070
Other (y's)    4590     4850     6510     5640     6390
Let’s obtain a 95% CI for the true average difference in crushing strength between the epoxyimpregnated board and the other type of board.
From Appendix Table A.14, since the smaller sample size is 5 and the larger sample size is 6, c = 26 for a confidence level of approximately 95%, and mn – c + 1 = (5)(6) − 26 + 1 = 5. All 30 dij’s appear in Table 14.4. The five smallest dij’s are dij(1) = 4350, 4470, 4610, 4730, and dij(5) = 4830; and the five largest dij’s are (in descending order) 9790, 9530, 8740, 8480, and 8220. Thus the CI is (dij(5), dij(26)) = (4830, 8220). Table 14.4 Differences (dij) for the rank-sum interval in Example 14.9
                          yj
xi          4590      4850      5640      6390      6510
10,860      6270      6010      5220      4470      4350
11,120      6530      6270      5480      4730      4610
11,340      6750      6490      5700      4950      4830
12,130      7540      7280      6490      5740      5620
13,070      8480      8220      7430      6680      6560
14,380      9790      9530      8740      7990      7870
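The same interval can be obtained in R with the wilcox.test call mentioned earlier. This minimal sketch uses the Example 14.9 data; because achievable exact confidence levels are discrete, R's reported level will be close to, but not exactly, 95%.

```r
epoxy <- c(10860, 11120, 11340, 12130, 13070, 14380)
other <- c(4590, 4850, 5640, 6390, 6510)

wilcox.test(epoxy, other, conf.int = TRUE, conf.level = 0.95)
# The reported confidence interval for the location shift should essentially
# reproduce the hand calculation (4830, 8220) based on the ordered d_ij's.
```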
■
When m and n are both large, the aforementioned normal approximation can be used to derive a large-sample approximation for the value c in interval (14.5). The result is

$$c \approx \frac{mn}{2} + z_{\alpha/2}\sqrt{\frac{mn(m+n+1)}{12}}$$
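As a quick numerical illustration of this approximation (the sample sizes below are hypothetical), c can be computed directly in R and rounded to an integer:

```r
m <- 20; n <- 25; alpha <- 0.05
c_approx <- m * n / 2 + qnorm(1 - alpha / 2) * sqrt(m * n * (m + n + 1) / 12)
round(c_approx)   # use the (mn - c + 1)th and cth ordered differences as CI endpoints
```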
As with the signed-rank interval, the rank-sum interval (14.5) is quite efficient with respect to the t interval; in large samples, (14.5) will tend to be only a bit longer than the t interval when the underlying populations are normal and may be considerably shorter than the t interval if the underlying populations have heavier tails than do normal populations. And once again, the actual confidence level for the t interval may be quite different from the nominal level in the presence of substantial nonnormality.

Exercises: Section 14.3 (25–36)

25. In an experiment to compare the bond strength of two different adhesives, each adhesive was used in five bondings of two surfaces, and the force necessary to separate the surfaces was determined for each bonding. For adhesive 1, the resulting values were 229, 286, 245, 299, and 250, whereas the adhesive 2 observations were 213, 179, 163, 247, and 225. Let μi denote the true average bond strength of adhesive type i. Use the Wilcoxon rank-sum test at level .05 to test H0: μ1 = μ2 versus Ha: μ1 > μ2.
26. The accompanying data shows the alcohol content (percent) for random samples of 7 German beers and 8 domestic beers. Does the data suggest that German beers have a
different average alcohol content than those brewed in the USA? Use the Wilcoxon rank-sum test at a = .05. German 5.00 4.90 3.80 4.82 4.80 5.44 6.60 Domestic 4.85 5.04 4.20 4.10 4.50 4.70 4.30 5.50
27. A modification has been made to one assembly line for a particular automobile chassis. Because the modification involves extra cost, it will be implemented throughout all lines only if sample data strongly indicates that the modification has decreased true average assembly time by more than 1 h. Assuming that the assembly time distributions differ only with respect to location if at all, use the Wilcoxon ranksum test at level .05 on the accompanying
data (also in hours) to test the appropriate hypotheses. Original process 8.6 5.1 4.5 5.4 6.3 6.6 5.7 8.5 Modified process 5.5 4.0 3.8 6.0 5.8 4.9 7.0 5.7
28. Can video games improve balance among the elderly? The article “The Effect of Virtual Reality Gaming on Dynamic Balance in Older Adults” (Age and Ageing 2012: 549– 552) reported an experiment in which 34 senior citizens were randomly assigned to one of two groups: (1) 16 who engaged in a six-week exercise regimen using the Wii Fit Balance Board (WBB) and (2) 18 who were told not to vary their daily physical activity during that interval. The accompanying data on improvement in 8-foot-up-and-go time (sec), a standard test of agility and balance, is consistent with information in the article. Test whether the true average improvement is greater using the WBB than under control conditions at the .05 significance level. −1.9 −0.8 0.1 0.5 0.6 0.7 0.8 0.9 1.1 1.2 1.5 2.0 2.1 2.7 3.2 3.7 Control −2.6 −2.2 −2.1 −1.8 −1.4 −1.1 −0.7 −0.6 −0.3 −0.1 0.0 0.3 0.4 1.0 1.3 2.3 2.4 4.5
WBB
29. Reconsider the situation described in Exercise 110 of Chapter 10 and the following Minitab output (the Greek letter eta is used to denote a median).

Mann-Whitney Confidence Interval and Test
good   N = 8   Median = 0.540
poor   N = 8   Median = 2.400
Point estimate for ETA1 − ETA2 is −1.155
95.9% CI for ETA1 − ETA2 is (−3.160, −0.409)
W = 41.0
Test of ETA1 = ETA2 versus ETA1 < ETA2 is significant at 0.0027
a. Verify that Minitab’s test statistic value is correct. b. Carry out an appropriate test of hypotheses using a significance level of .01. 30. The article “Opioid Use and Storage Patterns by Patients after Hospital Discharge
following Surgery” (PLoS ONE 2016) reported a study of 30 women who just had Caesarian sections. The women were classified into two groups: 14 with a high need for pain medicine post-surgery and 16 with a low need. The total oral morphine equivalent prescribed at discharge was determined for each woman, and the resulting Wilcoxon rank-sum test statistic was W = 249.5. (The .5 comes from ties in the data, but that won’t affect the P-value much.) Test the hypotheses H0: l1 = l2 versus Ha: l1 6¼ l2 at the .05 significance level. Does it appear that physicians prescribe opioids in proportion to patients’ pain-control needs? 31. The article “Mutational Landscape Determines Sensitivity to PD-1 Blockade in Non-Small Cell Lung Cancer” (Science, 3 April 2015) described a study of 16 cancer patients taking the drug Keytruda. For each patient, the number of nonsynonymous mutations per tumor was determined; higher numbers indicate better drug effectiveness. The data is separated into patients that showed a durable clinical benefit (partial or stable response lasting >6 months) and those with no durable benefit. Use the methods of this section to determine whether patients experiencing durable clinical benefit tend to have a higher average number of nonsynonymous mutations than those with no durable benefit (use a = .05). Durable 170 228 300 302 315 490 774 benefit No durable 11 28 46 115 148 161 180 300 625 benefit
32. The Wilcoxon rank-sum statistic can be represented as W = R1 + R2 + ⋯ + Rm, where Ri is the rank of Xi − Δ0 among all m + n such differences. When H0 is true, each Ri is equally likely to be one of the first m + n positive integers; that is, Ri has a discrete uniform distribution on the values 1, 2, 3, …, m + n.
a. Determine the mean value of each Ri when H0 is true and then show that the mean value of W is m(m + n + 1)/2. [Hint: The sum of the first k positive integers is k(k + 1)/2.]
b. The variance of each Ri is easily determined. However, the Ri's are not independent random variables because, for example, if m = n = 10 and we are told that R1 = 5, then R2 must be one of the other 19 integers between 1 and 20. However, if a and b are any two distinct positive integers between 1 and m + n inclusive, it follows that P(Ri = a and Rj = b) = 1/[(m + n)(m + n − 1)] since two integers are being sampled without replacement from among 1, 2, …, m + n. Use this fact to show that Cov(Ri, Rj) = −(m + n + 1)/12, and then show that the variance of W is mn(m + n + 1)/12.
33. The article "Controlled Clinical Trial of Canine Therapy Versus Usual Care to Reduce Patient Anxiety in the Emergency Department" (PLoS ONE 2019) reported on an experiment in which 80 adult hospital patients were randomly assigned to either 15 min with a certified therapy dog (m = 40) or usual care (n = 40). Each patient's change in self-reported pain, depression, and anxiety (pre-treatment minus post treatment) was recorded. The researchers employed a rank-sum test to
compare the two treatment groups on each of these three outcomes; the resulting test statistic values appear below. Change in:
Change in:   Pain    Depression    Anxiety
w =          1475    1316          1171
Test the hypotheses H0: μ1 − μ2 = 0 against Ha: μ1 − μ2 < 0 (1 = dog therapy treatment, 2 = control) for each of the three response variables at the .01 significance level. What can be said about the chance of committing at least one type I error among the three tests?
34. Refer to Exercise 32. Sort the ranks of the Xi's, so that R1 < R2 < ⋯ < Rm.
a. In terms of the R's, how many of the Yj's are less than the smallest Xi? Less than the second-smallest Xi?
b. When Δ0 = 0, the Mann–Whitney test statistic is U = the number of (Xi, Yj) pairs for which Xi > Yj. Use part (a) to express U as a sum, then show this sum is equal to W − m(m + 1)/2.
c. Use the mean and variance of W to determine E(U) and V(U).
35. Obtain the 90% rank-sum CI for μ1 − μ2 using the data in Exercise 25.
36. Obtain a 95% CI for μ1 − μ2 using the data in Exercise 27. Is your interval consistent with the result of the hypothesis test in that exercise?
14.4 Nonparametric ANOVA
The analysis of variance (ANOVA) procedures in Chapter 11 for comparing I population or treatment means assumed that every population/treatment distribution is normal with the same standard deviation, so that the only potential difference is their means μ1, …, μI. Here we present methods for testing equality of the μi's that apply to a broader class of population distributions.

The Kruskal–Wallis Test
The Kruskal–Wallis test extends the Wilcoxon rank-sum test of the previous section to the case of three or more populations or treatments. The I population/treatment distributions under consideration are assumed to be continuous, have the same shape and spread, but possibly different means. More formally, with fi denoting the pdf of the ith distribution, we assume that f1(x − μ1) = f2(x − μ2) = ⋯ = fI(x − μI), so that the distributions differ by (at most) a shift. Following the notation of Chapter 11,
let Ji = the ith sample size, $n = \sum J_i$ = the total number of observations in the data set, and Xij = the jth observation in the ith sample (j = 1, …, Ji; i = 1, …, I). As in the rank-sum test, we replace each observation Xij with its rank, Rij, among all n observations. So, the smallest observation across all samples receives rank 1, the next-smallest rank 2, and so on through n.
Example 14.10 Diabetes and its associated health issues among children are of ever-increasing concern worldwide. The article "Clinical and Metabolic Characteristics among Mexican Children with Different Types of Diabetes Mellitus" (PLoS ONE, Dec. 16, 2016) reported on an in-depth study of children with one of four types of diabetes: (1) type 1 autoimmune, (2) type 2, (3) type 1 idiopathic (the most common type in children), and (4) what the researchers called "type 1.5." (The last category describes children who exhibit characteristics consistent with both type 1 and type 2; the authors note that the American Diabetes Association does not recognize such a category.) To illustrate the Kruskal–Wallis method, presented here is a subset of the triglyceride measurements (mmol/L) on these children, along with their associated ranks in brackets.

Diabetes group    Triglyceride level (mmol/L)
1                 1.06 [5]    1.94 [11]   1.07 [6]
2                 1.30 [9]    2.08 [12]   2.15 [13]
3                 1.08 [7]    0.45 [1]    0.85 [2]    1.13 [8]    2.28 [14]   0.87 [3]
4                 1.39 [10]   0.89 [4]
In this example, I = 4 (four populations), J1 = J2 = 3, J3 = 6, J4 = 2, and n = 14. The first observation is x11 = 1.06 with associated rank r11 = 5. ■
When H0: μ1 = ⋯ = μI is true, the I population/treatment distributions are identical, and so the Xij's form a random sample from a single population distribution. It follows that each Rij is uniformly distributed on the integers 1, 2, …, n, so that E(Rij) = (n + 1)/2 for every i and j when H0 is true. If we let $\bar R_i$ denote the mean of the ranks in the ith sample, then

$$E(\bar R_i) = \frac{1}{J_i}\sum_{j=1}^{J_i} E(R_{ij}) = \frac{n+1}{2}$$

Moreover, regardless of whether H0 is true, the "grand mean" of all n ranks is (n + 1)/2, the average of the first n positive integers. Similar to the treatment sum of squares SSTr from one-way ANOVA, the Kruskal–Wallis test statistic, denoted by H, quantifies "between-groups" variability by measuring how much the $\bar R_i$'s differ from the grand mean:

$$H = \frac{12}{n(n+1)}\,\mathrm{SSTr} = \frac{12}{n(n+1)}\sum_{i=1}^{I}\sum_{j=1}^{J_i}\left(\bar R_i - \bar R\right)^2 = \frac{12}{n(n+1)}\sum_{i=1}^{I} J_i\left(\bar R_i - \frac{n+1}{2}\right)^2 \qquad (14.6)$$
When the null hypothesis is true, each $\bar R_i$ will be close to its expected value (the grand mean), whereas when H0 is false certain samples should have an overabundance of high ranks and others too many low ranks, resulting in a larger value of H. Even for small Ji's, the exact null distribution of H in Expression (14.6) is unwieldy. Thankfully, when the sample sizes are not too small, the approximate sampling distribution of H under the null hypothesis is known.
KRUSKAL–WALLIS TEST
Null hypothesis: H0: μ1 = ⋯ = μI
Alternative hypothesis: not all μi's are equal

Test statistic value: $h = \dfrac{12}{n(n+1)}\displaystyle\sum_{i=1}^{I} J_i\left(\bar r_i - \frac{n+1}{2}\right)^2$

where $\bar r_i$ denotes the average rank within the ith sample. When H0 is true, H has approximately a chi-squared distribution with I − 1 df. This approximation is reasonable provided that all Ji ≥ 5.
It should not be surprising that H has an approximate $\chi^2_{I-1}$ distribution. By the Central Limit Theorem, the averages $\bar R_i$ should be approximately normal, and so H is similar to the sum of squares of I standardized normal rvs. However, as in the distribution of sample variance S2, these rvs are not independent—the sum of all ranks is fixed—and that one constraint costs one degree of freedom.
Example 14.11 (Example 14.10 continued) Though the sample sizes in our illustrative example are a bit too small to meet the requirements of the Kruskal–Wallis test, we proceed with the rest of the test procedure on this reduced data set. The rank averages are $\bar r_1 = (5 + 11 + 6)/3 = 7.33$, $\bar r_2 = 11.33$, $\bar r_3 = 5.83$, and $\bar r_4 = 7$. The grand mean of all 14 ranks is (n + 1)/2 = 7.5, and the test statistic value is

$$h = \frac{12}{14(14+1)}\sum_{i=1}^{4} J_i(\bar r_i - 7.5)^2 = \frac{12}{210}\left[3(7.33 - 7.5)^2 + \cdots + 2(7 - 7.5)^2\right] = 3.50$$

Comparing this to the critical value $\chi^2_{.05,4-1} = 7.815$, we would not reject H0 at the .05 significance level. Equivalently, the P-value is P(H ≥ 3.50 when H ~ $\chi^2_3$) = .321, again indicating no reason to reject H0. Details of the Kruskal–Wallis test for the full sample appear below. The small test statistic of 5.78 and relatively large P-value of .123 indicate that the mean triglyceride levels for these four populations of diabetic children are not statistically significantly different.

Diabetes group    Ji         r̄i
1                 25         64.0
2                 31         80.8
3                 63         62.0
4                 17         76.6
                  n = 136    r̄ = 68.5

h = 5.78     P-value = .123
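For the reduced data set used above, the hand computation of h can be reproduced with base R's kruskal.test function; this minimal sketch enters the four samples from Example 14.10 as a list (with no tied observations, the reported statistic matches the calculation by hand).

```r
g1 <- c(1.06, 1.94, 1.07)                       # diabetes group 1
g2 <- c(1.30, 2.08, 2.15)                       # group 2
g3 <- c(1.08, 0.45, 0.85, 1.13, 2.28, 0.87)     # group 3
g4 <- c(1.39, 0.89)                             # group 4

kruskal.test(list(g1, g2, g3, g4))
# Kruskal-Wallis chi-squared of about 3.50 on 3 df, P-value of about .32
```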
■
Expression (14.6) is sometimes written in other forms. For example, with Wi denoting the sum of the ranks for the ith sample (analogous to the Wilcoxon statistic W), it can be shown that

$$H = \frac{12}{n(n+1)}\sum_{i=1}^{I}\frac{1}{J_i}\left(W_i - \frac{J_i(n+1)}{2}\right)^2 = \frac{12}{n(n+1)}\sum_{i=1}^{I}\frac{W_i^2}{J_i} - 3(n+1) \qquad (14.7)$$
The quantity Ji(n + 1)/2 in the middle expression of (14.7) is the expected rank sum for the ith sample when H0 is true. The far-right expression in (14.7) is computationally quicker than (14.6). Alternatively, some software packages report the Kruskal–Wallis statistic in the form $H = \sum Z_i^2$, where

$$Z_i = \frac{\bar R_i - (n+1)/2}{r/\sqrt{J_i}}$$

and $r^2 = n(n+1)/12$.
In Chapter 11, we emphasized the need for two measures of variability, SSTr and SSE, with the latter measuring the variability within each sample. Why is SSE not required here? The fundamental ANOVA identity SST = SSTr + SSE still applies to the Rij's, but because the ranks are just a rearrangement of the integers 1 through n, the total sum of squares SST depends only on n and not on the raw data (Exercise 41). Thus, once we know n and SSTr, the other two sums of squares are completely determined.

Nonparametric ANOVA for a Randomized Block Design
The Kruskal–Wallis test is applicable to data resulting from a completely randomized design (independent random samples from I population or treatment distributions). Suppose instead that we have data from a randomized block experiment and wish to test the null hypothesis that the true population/treatment means are equal (i.e., "no treatment effect"). To test H0: μ1 = ⋯ = μI in this situation, the observations within each block are ranked from 1 to I, and then the average rank $\bar r_i$ is computed for each of the I treatments.
Example 14.12 The article "Modeling Cycle Times in Production Planning Models for Wafer Fabrication" (IEEE Trans. on Semiconductor Manuf. 2016: 153–167) reports on a study to compare three different linear programming models used in the simulation of factory processes: allocated cleaning function (ACF), fractional lead time (FLT), and simple rounding down (SRD). Simulations were run under five different demand representations, and the profit from the (simulated) manufacture of a particular product was determined. In this study, there are I = 3 treatments being compared using J = 5 blocks. The profit data presented in the article, along with the rank of each observation within its block and the rank average for each treatment, appears below.
                           Demand representation (block)
LP model    1              2              3              4              5              r̄i
ACF         $44,379 [1]    $69,465 [3]    $18,317 [2]    $69,981 [3]    $32,354 [3]    2.4
FLT         $47,825 [3]    $43,354 [1]    $17,512 [1]    $48,707 [2]    $30,993 [2]    1.8
SRD         $47,446 [2]    $53,393 [2]    $27,554 [3]    $45,435 [1]    $25,662 [1]    1.8
■ Within each block, the average of the ranks 1, …, I is simply (I + 1)/2, and hence this is also the grand mean. If the null hypothesis of “no treatment effect” is true, then all I! possible arrangements of the ranks within each block are equally likely, so each rank Rij is uniformly distributed on {1, …, I} and has expected value (I + 1)/2. The nonparametric method for analyzing this type of data, known as the Friedman test (developed by the Nobel Prize-winning economist Milton Friedman) relies on the following statistic:
$$F_r = \frac{12}{I(I+1)}\,\mathrm{SSA} = \frac{12}{I(I+1)}\sum_{i=1}^{I}\sum_{j=1}^{J}\left(\bar R_i - \bar R\right)^2 = \frac{12J}{I(I+1)}\sum_{i=1}^{I}\left(\bar R_i - \frac{I+1}{2}\right)^2 \qquad (14.8)$$
As with the Kruskal–Wallis test, the Friedman test rejects H0 when the computed value of the test statistic is too large (an upper-tailed test). For even small-to-moderate values of J (the number of blocks), the test statistic Fr in (14.8) has approximately a chi-squared distribution with I − 1 df.
Example 14.13 (Example 14.12 continued) Applying Expression (14.8) to the profit data, the observed value of the test statistic is

$$f_r = \frac{12(5)}{3(3+1)}\left[(2.4 - 2)^2 + (1.8 - 2)^2 + (1.8 - 2)^2\right] = 1.2$$
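The same statistic can be obtained with base R's friedman.test function; in this minimal sketch the Example 14.12 profits are arranged as a blocks-by-treatments matrix (rows = demand representations, columns = LP models).

```r
profit <- matrix(c(44379, 47825, 47446,
                   69465, 43354, 53393,
                   18317, 17512, 27554,
                   69981, 48707, 45435,
                   32354, 30993, 25662),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(paste0("rep", 1:5), c("ACF", "FLT", "SRD")))

friedman.test(profit)
# Friedman chi-squared of 1.2 on 2 df (P-value of about .55)
```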
Since this value of 1.2 is far less than the critical value $\chi^2_{.10,3-1} = 4.605$, the null hypothesis of equal mean profit for all three linear programming models is certainly not rejected. ■
The Friedman test is used frequently to analyze "expert ranking" data (see, for example, Exercise 46). If each of J individuals ranks I items on some criterion, then the data is naturally of the type for which the Friedman test was devised. Each ranker acts as a block, and the test seeks out significant differences in the mean rank received by each of the I items.
As with the Kruskal–Wallis test, the total sum of squares for the Friedman test is fixed (in fact, it is a simple function of I and J; see Exercise 47(b)). In randomized block ANOVA in Chapter 11, the block sum of squares SSB gave an indication of whether blocking accounted for a significant amount of the total variation in the response values. In the data of Example 14.12, the blocking variable of demand representation clearly has an impact—for instance, the profits in the j = 3 block (middle column) are far lower than in other blocks. Unfortunately, SSB for Friedman's test is identically 0 (Exercise 47(a)), and so the effectiveness of blocking remains unquantified.

Exercises: Section 14.4 (37–48)

37. The article cited in Example 14.10 also reported the fasting C-peptide levels (FCP, nmol/L) for the children in the study.
Diabetes group    Ji    r̄i
1                 26    72.8
2                 32    79.2
3                 65    56.4
4                 17    104.6
(Sample sizes are different here than in Example 14.10 due to missing data in the latter.) Use a Kruskal–Wallis test (as the article’s authors did) to determine whether true average FCP levels differ across these four populations of diabetic children. 38. The article “Analyses of Phenotypic Differentiations among South Georgian
Diving Petrel Populations Reveal an Undescribed and Highly Endangered Species from New Zealand” (PLoS ONE, June 27, 2018) reports the possible discovery of a new species of P. georgicus, distinct from the birds of the same name found in the South Atlantic and South Indian Oceans. The table below summarizes information on the bill length (mm) of birds sampled for the study; bill length distributions are skewed, so a nonparametric method is appropriate. Bird origin S. Atlantic S. Indian New Zealand
Ji     22       38       126
r̄i     121.0    109.3    83.9
Test at the .01 significance level to see whether the mean bird length differs across these three geographic groups of P. georgicus. [The researchers performed over a dozen similar tests and found many features in which the New Zealand petrels are starkly different from the others.] 39. The following data on the fracture load (kN) of Plexiglas at three different loading point locations appeared in the article “Evaluating Fracture Behavior of Brittle Polymeric Materials Using an IASCB Specimen” (J. of Engr. Manuf. 2013: 133–140). Distance 31.2 mm 36.0 mm 42.0 mm
Fracture load 4.78 3.47 2.62
4.41 3.85 2.99
4.91 3.77 3.39
5.06 3.63 2.86
Use a rank-based method to determine whether loading point distance affects true mean fracture load, at the a = .01 level. 40. Dental composites used to fill cavities will decay over time. The article “In vitro Aging Behavior of Dental Composites Considering the Influence of Filler Content, Storage Media and Incubation Time” (PLoS ONE, April 9, 2018) reported on a study to measure the hardness (MPa) of a particular type of resin after 14 days stored in artificial saliva, lactic acid, citric acid, or 40% ethanol. Saliva Lactic Citric Ethanol Saliva Lactic Citric Ethanol
542.18 478.99 427.97 482.96 477.46 568.14 433.59 387.59
508.31 501.15 388.59 451.48 501.71 494.15 353.03 322.55
473.44 488.97 378.01 436.69 513.65 494.99 344.90 277.84
514.33 463.68 341.61 424.42 471.46 483.89 387.09 367.36
488.41 471.14 395.12 465.64 421.90 520.33 501.81 385.75
Use a Kruskal–Wallis test with significance level .05 to determine whether true average hardness differs by liquid medium.
41. Let SST denote the total sum of squares for the Kruskal–Wallis test: $\mathrm{SST} = \sum\sum (R_{ij} - \bar R)^2$. Verify that SST = n(n² − 1)/12. [Hint: The Rij's are a re-arrangement of the integers 1 through
n. Use the formulas for the sum and sum of squares of the first n positive integers.] 42. Show that the two formulas for the Kruskal–Wallis test statistic in Expression (14.7) are identical and both equal the original formula for H. 43. Many people suffer back or neck pain due to bulging discs in the lumbar or cervical spine, but the thoracic spine (the section inbetween) is less well-studied. The article “Kinematic analysis of the space available for cord and disc bulging of the thoracic spine using kinematic magnetic resonance imaging (kMRI)” (The Spine J. 2018: 1122–1127) describes a study using kMRI to measure disc bulge (mm) in neutral, flexion, and extension positions. a. Suppose measurements were taken on just 6 subjects. The following bulge measurements at the T11–T12 disc (bottom of the thoracic spine) are consistent with information in the article: Subject Position Neutral Flexion Extension
1
2
3
4
5
6
1.28 1.29 1.51
0.88 0.76 1.12
0.69 0.43 0.23
1.52 2.11 1.54
0.83 1.07 0.20
2.58 2.18 1.67
Convert these measurements into within-block ranks, and use the Friedman test to determine if the true average disc bulge at T11–T12 varies by position.
b. The study actually involved 105 subjects, each serving as her/his own block. The sums of the ranks for the three positions were neutral = 219, flexion = 222, extension = 189. Use these to perform the Friedman test, and report your conclusion at the .05 significance level.
c. Similar measurements were also taken on all 105 subjects at the T4–T5 disc (top of the thoracic spine); rank sums consistent with the article are 207, 221, and 202. Repeat the test of part (b) for the T4–T5 disc.
44. Is that Yelp review real or fake? The article “A Framework for Fake Review Detection in Online Consumer Electronics Retailers” (Information Processing and Management 2019: 1234–1244) tested five different classification algorithms on a large corpus of Yelp reviews from New York, Los Angeles, Miami, and San Francisco whose authenticity (real or fake) was known. Since review styles differ greatly by city, the researchers used city as a blocking variable. The table below shows the F1 score, a standard measure of classification accuracy, for each algorithm-city pairing. (F1 scores range from 0 to 1, with higher values implying better accuracy.) Algorithm Logistic regression Decision tree Random forest Gaussian Naïve Bayes AdaBoost
NYC
LA
Miami
SF
.79 .81 .82 .72 .83
.73 .74 .78 .69 .79
.78 .81 .80 .71 .82
.77 .81 .80 .69 .82
Test the null hypothesis that the five algorithms are equally accurate in classifying real and fake Yelp reviews at the .10 significance level. 45. Image segmentation is a key tool in computer vision (i.e., helping computers “see” the meaning in pictures). The article “Efficient Quantum Inspired Meta-Heuristics for Multi-Level True Colour Image Thresholding” (Applied Soft Computing 2017: 472–513) reported a study to compare 10 image segmentation algorithms—six conventional, four inspired by quantum computing. Each algorithm was applied to 10 different images, from an elephant to Mono Lake to the Mona Lisa; the images serve as blocks in this study. Kapur’s method, an entropy measure for image segmentation tools, was applied to each (algorithm, image) pair; lower numbers are better. The article reports the following rank averages for the 10 algorithms.
GA 8.30 QIGAMLTCI 2.15 DE 6.60 QIDEMLTCI 2.35
SA 9.10 QISAMLTCI 3.65 BSA 5.90
PSO 8.90 QIPSOMLTCI 1.85 CoDE 6.20
Does the data indicate that the 10 algorithms are not equally effective at minimizing Kapur’s entropy measure? Test at the .01 significance level. What do the rank averages suggest about quantum-inspired versus conventional image segmentation methods? 46. Sustainability in corporate culture is typically described as having three dimensions: economic, environmental, and social. The article “Development of Indicators for the Social Dimension of Sustainability in a U.S. Business Context” (J. of Cleaner Production 2019: 687–697) reported on the development of a survey instrument for the least-studied of these “three pillars,” social sustainability. The researchers had 26 experts take the final version of the survey. In each survey section, participants were asked to rank a set of possible metrics from most important to least important. a. The four metrics listed below were categorized as “public actualization needs.” Use the mean ranks to test at the .05 significance level whether experts systematically prioritize some of these metrics over others with respect to social sustainability. Metric
Mean rank
Ratio of public contributions (e.g., donations) to market capitalization % of public that says company is making the world a better place % of employees that contribute service for the public good Ratio of minority management to minority workforce
1.80 2.50 2.85 2.85
b. Repeat part (a) using the following survey metrics for "public safety and security needs."

Metric                                                                Mean rank
% of employees receiving human rights policy/procedure training      1.69
% of investment agreements that include human rights clauses         2.04
% of company sales that support human/environmental health/safety    2.27
47. a. In Chapter 11, the block sum of squares was defined by $\mathrm{SSB} = I\sum_j (\bar X_{\cdot j} - \bar X_{\cdot\cdot})^2$. Replacing X's with R's in this expression, explain why SSB = 0 for the Friedman test.
b. The total sum of squares for ranks Rij is $\mathrm{SST} = \sum\sum (R_{ij} - \bar R)^2$. Determine SST for the Friedman test. [Hint: Your answer should depend only on I and J.]
48. In the context of a randomized block experiment, let Wi denote the sample rank sum associated with the ith population/treatment. Show that the Friedman's test statistic can be re-expressed as

$$F_r = \frac{12}{I(I+1)J}\sum_{i=1}^{I}\left(W_i - \frac{J(I+1)}{2}\right)^2 = \frac{12}{I(I+1)J}\sum_{i=1}^{I} W_i^2 - 3(I+1)J$$
Supplementary Exercises: (49–58) 49. In a study described in the article “Hyaluronan Impairs Vascular Function and Drug Delivery in a Mouse Model of Pancreatic Cancer” (Gut 2013: 112–120), hyaluronan was depleted from mice using either PEGPH20 or an equal dose of a standard treatment. The vessel patency (%) for each mouse was recorded. PEGPH20 Standard
62 24
68 29
70 35
76 41
Use a rank-sum test (as did the article’s authors) to determine if PEGPH20 yields higher vessel patency than the standard treatment at the .05 level. Would you also reject the null hypothesis at the .01 level? Comment on these results in light of the fact that every PEGPH20 measurement is higher than every standard-treatment measurement. 50. The article “Long Telomeres are Associated with Clonality in Wild Populations of … C. tenuispina” (Heredity 2015: 437– 443) reported the following telomere measurements for (1) a normal arm and (2) a regenerating arm of 12 Mediterranean starfish. Normal 11.246 11.493 11.136 11.120 10.928 11.556 Regen. 11.142 11.047 11.004 11.506 11.067 10.875 Normal 11.313 11.164 10.878 12.680 11.937 11.172 Regen. 11.484 11.517 10.756 10.973 11.078 11.182
It is theorized that such measurements should be smaller for regenerating arms, because the above values are inversely related to telomere length and longer telomeres are associated with younger tissue. Use a nonparametric test to see if the data supports this theory at the .05 significance level. 51. Physicians use a variety of quantitative sensory testing (QST) tools to assess pain in patients, but there is concern about the consistency of such tools. The article “TestRetest Reliability of [QST] in Knee Osteoarthritis and Healthy Participants” (Osteoarthritis and Cartilage 2011: 655– 658) describes a study in which participants’ responses to various stimuli were measured and then re-measured one week later. For example, pressure was applied to each subject’s knee, and the level (kPa) at which the patient first experienced pain was recorded. a. Pressure pain measurements were taken twice on each of 50 patients with osteoarthritis in the examined knee. The Wilcoxon signed-rank test statistic value computed from this paired data was
s+ = 616. Use a two-sided, large-sample test to assess the reliability of the sensory test at the .10 significance level.
b. The same measurements were made of 50 healthy patients, and the resulting test statistic value was s+ = 814. Carry out the test indicated in part (a) using this information. Does the pain pressure test appear reliable for the population of healthy patients?
52. Adding mileage information to roadside amenity signs ("Motel 6 in 1.2 miles") can be helpful but might also increase accidents as drivers strain to read the detailed information at a distance. The article "Evaluation of Adding Distance Information to Freeway-Specific Service (Logo) Signs" (Transp. Engr. 2011: 782–788) provides the following information on number of crashes per year before and after mileage information was added to signs at six locations in Virginia. Location Before After
1 15 16
2 26 24
3 66 42
4 115 80
5 62 78
6 64 73
a. Use a one-sample sign test to determine whether more accidents occur after mileage information is added to roadside amenity signs. Be sure to state the hypotheses, and indicate what assumptions are required. b. Use a signed-rank test to determine whether more accidents tend to occur after mileage information is added to roadside amenity signs. What are the hypotheses now, and what additional assumptions are required? 53. The accompanying observations on axial stiffness index resulted from a study of metal-plate connected trusses in which five different plate lengths—4 in., 6 in., 8 in., 10 in., and 12 in.—were used (“Modeling Joints Made with Light-Gauge Metal Connector Plates,” Forest Products J. 1979: 39–44).
i = 1 (4 in.):    309.2   309.7   311.0   316.8   326.5   349.8   409.5
i = 2 (6 in.):    331.0   347.2   348.9   361.0   381.7   402.1   404.5
i = 3 (8 in.):    351.0   357.1   366.2   367.3   382.0   392.4   409.9
i = 4 (10 in.):   346.7   362.6   384.2   410.6   433.1   452.9   461.4
i = 5 (12 in.):   407.4   410.7   419.9   441.2   441.8   465.8   473.4
Use the Kruskal–Wallis test to decide at significance level .01 whether the true average axial stiffness index depends somehow on plate length. 54. The article “Production of Gaseous Nitrogen in Human Steady-State Conditions” (J. Appl. Physiol. 1972: 155–159) reports the following observations on the amount of nitrogen expired (in liters) under four dietary regimens: (1) fasting, (2) 23% protein, (3) 32% protein, and (4) 67% protein. Use the Kruskal–Wallis test at level .05 to test equality of the corresponding li’s. 1 2 3 4
4.079 4.679 4.368 4.844 4.169 5.059 4.928 5.038
4.859 2.870 5.668 3.578 5.709 4.403 5.608 4.905
3.540 4.648 3.752 5.393 4.416 4.496 4.940 5.208
5.047 3.847 5.848 4.374 5.666 4.688 5.291 4.806
3.298 3.802 4.123 4.674
55. The article “Physiological Effects During Hypnotically Requested Emotions” (Psychosomatic Med. 1963: 334–343) reports the following data (xij) on skin potential in millivolts when the emotions of fear, happiness, depression, and calmness were requested from each of eight subjects. Blocks (subjects)
             1       2       3       4       5       6       7       8
Fear         23.1    57.6    10.5    23.6    11.9    54.6    21.0    20.3
Happiness    22.7    53.2    9.7     19.6    13.8    47.1    13.6    23.6
Depression   22.5    53.7    10.8    21.1    13.7    39.2    13.7    16.3
Calmness     22.6    53.1    8.3     21.6    13.3    37.0    14.8    14.8
Use Friedman’s test to decide whether emotion has an effect on skin potential. 56. In an experiment to study the way in which different anesthetics affect plasma epinephrine concentration, ten dogs were selected, and concentration was measured while they were under the influence of the anesthetics isoflurane, halothane, and cyclopropane (“Sympathoadrenal and Hemodynamic Effects of Isoflurane, Halothane, and Cyclopropane in Dogs,” Anesthesiology 1974: 465–470). Test at level .05 to see whether there is an anesthetic effect on concentration. Dog
               1      2      3      4      5      6      7      8      9      10
Isoflurane     .28    .51    1.00   .39    .29    .36    .32    .69    .17    .33
Halothane      .30    .39    .63    .38    .21    .88    .39    .51    .32    .42
Cyclopropane   1.07   1.35   .69    .28    1.24   1.53   .49    .56    1.02   .30
57. Suppose we wish to test
H0: the X and Y distributions are identical    versus
Ha: the X distribution is less spread out than the Y distribution
The accompanying figure pictures X and Y distributions for which Ha is true. The Wilcoxon rank-sum test is not appropriate in this situation because when Ha is true as pictured, the Y's will tend to be at the extreme ends of the combined sample (resulting in small and large Y ranks), so the sum of X ranks will result in a W value that is neither large nor small.
[Figure: density curves of the X distribution and the more spread out Y distribution; not reproduced here]
Consider modifying the procedure for assigning ranks as follows: After the combined sample of m + n observations is ordered, the smallest observation is given rank 1, the largest observation is given rank 2, the second smallest is given rank 3, the second largest is given rank 4, and so on. Then if Ha is true as pictured, the X values will tend to be in the middle of the sample and thus receive large ranks. Let W′ denote the sum of the X ranks and consider rejecting H0 in favor of Ha when w′ ≥ c. When H0 is true, every possible set of X ranks has the same probability, so W′ has the same distribution as does W when H0 is true. Thus c can be chosen from Appendix Table A.13 to yield a level α test. The accompanying data refers to medial muscle thickness for arterioles from the lungs of children who died from sudden infant death syndrome (x's) and a control group of children (y's). Carry out the test of H0 versus Ha at level .05.
SIDS       4.0   4.4   4.8   4.9
Control    3.7   4.1   4.3   5.1   5.6
[Note: Consult the Lehmann book in the bibliography for more information on this test, called the Siegel–Tukey test.]
58. The ranking procedure described in the previous exercise is somewhat asymmetric, because the smallest observation receives rank 1 whereas the largest receives rank 2, and so on. Suppose both the smallest and the largest receive rank 1, the second smallest and second largest receive rank 2, and so on, and let W″ be the sum of the X ranks. The null distribution of W″ is not identical to the null distribution of W, so different tables are needed. Consider the case m = 3, n = 4. List all 35 possible orderings of the three X values among the seven observations (e.g., 1, 3, 7 or 4, 5, 6), assign ranks in the manner described, compute the value of W″ for each possibility, and then tabulate the null distribution of W″. For the test that rejects if w″ ≥ c, what value of c prescribes approximately a level .10 test? [Note: This is the Ansari–Bradley test; for additional information, see the book by Hollander and Wolfe in the bibliography.]
15  Introduction to Bayesian Estimation
Introduction
In this final chapter, we briefly introduce the Bayesian approach to parameter estimation. The standard frequentist view of inference is that the parameter of interest, θ, has a fixed but unknown value. Bayesians, however, regard θ as a random variable having a prior probability distribution that incorporates whatever is known about its value. Then to learn more about θ, a sample from the conditional distribution f(x|θ) is obtained, and Bayes' theorem is used to produce the posterior distribution of θ given the data x1, …, xn. All Bayesian methods are based on this posterior distribution.
15.1 Prior and Posterior Distributions
Throughout this book, we have regarded parameters such as µ, σ, p, and λ as having an unknown but single, fixed value. This is often referred to as the classical or frequentist approach to statistical inference. However, there is a different paradigm, called subjective or Bayesian inference, in which an unknown parameter is assigned a distribution of possible values, analogous to a probability distribution. This distribution reflects all available information—past experience, intuition, common sense—about the value of the parameter prior to observing the data. For this reason, it is called the prior distribution of the parameter.

DEFINITION
A prior distribution for a parameter θ, denoted π(θ), is a probability distribution on the set of possible values for θ. In particular, if the possible values of the parameter θ form an interval I, then π(θ) is a pdf that must satisfy

$$\int_I \pi(\theta)\, d\theta = 1$$

Similarly, if θ is potentially any value in a discrete set D, then π(θ) is a pmf that must satisfy

$$\sum_{\theta \in D} \pi(\theta) = 1$$
Example 15.1 Consider the parameter µ = the mean GPA of all students at your university. Since GPAs are always between 0.0 and 4.0, µ must also lie in this interval. But common sense tells you that µ is almost certainly not below 2.0, or very few people would graduate, and it would be likewise surprising to find µ above 3.5. This "prior belief" can be expressed mathematically as a prior distribution for µ on the interval I = [0, 4]. If our best guess a priori is that µ ≈ 2.5, then our prior distribution π(µ) should be centered around 2.5. The variability of the prior distribution we select should reflect how sure we feel about our initial information. If we feel very sure that µ is near 2.5, then we should select a prior distribution for µ that has less variation around that value. On the other hand, if we are less certain, this can be reflected by a prior distribution with much greater variability. Figure 15.1 illustrates these two cases; both of the pdfs depicted are beta distributions with A = 0 and B = 4.
Figure 15.1 Two prior distributions for a parameter: a more diffuse prior (less certainty) and a more concentrated prior (more certainty)
■
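As a quick illustration of how such priors might be drawn, the following R sketch plots two beta densities rescaled to the interval [0, 4], both centered at 2.5; the particular shape parameters are hypothetical choices, not values used in the text.

```r
# If U ~ Beta(a, b) on (0, 1), then mu = 4*U has density dbeta(mu/4, a, b)/4 on (0, 4)
curve(dbeta(x / 4, 15, 9) / 4, from = 0, to = 4,
      xlab = expression(mu), ylab = expression(pi(mu)))   # narrower prior (more certainty)
curve(dbeta(x / 4, 5, 3) / 4, add = TRUE, lty = 2)        # wider prior (less certainty)
```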
The Posterior Distribution of a Parameter
The key to Bayesian inference is having a mathematically rigorous way to combine the sample data with prior belief. Suppose we observe values x1, …, xn from a distribution depending on the unknown parameter θ for which we have selected some prior distribution. Then a Bayesian statistician wants to "update" her or his belief about the distribution of θ, taking into account both prior belief and the observed xi's. Recall from Chapter 2 that Bayes' theorem was used to obtain posterior probabilities of partitioning events A1, …, Ak conditional on the occurrence of some other event B. The following definition relies on the analogous result for random variables.

DEFINITION
Suppose X1, …, Xn have joint pdf f(x1, …, xn; θ) and the unknown parameter θ has been assigned a continuous prior distribution π(θ). Then the posterior distribution of θ, given the observations X1 = x1, …, Xn = xn, is
$$\pi(\theta \mid x_1,\ldots,x_n) = \frac{\pi(\theta)\, f(x_1,\ldots,x_n;\theta)}{\int_{-\infty}^{\infty} \pi(\theta)\, f(x_1,\ldots,x_n;\theta)\, d\theta} \qquad (15.1)$$
The integral in the denominator of (15.1) insures that the posterior distribution is a valid probability density with respect to θ. If X1, …, Xn are discrete, the joint pdf is replaced by their joint pmf. Notice that constructing the posterior distribution of a parameter requires a specific probability model f(x1, …, xn; θ) for the observed data. In Example 15.1, it would not be enough to simply observe the GPAs of a random sample of n students; one must specify the underlying distribution, with mean µ, from which those GPAs are drawn.
Example 15.2 Emissions of subatomic particles from a radiation source are often modeled as a Poisson process. This implies that the time between successive emissions follows an exponential distribution. In practice, the parameter λ of this distribution is typically unknown. If researchers believe a priori that the average time between emissions is about half a second, so λ ≈ 2, a prior distribution with a mean around 2 might be selected for λ. One example is the following gamma distribution, which has mean (and variance) of 2:

$$\pi(\lambda) = \lambda e^{-\lambda} \qquad \lambda > 0$$

Notice that the gamma distribution has support equal to (0, ∞), which is also the set of possible values for the unknown parameter λ. The times X1, …, X5 between five particle emissions will be recorded; it is these variables that have an exponential distribution with the unknown parameter λ (equivalently, mean 1/λ). Because the Xi's are also independent, their joint pdf is

$$f(x_1,\ldots,x_5;\lambda) = f(x_1;\lambda)\cdots f(x_5;\lambda) = \lambda e^{-\lambda x_1}\cdots \lambda e^{-\lambda x_5} = \lambda^5 e^{-\lambda\sum x_i}$$

Applying (15.1) with these two components, the posterior distribution of λ given the observed data is

$$\pi(\lambda \mid x_1,\ldots,x_5) = \frac{\pi(\lambda)\, f(x_1,\ldots,x_5;\lambda)}{\int_0^\infty \pi(\lambda)\, f(x_1,\ldots,x_5;\lambda)\, d\lambda} = \frac{\lambda e^{-\lambda}\cdot \lambda^5 e^{-\lambda\sum x_i}}{\int_0^\infty \lambda e^{-\lambda}\cdot \lambda^5 e^{-\lambda\sum x_i}\, d\lambda} = \frac{\lambda^6 e^{-\lambda(1+\sum x_i)}}{\int_0^\infty \lambda^6 e^{-\lambda(1+\sum x_i)}\, d\lambda}$$

Suppose the five observed interemission times are x1 = 0.66, x2 = 0.48, x3 = 0.44, x4 = 0.71, x5 = 0.56. The sum of these five times is Σxi = 2.85, and so the posterior distribution simplifies to

$$\pi(\lambda \mid 0.66,\ldots,0.56) = \frac{\lambda^6 e^{-3.85\lambda}}{\int_0^\infty \lambda^6 e^{-3.85\lambda}\, d\lambda} = \frac{3.85^7}{6!}\,\lambda^6 e^{-3.85\lambda} \qquad \lambda > 0$$

The integral in the denominator was evaluated using the gamma integral formula (4.5) from Chapter 4; as noted previously, the purpose of this integral is to guarantee that the posterior distribution of λ is a valid probability density. As a function of λ, we recognize this as a gamma distribution with parameters α = 7 and β = 1/3.85. The prior and posterior density curves of λ appear in Figure 15.2.
Figure 15.2 Prior and posterior distributions of λ for Example 15.2
■
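A short R sketch reproduces Figure 15.2 and the posterior mean: the prior π(λ) = λe^(−λ) is Gamma(shape = 2, rate = 1), and the posterior derived above is Gamma(shape = 7, rate = 3.85).

```r
times <- c(0.66, 0.48, 0.44, 0.71, 0.56)
shape_post <- 2 + length(times)   # 7, as in the example
rate_post  <- 1 + sum(times)      # 3.85

curve(dgamma(x, shape = 2, rate = 1), from = 0, to = 10, lty = 2,
      xlab = expression(lambda), ylab = "Density")                    # prior
curve(dgamma(x, shape = shape_post, rate = rate_post), add = TRUE)    # posterior
shape_post / rate_post            # posterior mean, about 1.82
```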
Example 15.3 A 2010 National Science Foundation study found that 488 out of 939 surveyed adults incorrectly believe that antibiotics kill viruses (they only kill bacteria). Let θ denote the proportion of all U.S. adults that hold this mistaken view. Imagine that an NSF researcher, in advance of administering the survey, believed (hoped?) the value of θ was roughly 1 in 3, but he was very uncertain about this belief. Since any proportion must lie between 0 and 1, the standard beta family of distributions from Section 4.5 provides a natural source of priors for θ. One such beta distribution, with an expected value of 1/3, is the Beta(2, 4) model whose pdf is

$$\pi(\theta) = 20\theta(1-\theta)^3 \qquad 0 < \theta < 1$$

The data mentioned at the beginning of the example can be considered either a random sample of size 939 from the Bernoulli distribution or, equivalently, a single observation from the binomial distribution with n = 939. Let Y = the number of U.S. adults in a random sample of 939 that believe antibiotics kill viruses. Then Y ~ Bin(939, θ), and the pmf of Y is $p(y;\theta) = \binom{939}{y}\theta^y(1-\theta)^{939-y}$. Substituting the observed value y = 488, (15.1) gives the posterior distribution of θ as

$$\pi(\theta \mid Y = 488) = \frac{\pi(\theta)\, p(488;\theta)}{\int_0^1 \pi(\theta)\, p(488;\theta)\, d\theta} = \frac{20\theta(1-\theta)^3\binom{939}{488}\theta^{488}(1-\theta)^{451}}{\int_0^1 20\theta(1-\theta)^3\binom{939}{488}\theta^{488}(1-\theta)^{451}\, d\theta} = \frac{\theta^{489}(1-\theta)^{454}}{\int_0^1 \theta^{489}(1-\theta)^{454}\, d\theta} = c\,\theta^{489}(1-\theta)^{454} \qquad 0 < \theta < 1$$

Recall that the constant c, which equals the reciprocal of the integral in the denominator, serves to insure that the posterior distribution π(θ|Y = 488) integrates to 1. Rather than evaluating the integral, we can simply recognize the expression $\theta^{489}(1-\theta)^{454}$ as a standard beta distribution, specifically with parameters α = 490 and β = 455, that's just missing the constant of integration in front.
It follows that the posterior distribution of θ given Y = 488 must be Beta(490, 455); if we require c, it can be determined directly from the beta pdf. This trick comes in handy quite often in Bayesian statistics: if we can recognize a posterior distribution as being proportional to a particular probability distribution, then it must necessarily be that distribution. The prior and posterior density curves for θ are displayed in Figure 15.3. While the prior distribution is centered around 1/3 and exhibits a great deal of uncertainty (variability), the posterior distribution of θ is centered much closer to the sample proportion of incorrect answers, 488/939 ≈ .52, with considerably less uncertainty.
Figure 15.3 Density curves for the parameter θ in Example 15.3: (a) prior Beta(2, 4), (b) posterior Beta(490, 455) ■
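Here is a minimal R sketch of the Example 15.3 update; it simply evaluates the prior and posterior beta densities and the posterior mean (the plotting choices are incidental).

```r
a0 <- 2; b0 <- 4          # Beta(2, 4) prior
y <- 488; n <- 939        # NSF survey data
a1 <- a0 + y              # posterior parameters: Beta(490, 455)
b1 <- b0 + n - y

curve(dbeta(x, a0, b0), from = 0, to = 1, lty = 2,
      xlab = expression(theta), ylab = "Density")   # prior
curve(dbeta(x, a1, b1), add = TRUE)                 # posterior
a1 / (a1 + b1)            # posterior mean, about .519
```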
Conjugate Priors In the examples of this section, prior distributions were chosen partially by matching the mean of a distribution to someone’s a priori “best guess” about the value of the parameter. We also mentioned at the beginning of the section that the variance of the prior distribution often reflects the strength of that belief. In practice, there is a third consideration for choosing a prior distribution: the ability to apply (15.1) in a simple fashion. Ideally, we would like to choose a prior distribution from a family (gamma, beta, etc.) such that the posterior distribution is from that same family. When this happens we say that the prior distribution is conjugate to the data distribution. In Example 15.2, the prior p(k) is the Gamma(2, 1) pdf; we determined, using (15.1), that the posterior distribution was Gamma(7, 1/3.85). It can be shown in general (Exercise 6) that any gamma distribution is conjugate to an exponential data distribution. Similarly, the prior and posterior distributions of h in Example 15.3 were Beta(2, 4) and Beta(490, 455), respectively. The following proposition generalizes the result of Example 15.3. PROPOSITION
Let X1, …, Xn be a random sample from a Bernoulli distribution with unknown parameter value p. (Equivalently, let Y = ΣXi be a single observation from a Bin(n, p) distribution.) If p is assigned a beta prior distribution with parameters α0 and β0, then the posterior distribution of p given the xi's is the beta distribution with parameters α = α0 + y and β = β0 + n − y. That is, the beta distribution is a conjugate prior to the Bernoulli (or binomial) data model.
Proof The joint Bernoulli pmf of X1, …, Xn is

$$p^{x_1}(1-p)^{1-x_1}\cdots p^{x_n}(1-p)^{1-x_n} = p^{\sum x_i}(1-p)^{n-\sum x_i} = p^{y}(1-p)^{n-y}$$

The prior distribution assigned to p is $\pi(p) \propto p^{\alpha_0-1}(1-p)^{\beta_0-1}$. Apply (15.1):

$$\pi(p \mid x_1,\ldots,x_n) \propto p^{\alpha_0-1}(1-p)^{\beta_0-1}\, p^{\sum x_i}(1-p)^{n-\sum x_i} = p^{\alpha_0-1+\sum x_i}(1-p)^{\beta_0-1+n-\sum x_i} = p^{\alpha_0+y-1}(1-p)^{\beta_0+n-y-1}$$

As a function of p, we recognize this last expression as proportional to the beta pdf with parameters α = α0 + y and β = β0 + n − y. ■
The values α0 and β0 in the foregoing proposition are called hyperparameters; they are the parameters of the distribution assigned to the original parameter, p. In Example 15.2, the prior distribution $\pi(\lambda) = \lambda e^{-\lambda}$ is the Gamma(2, 1) pdf, so the hyperparameters of that distribution are α0 = 2 and β0 = 1. Conjugate priors have been determined for several of the named data distributions, including binomial, Poisson, gamma, and normal. For two-parameter families such as gamma and normal, it is sometimes reasonable to assume one parameter has a known value and then assign a prior distribution to the other. In some instances, a joint prior distribution for the two parameters can be found such that the posterior distribution is tractable, but these are less common.

PROPOSITION
Let X1, …, Xn be a random sample from a N(µ, σ) distribution with σ known. If µ is assigned a normal prior distribution with hyperparameters µ0 and σ0, then the posterior distribution of µ is also normal, with posterior hyperparameters

$$\sigma_1^2 = \frac{1}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}} \qquad\qquad \mu_1 = \left(\frac{n\bar x}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)\sigma_1^2$$

That is, the normal distribution is a conjugate prior for µ in the normal data model when σ is assumed known.
Proof To determine the posterior distribution of µ, apply (15.1):

$$\pi(\mu \mid x_1,\ldots,x_n) \propto \pi(\mu) f(x_1,\ldots,x_n;\mu) = \frac{1}{\sqrt{2\pi}\,\sigma_0}e^{-(\mu-\mu_0)^2/2\sigma_0^2}\cdot\frac{1}{\sqrt{2\pi}\,\sigma}e^{-(x_1-\mu)^2/2\sigma^2}\cdots\frac{1}{\sqrt{2\pi}\,\sigma}e^{-(x_n-\mu)^2/2\sigma^2}$$
$$\propto e^{-(1/2)\left[(x_1-\mu)^2/\sigma^2 + \cdots + (x_n-\mu)^2/\sigma^2 + (\mu-\mu_0)^2/\sigma_0^2\right]}$$
The trick here is to complete the square in the exponent, which yields

$$\pi(\mu \mid x_1,\ldots,x_n) \propto e^{-(\mu-\mu_1)^2/2\sigma_1^2 + C}, \quad\text{where } C \text{ does not involve } \mu \text{ and}$$

$$\sigma_1^2 = \frac{1}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}} \qquad\qquad \mu_1 = \frac{\dfrac{\sum x_i}{\sigma^2} + \dfrac{\mu_0}{\sigma_0^2}}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}} = \frac{\dfrac{n\bar x}{\sigma^2} + \dfrac{\mu_0}{\sigma_0^2}}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}}$$
With respect to µ, the last expression for π(µ|x1, …, xn) is proportional to the normal pdf with parameters µ1 and σ1. ■
To make sense of these messy parameter expressions we define the precision, denoted by τ, as the reciprocal of the variance (because a lower variance implies a more precise measurement), and the weights then are the corresponding precisions. If we let τ = 1/σ², τ0 = 1/σ0², and $\tau_{\bar X} = 1/\sigma_{\bar X}^2 = n/\sigma^2$, then the posterior hyperparameters in the previous proposition can be restated as

$$\mu_1 = \frac{\tau_{\bar X}\,\bar x + \tau_0\,\mu_0}{\tau_{\bar X} + \tau_0} \qquad\text{and}\qquad \tau_1 = \tau_{\bar X} + \tau_0$$
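As a small numerical illustration of the precision-weighted form (all numbers below are hypothetical, not from the text), the update is easy to compute directly in R:

```r
x <- c(2.9, 3.4, 3.1, 2.7, 3.3)    # made-up sample
sigma  <- 0.5                      # data standard deviation, assumed known
mu0    <- 2.5; sigma0 <- 0.4       # prior mean and sd for mu

tau_xbar <- length(x) / sigma^2    # precision of the sample mean
tau0     <- 1 / sigma0^2           # prior precision
mu1  <- (tau_xbar * mean(x) + tau0 * mu0) / (tau_xbar + tau0)   # posterior mean
tau1 <- tau_xbar + tau0                                         # posterior precision
c(posterior_mean = mu1, posterior_sd = 1 / sqrt(tau1))
```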
The posterior mean l1 is a weighted average of the prior mean l0 and the data mean x, and the posterior precision is the sum of the prior precision plus the precision of the sample mean. Exercises: Section 15.1 (1–10) 1. A certain type of novelty coin is manufactured so that 80% of the coins are fair while the rest have a .75 chance of landing heads. Let h denote the probability of heads for a novelty coin randomly selected from this population. a. Express the given information as a prior distribution for the parameter h. b. Five tosses of the randomly selected coin result in the sequence HHHTH. Use this data to determine the posterior distribution of h. 2. Three assembly lines for the same product have different nonconformance rates: p = .1 for Line A, p = .15 for Line B, and p = .2 for Line C. One of the three lines will be selected at random (but you don’t know which). Let X = the number of items inspected from the selected line until a nonconforming one is found. a. What is the distribution of X, as a function of the unknown p?
b. Express the given information as a prior distribution for the parameter p. [Hint: There are three possible values for p. What should be their a priori likelihoods?] c. It is determined that the 8th item coming off the randomly selected line is the first nonconforming one. Use this information to determine the posterior distribution of p. 3. The number of customers arriving during a one-hour period at an ice cream shop is modeled by a Poisson distribution with unknown parameter µ. Based on past experience, the owner believes that the average number of customers in one hour is about 15. a. Assign a prior to µ from the gamma family of distributions, such that the mean of the prior is 15 and the standard deviation is 5 (reflecting moderate uncertainty).
896
15
b. The number of customers in ten randomly selected one-hour intervals is recorded: 16 9
11 13 17
17 8
15
14 16
      Determine the posterior distribution of µ.

4. At a children's party, kids toss ping-pong balls into the swimming pool hoping to land inside a plastic ring (picture a small hula hoop). Let p denote the probability of successfully tossing a ball into the ring, which we will assign a Beta(1, 3) prior distribution. The variable X = number of tosses required to get 5 balls in the ring (at which point the child wins a prize) will be observed for a sample of children.
   a. What is a reasonable distribution for the rv X? What are its parameters?
   b. The number of tosses required by eight kids was
      12  8  14  12  12  6  8  27
      Determine the posterior distribution of p.
5. Consider a random sample X₁, X₂, …, Xₙ from the Poisson distribution with mean µ. If the prior distribution for µ is a gamma distribution with hyperparameters α₀ and β₀, show that the posterior distribution is also gamma distributed. What are its hyperparameters?

6. Suppose you have a random sample X₁, X₂, …, Xₙ from the exponential distribution with parameter λ. If a gamma distribution with hyperparameters α₀ and β₀ is assigned as the prior distribution for λ, show that the posterior distribution is also gamma distributed. What are its hyperparameters?
7. Let X₁, …, Xₙ be a random sample from a negative binomial distribution with r known and p unknown. Assume a Beta(α₀, β₀) prior distribution for p. Show that the posterior distribution of p is also a beta distribution, and identify the updated hyperparameters.

8. Consider a random sample X₁, X₂, …, Xₙ from the normal distribution with mean 0 and precision τ (use τ as a parameter instead of σ² = 1/τ). Assume a gamma-distributed prior for τ and show that the posterior distribution of τ is also gamma. What are its parameters?

9. Wind speeds in a certain area are modeled using a lognormal distribution, with unknown first parameter µ and known second parameter σ = 1. Suppose µ is assigned a normal prior distribution with mean µ₀ and precision τ₀. Based on observing a random sample of wind speeds x₁, …, xₙ in this area, determine the posterior distribution of µ.

10. Wait times for Uber rides as people exit a certain sports arena are uniformly distributed on the interval [0, θ] with θ unknown. Suppose the following Pareto prior distribution is assigned to θ:
      p(θ) = 24,000/θ⁴    for θ ≥ 20
    Based on observing the wait times x₁, …, xₙ of n Uber customers, determine the posterior distribution of θ. [Hint: Some care must be taken to address the boundaries θ ≥ 20 and xᵢ ≤ θ.]
15.2 Bayesian Point and Interval Estimation
The previous section introduced the paradigm of Bayesian inference, wherein parameters are not just regarded as unknown but as having a distribution of possible values prior to observing any data. Such prior distributions are, by definition, valid probability distributions with respect to the parameter θ. The key to Bayesian inference is Equation (15.1), which applies Bayes' theorem for random variables to θ and the sample X₁, …, Xₙ. The result is an update to our belief about θ, called the posterior distribution. From a Bayesian perspective, the posterior distribution of θ represents the most complete expression of what can be inferred from the sample data.
But the posterior distribution can give rise to point and interval estimates for the parameter θ, although the interpretation of the latter differs from that of our earlier confidence intervals.

Bayesian Point Estimators

Although specifying a single-value estimate of an unknown parameter conflicts somewhat with the Bayesian philosophy, there are occasions when such an estimate is desired. The most common Bayesian point estimator of a parameter θ is the mean of its posterior distribution:

θ̂ = E(θ | X₁, …, Xₙ)    (15.2)

Hereafter we shall refer to Expression (15.2) as the Bayes estimator of θ. A Bayes estimate is obtained by plugging in the observed values of the Xᵢ's, resulting in the numerical value θ̂ = E(θ | x₁, …, xₙ).

Example 15.4 (Example 15.2 continued) The posterior distribution of the exponential parameter λ given the five observed interemission times was determined to be gamma with parameters α = 7 and β = 1/3.85. Since the mean of a gamma distribution is αβ, the Bayes estimate of λ here is

λ̂ = E(λ | 0.66, …, 0.56) = αβ = 7(1/3.85) = 1.82

This isn't too different from the researchers' prior belief that λ ≈ 2. If we retrace the steps that led to this posterior distribution, we find more generally that for data model X₁, …, Xₙ ~ exponential(λ) and prior λ ~ gamma(α₀, β₀), the posterior distribution of λ is the gamma pdf with α = α₀ + n and 1/β = 1/β₀ + Σxᵢ. Therefore, the Bayes estimator for λ in this scenario is

λ̂ = E(λ | X₁, …, Xₙ) = αβ = (α₀ + n)/(1/β₀ + ΣXᵢ)    ■
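Quantities like this are quick to obtain with statistical software. Here is a minimal Python/SciPy sketch for Example 15.4 (the choice of Python is ours, purely for illustration; the text does not prescribe a package): the Bayes estimate is simply the mean of the Gamma(7, 1/3.85) posterior.

```python
from scipy.stats import gamma

# Posterior from Example 15.4: shape alpha = 7, scale beta = 1/3.85
posterior = gamma(a=7, scale=1/3.85)
print(posterior.mean())   # alpha*beta = 7/3.85 = 1.82, the Bayes estimate of lambda
```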
Although the mean of the posterior distribution is commonly used as a point estimate for a parameter in Bayesian inference, that is not the only available choice. Some practitioners prefer to use the mode of the posterior distribution rather than the mean; this choice is called the maximum a posteriori (MAP) estimate of θ. For small samples, the Bayes estimate (i.e., mean) and MAP estimate can differ considerably. Typically, though, when n is large these estimates for the parameter will be reasonably close. This makes sense intuitively, since as n increases any sensible estimator should converge to the true, single value of the parameter (i.e., be consistent).

Example 15.5 (Example 15.3 continued) The posterior distribution of the parameter θ = the proportion of all U.S. adults that incorrectly believe antibiotics kill viruses was determined to have a Beta(490, 455) distribution. Since the mean of a Beta(α, β) distribution is α/(α + β), a point estimate of θ is

θ̂ = E(θ | y = 488) = 490/(490 + 455) = 490/945 = .5185

It can be shown that the mode of a beta distribution occurs at (α − 1)/(α + β − 2) provided that α > 1 and β > 1. Hence the MAP estimate of θ here is (490 − 1)/(490 + 455 − 2) = 489/943 = .5186. Notice these are both quite close to the frequentist estimate y/n = 488/939 = .5197. ■
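A short sketch in the same illustrative Python style confirms the two Bayesian point estimates in Example 15.5:

```python
from scipy.stats import beta

a, b = 490, 455                      # posterior hyperparameters from Example 15.5
bayes_est = beta(a, b).mean()        # posterior mean a/(a+b) = .5185
map_est = (a - 1) / (a + b - 2)      # posterior mode = .5186 (requires a > 1 and b > 1)
print(round(bayes_est, 4), round(map_est, 4))
```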
Properties of Bayes Estimators

In most cases, the contribution of the observed values x₁, …, xₙ in shaping the posterior distribution of a parameter θ increases as the sample size n increases. Equivalently, the choice of prior distribution is less impactful for large samples, because the data "dominates" that original choice. It can be shown that under very general conditions, as n → ∞ (1) the mean of the posterior distribution will converge to the true value of θ, and (2) the variance of the posterior distribution of θ converges to zero. The second property manifests itself in our two previous examples: the variability of the posterior distribution of λ based on n = 5 observations was still rather substantial, while the posterior distribution of θ based on a sample of size n = 939 was quite concentrated. In the language of Chapter 7, these two properties imply that Bayes estimators are generally consistent.

Since traditional estimators such as p̂ and X̄ converge to the true values of the corresponding parameters (e.g., p or µ) by the Law of Large Numbers, it follows that Bayesian and frequentist estimates will typically be quite close when n is large. This is true both for the point estimates and the interval estimates (Bayesian intervals will be introduced shortly). But when n is small—a common occurrence in Bayesian methodology—parameter estimates based on the two methods can differ drastically. This is especially true if the researcher's prior belief is very far from what's actually true (e.g., believing a proportion is around 1/3 when it's really greater than .5).

Example 15.6 Consider a Bernoulli(p) random sample X₁, …, Xₙ or, equivalently, a single binomial observation Y = ΣXᵢ. If we assign a Beta(α₀, β₀) prior to p, a proposition from the previous section establishes that the posterior distribution of p given that Y = y is Beta(α₀ + y, β₀ + n − y). Hence the Bayes estimator of p is

p̂ = E(p | X₁, …, Xₙ) = E(p | Y) = (α₀ + Y)/[(α₀ + Y) + (β₀ + n − Y)] = (α₀ + Y)/(α₀ + β₀ + n)

One way to think about the prior distribution here is that it "seeds" the sample with α₀ successes and β₀ failures before data is obtained. The quantities α₀ + Y and β₀ + n − Y then represent the number of successes and failures after sampling, and the Bayes estimator p̂ is the sample proportion of successes from this perspective. With a little algebra, we can re-express the Bayes estimator as

p̂ = α₀/(α₀ + β₀ + n) + Y/(α₀ + β₀ + n) = [(α₀ + β₀)/(α₀ + β₀ + n)]·[α₀/(α₀ + β₀)] + [n/(α₀ + β₀ + n)]·(Y/n)    (15.3)
Expression (15.3) represents the Bayesian estimator p̂ as a weighted average of the prior expectation of p, α₀/(α₀ + β₀), and the sample proportion of successes Y/n = ΣXᵢ/n. By the Law of Large Numbers, the sample proportion of successes Y/n converges to the true value of the Bernoulli parameter, which we will denote by p*. Taking the limit of (15.3) as n → ∞ yields

p̂ → 0·[α₀/(α₀ + β₀)] + 1·p* = p*,

so that the Bayes estimator is indeed consistent. ■
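The weighted-average form (15.3) is easy to see numerically. The sketch below (again in Python purely for illustration; the prior hyperparameters α₀ = 2, β₀ = 8 and the value p* = .55 are hypothetical choices, not taken from the text) simulates Y for increasing n and compares the Bayes estimate with the sample proportion:

```python
import numpy as np

rng = np.random.default_rng(1)
a0, b0 = 2, 8          # hypothetical Beta prior: "seeds" 2 successes and 8 failures
p_star = 0.55          # hypothetical true value of the Bernoulli parameter

for n in (10, 100, 10_000):
    y = rng.binomial(n, p_star)                 # observed number of successes
    p_bayes = (a0 + y) / (a0 + b0 + n)          # posterior mean, Expression (15.3)
    print(n, y / n, round(p_bayes, 4))          # both approach p_star as n grows
```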
Bayesian Interval Estimation

The Bayes estimate θ̂ provides a single-value "best guess" for the true value of the parameter θ based on its posterior distribution. An interval [a, b] having posterior probability .95 gives a 95% credible interval, the Bayesian analogue of a 95% confidence interval (but the interpretation is different). Typically one selects the middle 95% of the posterior distribution; i.e., the endpoints of a 95% credible interval are ordinarily the .025 and .975 quantiles of the posterior distribution.

Example 15.7 (Example 15.4 continued) Given the observed values of X₁, …, X₅, we previously found that the emission rate parameter λ has a Gamma(7, 1/3.85) posterior distribution. A 95% credible interval for λ requires determining the .025 and .975 quantiles of the Gamma(7, 1/3.85) model. Using statistical software, η.025 = 0.7310 and η.975 = 3.3921, so the 95% credible interval for λ is (0.7310, 3.3921). Under the Bayesian interpretation, having observed the five aforementioned interemission times, there is a 95% posterior probability that λ is between 0.7310 and 3.3921 emissions per second. Taking reciprocals, the mean time between emissions (i.e., 1/λ) is estimated to lie in the interval (1/3.3921, 1/0.7310) = (0.295, 1.368) seconds with posterior probability .95. ■

Example 15.8 (Example 15.5 continued) The posterior distribution of the parameter θ = the proportion of all U.S. adults that incorrectly believe antibiotics kill viruses was a Beta(490, 455) distribution. The .025 and .975 quantiles of this beta distribution are η.025 = .4866 and η.975 = .5503. So, after observing the results of the NSF survey, there is a 95% posterior probability that θ is between .4866 and .5503. For comparison, the one-proportion z interval based on y/n = 488/939 = .5197 is

.5197 ± 1.96·√(.5197(1 − .5197)/939) = (.4877, .5517)

Due to the large sample size, the two intervals are quite similar. ■
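The quantiles quoted in Examples 15.7 and 15.8 come from software; a minimal Python/SciPy sketch (our own illustration, not the package behind the text's output) reproduces them along with the z interval:

```python
import numpy as np
from scipy.stats import gamma, beta

# Example 15.7: middle 95% of the Gamma(7, 1/3.85) posterior for lambda
print(gamma(a=7, scale=1/3.85).ppf([0.025, 0.975]))   # approx (0.7310, 3.3921)

# Example 15.8: middle 95% of the Beta(490, 455) posterior for theta
print(beta(490, 455).ppf([0.025, 0.975]))             # approx (.4866, .5503)

# One-proportion z interval for comparison
phat, n = 488/939, 939
print(phat + np.array([-1, 1]) * 1.96 * np.sqrt(phat*(1 - phat)/n))   # (.4877, .5517)
```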
It must be emphasized that, even if the confidence interval is nearly the same as the credible interval for a parameter, they have different interpretations. To interpret the Bayesian credible interval, we say that there is a 95% probability that the parameter θ is in the interval. However, for the frequentist confidence interval such a probability statement does not make sense: as we discussed in Section 8.1, neither the parameter θ nor the endpoints of the interval are considered random under the frequentist view. (Instead, the confidence level is the long-run capture frequency if the formula is used repeatedly on different samples.)

Example 15.9 Consider the IQ scores of 18 first-grade boys, from the private speech data introduced in Exercise 81 from Chapter 1:

113  108  140  113  115  146  136  107  108  119  132  127  118  108  103  103  122  111
IQ scores are generally found to be normally distributed, and because IQs have a standard deviation of 15 nationwide, we can assume σ = 15 is known and valid here. Let's perform a Bayesian analysis on the mean IQ µ of all first-grade boys at the school. For the normal prior distribution it is reasonable to use a mean of µ₀ = 110, a ballpark figure for previous years in this school. It is harder to prescribe a standard deviation for the prior, but we will use σ₀ = 7.5. (This is the standard deviation for the average of four independent observations if the individual standard deviation is 15. As a result, the effect on the posterior mean will turn out to be the same as if there were four additional observations with average 110.)
The last proposition from Section 15.1 states that the posterior distribution of µ is also normal and specifies the posterior hyperparameters. Numerically, we have

1/σ₁² = n/σ² + 1/σ₀² = 18/15² + 1/7.5² = .09778 = 1/10.227 = 1/3.198²

µ₁ = (n·x̄/σ² + µ₀/σ₀²)/(n/σ² + 1/σ₀²) = [18(118.28)/15² + 110/7.5²]/[18/15² + 1/7.5²] = 116.77

The posterior distribution is normal with mean µ₁ = 116.77 and standard deviation σ₁ = 3.198. The mean µ₁ is a weighted average of x̄ = 118.28 and µ₀ = 110, so µ₁ is necessarily between them. As n becomes large the weight given to µ₀ declines, and µ₁ will be closer to x̄.

The 95% credible interval for µ is the middle 95% of the N(116.77, 3.198) distribution, which works out to be (110.502, 123.038). For comparison, the 95% confidence interval using x̄ = 118.28 and σ = 15 is x̄ ± 1.96σ/√n = (111.35, 125.21). Notice that the one-sample z interval must be wider: because the precisions add to give the posterior precision, the posterior precision is greater than the prior precision and it is greater than the data precision. Therefore, it is guaranteed that the posterior standard deviation σ₁ will be less than both σ₀ and σ/√n.

Both the credible interval and the confidence interval exclude 110, so we can be pretty sure that µ exceeds 110. Another way of looking at this is to calculate the posterior probability of µ being less than or equal to 110 (the Bayesian approach to hypothesis testing). Using µ₁ = 116.77 and σ₁ = 3.198, we obtain the probability .0171, supporting the claim that µ exceeds 110.

What should be done if there are no prior observations and there are no strong opinions about the prior mean µ₀? In this case the prior standard deviation σ₀ can be taken as some number much larger than σ, such as σ₀ = 1000 in our example. The result is that the prior will have essentially no effect, and the posterior distribution will be based on the data: µ₁ ≈ x̄ = 118.28 and σ₁ ≈ σ/√n = 15/√18 = 3.54. The 95% credible interval will be virtually the same as the 95% confidence interval based on the 18 observations, (111.35, 125.21), but of course the interpretation is different. ■
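All of the numbers in Example 15.9 can be reproduced with a few lines of code; the following Python sketch (illustrative only) uses the posterior-precision formulas from Section 15.1:

```python
import numpy as np
from scipy.stats import norm

iq = np.array([113, 108, 140, 113, 115, 146, 136, 107, 108, 119,
               132, 127, 118, 108, 103, 103, 122, 111])
sigma, mu0, sigma0 = 15, 110, 7.5
n, xbar = len(iq), iq.mean()

tau1 = n/sigma**2 + 1/sigma0**2                    # posterior precision = .09778
mu1 = (n*xbar/sigma**2 + mu0/sigma0**2) / tau1     # posterior mean = 116.77
sigma1 = np.sqrt(1/tau1)                           # posterior sd = 3.198

print(norm(mu1, sigma1).ppf([0.025, 0.975]))       # 95% credible interval (110.50, 123.04)
print(norm(mu1, sigma1).cdf(110))                  # P(mu <= 110 | data) = .0171
```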
Exercises: Section 15.2 (11–20)

11. Refer back to Exercise 3.
    a. Calculate the Bayes estimate of the Poisson mean parameter µ.
    b. Calculate and interpret a 95% credible interval for µ.

12. Refer back to Exercise 4.
    a. Calculate the Bayes estimate of the probability p.
    b. Calculate and interpret a 95% credible interval for p.

13. Laplace's rule of succession says that if all n Bernoulli trials have been successes, then the probability of a success on the next trial is (n + 1)/(n + 2). For the derivation, Laplace used a Beta(1, 1) prior for the parameter p.
    a. Show that, if a Beta(1, 1) prior is assigned to p and there are n successes in n trials, then the posterior mean of p is (n + 1)/(n + 2).
    b. Explain (a) in terms of total successes and failures; that is, explain the result in terms of two prior trials plus n later trials.
    c. Laplace applied his rule of succession to compute the probability that the sun will rise tomorrow using 5000 years, or n = 1,826,214 days of history in which the sun rose every day. Is Laplace's method equivalent to including two prior days when the sun rose once and failed to rise once? Criticize the answer in terms of total successes and failures.
14. In a study of 70 restaurant bills, 40 of the 70 were paid using cash. Let p denote the population proportion paying cash.
    a. Assuming a beta prior distribution for p with α₀ = 2 and β₀ = 2, obtain the posterior distribution of p.
    b. Repeat (a) with α₀ and β₀ positive and close to 0.
    c. Calculate a 95% credible interval for p using (b). Is your interval compatible with p = .5?
    d. Calculate a 95% confidence interval for p using Equation (5.3), and compare with the result of (c).
    e. Compare the interpretations of the credible interval and the confidence interval.
    f. Based on the prior in (b), test the hypothesis p ≤ .5 by using the posterior distribution to find P(p ≤ .5).

15. For the scenario of Example 15.9, assume the same normal prior distribution but suppose that the data set is just one observation x̄ = 118.28 with standard deviation σ/√n = 15/√18 = 3.5355. Use Equation (15.1) to derive the posterior distribution, and compare your answer with the result of Example 15.9.

16. Here are the IQ scores for the 15 first-grade girls from the study mentioned in Example 15.9.
    102  109  96  113  106  82  118  110  108  121  122  110  115  99  113
    Assume that the data is a random sample from a normal distribution with mean µ and σ = 15, and assign to µ the same N(110, 7.5) prior distribution used in Example 15.9.
    a. Determine the posterior distribution of µ.
    b. Calculate and interpret a 95% credible interval for µ.
    c. Add four observations with average 110 to the data and compute a 95% confidence one-sample z interval for µ using the 19 observations. Compare with the result of (b).
    d. Change the prior so the prior precision is very small but positive, and then recompute (a) and (b).
    e. Calculate a 95% confidence one-sample z interval for µ using the 15 observations and compare with the credible interval of (d).

17. If α and β are large, then the beta distribution can be approximated by the normal distribution using the beta mean and variance given in Section 4.5. This is useful in case beta distribution software is unavailable. Use the approximation to compute the credible interval in Example 15.8.

18. Two political scientists wish to forecast the proportion of votes that a certain U.S. senator will earn in her upcoming reelection contest. The first political scientist assigns a Beta(3, 2) prior to p = the true support rate for the senator, while the second assigns a Beta(6, 6) prior.
    a. Determine the expectations of both prior distributions.
    b. Which political scientist appears to feel more sure about this prior belief? How can you tell?
    c. Determine both Bayes estimates in this scenario, assuming that y out of n randomly selected voters indicate they will vote to reelect the senator.
    d. For what survey size n are the two Bayes estimates guaranteed to be within .005 of each other, no matter the value of y?

19. Consider a random sample X₁, …, Xₙ from a Poisson distribution with unknown mean µ, and assign to µ a Gamma(α₀, β₀) prior distribution.
    a. What is the prior expectation of µ?
    b. Determine the Bayes estimator µ̂.
    c. Let µ* denote the true value of µ. Show that µ̂ is a consistent estimator of µ*. [Hint: Look back at Example 15.6.]

20. Consider a random sample X₁, …, Xₙ from an exponential distribution with parameter λ, and assign to λ a Gamma(α₀, β₀) prior distribution.
    a. What is the prior expectation of λ?
    b. Determine the Bayes estimator λ̂.
    c. Let λ* denote the true value of λ. Show that λ̂ is a consistent estimator of λ*.
Appendix
Table A.1 Cumulative binomial probabilities B(x; n, p) = Σ_{y=0}^{x} b(y; n, p)
p 0.01
0.05
0.10
0.20
0.25
.951 .999 1.000 1.000 1.000
.774 .977 .999 1.000 1.000
.590 .919 .991 1.000 1.000
.328 .737 .942 .993 1.000
.237 .633 .896 .984 .999
.904 .996 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.599 .914 .988 .999 1.000 1.000 1.000 1.000 1.000 1.000
.349 .736 .930 .987 .998 1.000 1.000 1.000 1.000 1.000
.107 .376 .678 .879 .967 .994 .999 1.000 1.000 1.000
.860 .990 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.463 .829 .964 .995 .999 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.206 .549 .816 .944 .987 .998 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.035 .167 .398 .648 .836 .939 .982 .996 .999 1.000 1.000 1.000 1.000 1.000 1.000
0.30
0.40
0.50
0.60
0.70
0.75
0.80
0.90
0.95
0.99
.168 .528 .837 .969 .998
.078 .337 .683 .913 .990
.031 .188 .500 .812 .969
.010 .087 .317 .663 .922
.002 .031 .163 .472 .832
.001 .016 .104 .367 .763
.000 .007 .058 .263 .672
.000 .000 .009 .081 .410
.000 .000 .001 .023 .226
.000 .000 .000 .001 .049
.056 .244 .526 .776 .922 .980 .996 1.000 1.000 1.000
.028 .149 .383 .650 .850 .953 .989 .998 1.000 1.000
.006 .046 .167 .382 .633 .834 .945 .988 .998 1.000
.001 .011 .055 .172 .377 .623 .828 .945 .989 .999
.000 .002 .012 .055 .166 .367 .618 .833 .954 .994
.000 .000 .002 .011 .047 .150 .350 .617 .851 .972
.000 .000 .000 .004 .020 .078 .224 .474 .756 .944
.000 .000 .000 .001 .006 .033 .121 .322 .624 .893
.000 .000 .000 .000 .000 .002 .013 .070 .264 .651
.000 .000 .000 .000 .000 .000 .001 .012 .086 .401
.000 .000 .000 .000 .000 .000 .000 .000 .004 .096
.013 .080 .236 .461 .686 .852 .943 .983 .996 .999 1.000 1.000 1.000 1.000 1.000
.005 .035 .127 .297 .515 .722 .869 .950 .985 .996 .999 1.000 1.000 1.000 1.000
.000 .005 .027 .091 .217 .402 .610 .787 .905 .966 .991 .998 1.000 1.000 1.000
.000 .000 .004 .018 .059 .151 .304 .500 .696 .849 .941 .982 .996 1.000 1.000
.000 .000 .000 .002 .009 .034 .095 .213 .390 .597 .783 .909 .973 .995 1.000
.000 .000 .000 .000 .001 .004 .015 .050 .131 .278 .485 .703 .873 .965 .995
.000 .000 .000 .000 .000 .001 .004 .017 .057 .148 .314 .539 .764 .920 .987
.000 .000 .000 .000 .000 .000 .001 .004 .018 .061 .164 .352 .602 .833 .965
.000 .000 .000 .000 .000 .000 .000 .000 .000 .002 .013 .056 .184 .451 .794
.000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .005 .036 .171 .537
.000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .010 .140
a. n = 5 x
0 1 2 3 4
b. n = 10 x
0 1 2 3 4 5 6 7 8 9
c. n = 15 x
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Table A.1 (continued) B(x; n, p) = Σ_{y=0}^{x} b(y; n, p)
p 0.01
0.05
0.10
0.20
0.25
0.30
0.40
0.50
0.60
0.70
0.75
0.80
0.90
0.95
0.99
.818 .983 .999 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.358 .736 .925 .984 .997 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.122 .392 .677 .867 .957 .989 .998 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.012 .069 .206 .411 .630 .804 .913 .968 .990 .997 .999 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.003 .024 .091 .225 .415 .617 .786 .898 .959 .986 .996 .999 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.001 .008 .035 .107 .238 .416 .608 .772 .887 .952 .983 .995 .999 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.000 .001 .004 .016 .051 .126 .250 .416 .596 .755 .872 .943 .979 .994 .998 1.000 1.000 1.000 1.000 1.000
.000 .000 .000 .001 .006 .021 .058 .132 .252 .412 .588 .748 .868 .942 .979 .994 .999 1.000 1.000 1.000
.000 .000 .000 .000 .000 .002 .006 .021 .057 .128 .245 .404 .584 .750 .874 .949 .984 .996 .999 1.000
.000 .000 .000 .000 .000 .000 .000 .001 .005 .017 .048 .113 .228 .392 .584 .762 .893 .965 .992 .999
.000 .000 .000 .000 .000 .000 .000 .000 .001 .004 .014 .041 .102 .214 .383 .585 .775 .909 .976 .997
.000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .003 .010 .032 .087 .196 .370 .589 .794 .931 .988
.000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .002 .011 .043 .133 .323 .608 .878
.000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .003 .016 .075 .264 .642
.000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .017 .182
.778 .974 .998 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.277 .642 .873 .966 .993 .999 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.072 .271 .537 .764 .902 .967 .991 .998 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.004 .027 .098 .234 .421 .617 .780 .891 .953 .983 .994 .998 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.001 .007 .032 .096 .214 .378 .561 .727 .851 .929 .970 .980 .997 .999 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.000 .002 .009 .033 .090 .193 .341 .512 .677 .811 .902 .956 .983 .994 .998 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.000 .000 .000 .002 .009 .029 .074 .154 .274 .425 .586 .732 .846 .922 .966 .987 .996 .999 1.000 1.000 1.000 1.000 1.000 1.000 1.000
.000 .000 .000 .000 .000 .002 .007 .022 .054 .115 .212 .345 .500 .655 .788 .885 .946 .978 .993 .998 1.000 1.000 1.000 1.000 1.000
.000 .000 .000 .000 .000 .000 .000 .001 .004 .013 .034 .078 .154 .268 .414 .575 .726 .846 .926 .971 .991 .998 1.000 1.000 1.000
.000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .002 .006 .017 .044 .098 .189 .323 .488 .659 .807 .910 .967 .991 .998 1.000
.000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .003 .020 .030 .071 .149 .273 .439 .622 .786 .904 .968 .993 .999
.000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .002 .006 .017 .047 .109 .220 .383 .579 .766 .902 .973 .996
.000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .002 .009 .033 .098 .236 .463 .729 .928
.000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .007 .034 .127 .358 .723
.000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .002 .026 .222
d. n = 20 x
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
e. n = 25 x
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Table A.2 Cumulative Poisson probabilities P(x; µ) = Σ_{y=0}^{x} e^{−µ} µ^y / y!
x \ µ
0 1 2 3 4 5 6
.1 .905 .995 1.000
.2 .819 .982 .999 1.000
.3 .741 .963 .996 1.000
.4 .670 .938 .992 .999 1.000
.5 .607 .910 .986 .998 1.000
.6 .549 .878 .977 .997 1.000
.7 .497 .844 .966 .994 .999 1.000
.8 .449 .809 .953 .991 .999 1.000
.9 .407 .772 .937 .987 .998 1.000
1.0 .368 .736 .920 .981 .996 .999 1.000
x \ µ
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
2.0 .135 .406 .677 .857 .947 .983 .995 .999 1.000
3.0 .050 .199 .423 .647 .815 .916 .966 .988 .996 .999 1.000
4.0 .018 .092 .238 .433 .629 .785 .889 .949 .979 .992 .997 .999 1.000
5.0 .007 .040 .125 .265 .440 .616 .762 .867 .932 .968 .986 .995 .998 .999 1.000
6.0 .002 .017 .062 .151 .285 .446 .606 .744 .847 .916 .957 .980 .991 .996 .999 .999 1.000
7.0 .001 .007 .030 .082 .173 .301 .450 .599 .729 .830 .901 .947 .973 .987 .994 .998 .999 1.000
8.0 .000 .003 .014 .042 .100 .191 .313 .453 .593 .717 .816 .888 .936 .966 .983 .992 .996 .998 .999 1.000
9.0 .000 .001 .006 .021 .055 .116 .207 .324 .456 .587 .706 .803 .876 .926 .959 .978 .989 .995 .998 .999 1.000
10.0 .000 .000 .003 .010 .029 .067 .130 .220 .333 .458 .583 .697 .792 .864 .917 .951 .973 .986 .993 .997 .998 .999 1.000
15.0 .000 .000 .000 .000 .001 .003 .008 .018 .037 .070 .118 .185 .268 .363 .466 .568 .664 .749 .819 .875 .917 .947 .967 .981 .989 .994 .997 .998 .999 1.000
20.0 .000 .000 .000 .000 .000 .000 .000 .001 .002 .005 .011 .021 .039 .066 .105 .157 .221 .297 .381 .470 .559 .644 .721 .787 .843 .888 .922 .948 .966 .978 .987 .992 .995 .997 .999 .999 1.000
Table A.3 Standard normal curve areas Φ(z) = P(Z ≤ z) [figure: standard normal density curve with shaded area Φ(z) to the left of z]
z −3.4 −3.3 −3.2 −3.1 −3.0 −2.9 −2.8 −2.7 −2.6 −2.5 −2.4 −2.3 −2.2 −2.1 −2.0 −1.9 −1.8 −1.7 −1.6 −1.5 −1.4 −1.3 −1.2 −1.1 −1.0 −0.9 −0.8 −0.7 −0.6 −0.5 −0.4 −0.3 −0.2 −0.1 −0.0
.00 .0003 .0005 .0007 .0010 .0013 .0019 .0026 .0035 .0047 .0062 .0082 .0107 .0139 .0179 .0228 .0287 .0359 .0446 .0548 .0668 .0808 .0968 .1151 .1357 .1587 .1841 .2119 .2420 .2743 .3085 .3446 .3821 .4207 .4602 .5000
.01 .0003 .0005 .0007 .0009 .0013 .0018 .0025 .0034 .0045 .0060 .0080 .0104 .0136 .0174 .0222 .0281 .0352 .0436 .0537 .0655 .0793 .0951 .1131 .1335 .1562 .1814 .2090 .2389 .2709 .3050 .3409 .3783 .4168 .4562 .4960
.02 .0003 .0005 .0006 .0009 .0013 .0017 .0024 .0033 .0044 .0059 .0078 .0102 .0132 .0170 .0217 .0274 .0344 .0427 .0526 .0643 .0778 .0934 .1112 .1314 .1539 .1788 .2061 .2358 .2676 .3015 .3372 .3745 .4129 .4522 .4920
.03 .0003 .0004 .0006 .0009 .0012 .0017 .0023 .0032 .0043 .0057 .0075 .0099 .0129 .0166 .0212 .0268 .0336 .0418 .0516 .0630 .0764 .0918 .1093 .1292 .1515 .1762 .2033 .2327 .2643 .2981 .3336 .3707 .4090 .4483 .4880
.04 .0003 .0004 .0006 .0008 .0012 .0016 .0023 .0031 .0041 .0055 .0073 .0096 .0125 .0162 .0207 .0262 .0329 .0409 .0505 .0618 .0749 .0901 .1075 .1271 .1492 .1736 .2005 .2296 .2611 .2946 .3300 .3669 .4052 .4443 .4840
.05 .0003 .0004 .0006 .0008 .0011 .0016 .0022 .0030 .0040 .0054 .0071 .0094 .0122 .0158 .0202 .0256 .0322 .0401 .0495 .0606 .0735 .0885 .1056 .1251 .1469 .1711 .1977 .2266 .2578 .2912 .3264 .3632 .4013 .4404 .4801
.06 .0003 .0004 .0006 .0008 .0011 .0015 .0021 .0029 .0039 .0052 .0069 .0091 .0119 .0154 .0197 .0250 .0314 .0392 .0485 .0594 .0722 .0869 .1038 .1230 .1446 .1685 .1949 .2236 .2546 .2877 .3228 .3594 .3974 .4364 .4761
.07 .0003 .0004 .0005 .0008 .0011 .0015 .0021 .0028 .0038 .0051 .0068 .0089 .0116 .0150 .0192 .0244 .0307 .0384 .0475 .0582 .0708 .0853 .1020 .1210 .1423 .1660 .1922 .2206 .2514 .2843 .3192 .3557 .3936 .4325 .4721
.08 .0003 .0004 .0005 .0007 .0010 .0014 .0020 .0027 .0037 .0049 .0066 .0087 .0113 .0146 .0188 .0239 .0301 .0375 .0465 .0571 .0694 .0838 .1003 .1190 .1401 .1635 .1894 .2177 .2483 .2810 .3156 .3520 .3897 .4286 .4681
.09 .0002 .0003 .0005 .0007 .0010 .0014 .0019 .0026 .0036 .0048 .0064 .0084 .0110 .0143 .0183 .0233 .0294 .0367 .0455 .0559 .0681 .0823 .0985 .1170 .1379 .1611 .1867 .2148 .2451 .2776 .3121 .3482 .3859 .4247 .4641 (continued)
Table A.3 (continued)
Φ(z) = P(Z ≤ z) z 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4
.00
.01
.02
.03
.04
.05
.06
.07
.08
.09
.5000 .5398 .5793 .6179 .6554 .6915 .7257 .7580 .7881 .8159 .8413 .8643 .8849 .9032 .9192 .9332 .9452 .9554 .9641 .9713 .9772 .9821 .9861 .9893 .9918 .9938 .9953 .9965 .9974 .9981 .9987 .9990 .9993 .9995 .9997
.5040 .5438 .5832 .6217 .6591 .6950 .7291 .7611 .7910 .8186 .8438 .8665 .8869 .9049 .9207 .9345 .9463 .9564 .9649 .9719 .9778 .9826 .9864 .9896 .9920 .9940 .9955 .9966 .9975 .9982 .9987 .9991 .9993 .9995 .9997
.5080 .5478 .5871 .6255 .6628 .6985 .7324 .7642 .7939 .8212 .8461 .8686 .8888 .9066 .9222 .9357 .9474 .9573 .9656 .9726 .9783 .9830 .9868 .9898 .9922 .9941 .9956 .9967 .9976 .9982 .9987 .9991 .9994 .9995 .9997
.5120 .5517 .5910 .6293 .6664 .7019 .7357 .7673 .7967 .8238 .8485 .8708 .8907 .9082 .9236 .9370 .9484 .9582 .9664 .9732 .9788 .9834 .9871 .9901 .9925 .9943 .9957 .9968 .9977 .9983 .9988 .9991 .9994 .9996 .9997
.5160 .5557 .5948 .6331 .6700 .7054 .7389 .7704 .7995 .8264 .8508 .8729 .8925 .9099 .9251 .9382 .9495 .9591 .9671 .9738 .9793 .9838 .9875 .9904 .9927 .9945 .9959 .9969 .9977 .9984 .9988 .9992 .9994 .9996 .9997
.5199 .5596 .5987 .6368 .6736 .7088 .7422 .7734 .8023 .8289 .8531 .8749 .8944 .9115 .9265 .9394 .9505 .9599 .9678 .9744 .9798 .9842 .9878 .9906 .9929 .9946 .9960 .9970 .9978 .9984 .9989 .9992 .9994 .9996 .9997
.5239 .5636 .6026 .6406 .6772 .7123 .7454 .7764 .8051 .8315 .8554 .8770 .8962 .9131 .9278 .9406 .9515 .9608 .9686 .9750 .9803 .9846 .9881 .9909 .9931 .9948 .9961 .9971 .9979 .9985 .9989 .9992 .9994 .9996 .9997
.5279 .5675 .6064 .6443 .6808 .7157 .7486 .7794 .8078 .8340 .8577 .8790 .8980 .9147 .9292 .9418 .9525 .9616 .9693 .9756 .9808 .9850 .9884 .9911 .9932 .9949 .9962 .9972 .9979 .9985 .9989 .9992 .9995 .9996 .9997
.5319 .5714 .6103 .6480 .6844 .7190 .7517 .7823 .8106 .8365 .8599 .8810 .8997 .9162 .9306 .9429 .9535 .9625 .9699 .9761 .9812 .9854 .9887 .9913 .9934 .9951 .9963 .9973 .9980 .9986 .9990 .9993 .9995 .9996 .9997
.5359 .5753 .6141 .6517 .6879 .7224 .7549 .7852 .8133 .8389 .8621 .8830 .9015 .9177 .9319 .9441 .9545 .9633 .9706 .9767 .9817 .9857 .9890 .9916 .9936 .9952 .9964 .9974 .9981 .9986 .9990 .9993 .9995 .9997 .9998
Table A.4 The incomplete gamma function G(x; α) = ∫₀ˣ [1/Γ(α)] y^(α−1) e^(−y) dy
x \ α
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 .632 .865 .950 .982 .993 .998 .999 1.000
2 .264 .594 .801 .908 .960 .983 .993 .997 .999 1.000
3 .080 .323 .577 .762 .875 .938 .970 .986 .994 .997 .999 1.000
4 .019 .143 .353 .567 .735 .849 .918 .958 .979 .990 .995 .998 .999 1.000
5 .004 .053 .185 .371 .560 .715 .827 .900 .945 .971 .985 .992 .996 .998 .999
6 .001 .017 .084 .215 .384 .554 .699 .809 .884 .933 .962 .980 .989 .994 .997
7 .000 .005 .034 .111 .238 .394 .550 .687 .793 .870 .921 .954 .974 .986 .992
8 .000 .001 .012 .051 .133 .256 .401 .547 .676 .780 .857 .911 .946 .968 .982
9 .000 .000 .004 .021 .068 .153 .271 .407 .544 .667 .768 .845 .900 .938 .963
10 .000 .000 .001 .008 .032 .084 .170 .283 .413 .542 .659 .758 .834 .891 .930
Table A.5 Critical values χ²_{α,ν} for chi-squared distributions [figure: χ²_ν density curve with shaded area α to the right of χ²_{α,ν}]
a v 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
.995 0.000 0.010 0.072 0.207 0.412 0.676 0.989 1.344 1.735 2.156 2.603 3.074 3.565 4.075 4.600 5.142 5.697 6.265 6.843 7.434 8.033 8.643 9.260 9.886 10.519 11.160 11.807 12.461 13.120 13.787 14.457 15.134 15.814 16.501 17.191 17.887 18.584 19.289 19.994 20.706
For ν > 40, χ²_{α,ν} ≈ ν(1 − 2/(9ν) + z_α·√(2/(9ν)))³ (a brief numerical check appears just after the table).
.99 0.000 0.020 0.115 0.297 0.554 0.872 1.239 1.646 2.088 2.558 3.053 3.571 4.107 4.660 5.229 5.812 6.407 7.015 7.632 8.260 8.897 9.542 10.195 10.856 11.523 12.198 12.878 13.565 14.256 14.954 15.655 16.362 17.073 17.789 18.508 19.233 19.960 20.691 21.425 22.164
.975 0.001 0.051 0.216 0.484 0.831 1.237 1.690 2.180 2.700 3.247 3.816 4.404 5.009 5.629 6.262 6.908 7.564 8.231 8.906 9.591 10.283 10.982 11.688 12.401 13.120 13.844 14.573 15.308 16.047 16.791 17.538 18.291 19.046 19.806 20.569 21.336 22.105 22.878 23.654 24.433
.95 0.004 0.103 0.352 0.711 1.145 1.635 2.167 2.733 3.325 3.940 4.575 5.226 5.892 6.571 7.261 7.962 8.682 9.390 10.117 10.851 11.591 12.338 13.090 13.848 14.611 15.379 16.151 16.928 17.708 18.493 19.280 20.072 20.866 21.664 22.465 23.269 24.075 24.884 25.695 26.509
.90 0.016 0.211 0.584 1.064 1.610 2.204 2.833 3.490 4.168 4.865 5.578 6.304 7.041 7.790 8.547 9.312 10.085 10.865 11.651 12.443 13.240 14.042 14.848 15.659 16.473 17.292 18.114 18.939 19.768 20.599 21.433 22.271 23.110 23.952 24.796 25.643 26.492 27.343 28.196 29.050
.10 2.706 4.605 6.251 7.779 9.236 10.645 12.017 13.362 14.684 15.987 17.275 18.549 19.812 21.064 22.307 23.542 24.769 25.989 27.203 28.412 29.615 30.813 32.007 33.196 34.381 35.563 36.741 37.916 39.087 40.256 41.422 42.585 43.745 44.903 46.059 47.212 48.363 49.513 50.660 51.805
.05 3.843 5.992 7.815 9.488 11.070 12.592 14.067 15.507 16.919 18.307 19.675 21.026 22.362 23.685 24.996 26.296 27.587 28.869 30.143 31.410 32.670 33.924 35.172 36.415 37.652 38.885 40.113 41.337 42.557 43.773 44.985 46.194 47.400 48.602 49.802 50.998 52.192 53.384 54.572 55.758
.025 5.025 7.378 9.348 11.143 12.832 14.440 16.012 17.534 19.022 20.483 21.920 23.337 24.735 26.119 27.488 28.845 30.190 31.526 32.852 34.170 35.478 36.781 38.075 39.364 40.646 41.923 43.194 44.461 45.772 46.979 48.231 49.480 50.724 51.966 53.203 54.437 55.667 56.896 58.119 59.342
.01 6.637 9.210 11.344 13.277 15.085 16.812 18.474 20.090 21.665 23.209 24.724 26.217 27.687 29.141 30.577 32.000 33.408 34.805 36.190 37.566 38.930 40.289 41.637 42.980 44.313 45.642 46.962 48.278 49.586 50.892 52.190 53.486 54.774 56.061 57.340 58.619 59.891 61.162 62.426 63.691
.005 7.882 10.597 12.837 14.860 16.748 18.548 20.276 21.954 23.587 25.188 26.755 28.300 29.817 31.319 32.799 34.267 35.716 37.156 38.580 39.997 41.399 42.796 44.179 45.558 46.925 48.290 49.642 50.993 52.333 53.672 55.000 56.328 57.646 58.964 60.272 61.581 62.880 64.181 65.473 66.766
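The large-ν approximation noted with Table A.5 is easy to check against exact software values; a small Python sketch (ours, for illustration only) compares it with SciPy's chi-squared quantile function for ν = 50 and α = .05:

```python
import numpy as np
from scipy.stats import chi2, norm

nu, alpha = 50, 0.05
z = norm.ppf(1 - alpha)                                  # z_alpha, upper-tail critical value
approx = nu * (1 - 2/(9*nu) + z*np.sqrt(2/(9*nu)))**3    # approximation for chi-squared critical value
print(approx, chi2.ppf(1 - alpha, nu))                   # both approximately 67.50
```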
Table A.6 Critical values t_{α,ν} for t distributions [figure: t_ν density curve with shaded area α to the right of t_{α,ν}]
ν \ α
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32 34 36 38 40 50 60 120 ∞
.10 3.078 1.886 1.638 1.533 1.476 1.440 1.415 1.397 1.383 1.372 1.363 1.356 1.350 1.345 1.341 1.337 1.333 1.330 1.328 1.325 1.323 1.321 1.319 1.318 1.316 1.315 1.314 1.313 1.311 1.310 1.309 1.307 1.306 1.304 1.303 1.299 1.296 1.289 1.282
.05 6.314 2.920 2.353 2.132 2.015 1.943 1.895 1.860 1.833 1.812 1.796 1.782 1.771 1.761 1.753 1.746 1.740 1.734 1.729 1.725 1.721 1.717 1.714 1.711 1.708 1.706 1.703 1.701 1.699 1.697 1.694 1.691 1.688 1.686 1.684 1.676 1.671 1.658 1.645
.025 12.706 4.303 3.182 2.776 2.571 2.447 2.365 2.306 2.262 2.228 2.201 2.179 2.160 2.145 2.131 2.120 2.110 2.101 2.093 2.086 2.080 2.074 2.069 2.064 2.060 2.056 2.052 2.048 2.045 2.042 2.037 2.032 2.028 2.024 2.021 2.009 2.000 1.980 1.960
.01 31.821 6.965 4.541 3.747 3.365 3.143 2.998 2.896 2.821 2.764 2.718 2.681 2.650 2.624 2.602 2.583 2.567 2.552 2.539 2.528 2.518 2.508 2.500 2.492 2.485 2.479 2.473 2.467 2.462 2.457 2.449 2.441 2.434 2.429 2.423 2.403 2.390 2.358 2.326
.005 63.657 9.925 5.841 4.604 4.032 3.707 3.499 3.355 3.250 3.169 3.106 3.055 3.012 2.977 2.947 2.921 2.898 2.878 2.861 2.845 2.831 2.819 2.807 2.797 2.787 2.779 2.771 2.763 2.756 2.750 2.738 2.728 2.719 2.712 2.704 2.678 2.660 2.617 2.576
.001 318.31 22.327 10.215 7.173 5.893 5.208 4.785 4.501 4.297 4.144 4.025 3.930 3.852 3.787 3.733 3.686 3.646 3.610 3.579 3.552 3.527 3.505 3.485 3.467 3.450 3.435 3.421 3.408 3.396 3.385 3.365 3.348 3.333 3.319 3.307 3.262 3.232 3.160 3.090
.0005 636.62 31.599 12.924 8.610 6.869 5.959 5.408 5.041 4.781 4.587 4.437 4.318 4.221 4.140 4.073 4.015 3.965 3.922 3.883 3.850 3.819 3.792 3.767 3.745 3.725 3.707 3.690 3.674 3.659 3.646 3.622 3.601 3.582 3.566 3.551 3.496 3.460 3.373 3.291
Table A.7 t curve tail areas (entries give the area under the t_ν density curve to the right of t) [figure: t_ν density curve with shaded area to the right of t]
v t 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0
1 .500 .468 .437 .407 .379 .352 .328 .306 .285 .267 .250 .235 .221 .209 .197 .187 .178 .169 .161 .154 .148 .141 .136 .131 .126 .121 .117 .113 .109 .106 .102 .099 .096 .094 .091 .089 .086 .084 .082 .080 .078
2 .500 .465 .430 .396 .364 .333 .305 .278 .254 .232 .211 .193 .177 .162 .148 .136 .125 .116 .107 .099 .092 .085 .079 .074 .069 .065 .061 .057 .054 .051 .048 .045 .043 .040 .038 .036 .035 .033 .031 .030 .029
3 .500 .463 .427 .392 .358 .326 .295 .267 .241 .217 .196 .176 .158 .142 .128 .115 .104 .094 .085 .077 .070 .063 .058 .052 .048 .044 .040 .037 .034 .031 .029 .027 .025 .023 .021 .020 .018 .017 .016 .015 .014
4 .500 .463 .426 .390 .355 .322 .290 .261 .234 .210 .187 .167 .148 .132 .117 .104 .092 .082 .073 .065 .058 .052 .046 .041 .037 .033 .030 .027 .024 .022 .020 .018 .016 .015 .014 .012 .011 .010 .010 .009 .008
5 .500 .462 .425 .388 .353 .319 .287 .258 .230 .205 .182 .162 .142 .125 .110 .097 .085 .075 .066 .058 .051 .045 .040 .035 .031 .027 .024 .021 .019 .017 .015 .013 .012 .011 .010 .009 .008 .007 .006 .006 .005
6 .500 .462 .424 .387 .352 .317 .285 .255 .227 .201 .178 .157 .138 .121 .106 .092 .080 .070 .061 .053 .046 .040 .035 .031 .027 .023 .020 .018 .016 .014 .012 .011 .009 .008 .007 .006 .006 .005 .004 .004 .004
7 .500 .462 .424 .386 .351 .316 .284 .253 .225 .199 .175 .154 .135 .117 .102 .089 .077 .065 .057 .050 .043 .037 .032 .027 .024 .020 .018 .015 .013 .011 .010 .009 .008 .007 .006 .005 .004 .004 .003 .003 .003
8 .500 .461 .423 .386 .350 .315 .283 .252 .223 .197 .173 .152 .132 .115 .100 .086 .074 .064 .055 .047 .040 .034 .029 .025 .022 .018 .016 .014 .012 .010 .009 .007 .006 .005 .005 .004 .004 .003 .003 .002 .002
9 .500 .461 .423 .386 .349 .315 .282 .251 .222 .196 .172 .150 .130 .113 .098 .084 .072 .062 .053 .045 .038 .033 .028 .023 .020 .017 .014 .012 .010 .009 .007 .006 .005 .005 .004 .003 .003 .002 .002 .002 .002
10 .500 .461 .423 .385 .349 .314 .281 .250 .221 .195 .170 .149 .129 .111 .096 .082 .070 .060 .051 .043 .037 .031 .026 .022 .019 .016 .013 .011 .009 .008 .007 .006 .005 .004 .003 .003 .002 .002 .002 .001 .001
11 .500 .461 .423 .385 .348 .313 .280 .249 .220 .194 .169 .147 .128 .110 .095 .081 .069 .059 .050 .042 .035 .030 .025 .021 .018 .015 .012 .010 .009 .007 .006 .005 .004 .004 .003 .002 .002 .002 .001 .001 .001
12 .500 .461 .422 .385 .348 .313 .280 .249 .220 .193 .169 .146 .127 .109 .093 .080 .068 .057 .049 .041 .034 .029 .024 .020 .017 .014 .012 .010 .008 .007 .006 .005 .004 .003 .003 .002 .002 .002 .001 .001 .001
13 .500 .461 .422 .384 .348 .313 .279 .248 .219 .192 .168 .146 .126 .108 .092 .079 .067 .056 .048 .040 .033 .028 .023 .019 .016 .013 .011 .009 .008 .006 .005 .004 .003 .003 .002 .002 .002 .001 .001 .001 .001
14 .500 .461 .422 .384 .347 .312 .279 .247 .218 .191 .167 .144 .124 .107 .091 .077 .065 .055 .046 .038 .032 .027 .022 .018 .015 .012 .010 .008 .007 .005 .004 .004 .003 .002 .002 .002 .001 .001 .001 .001 .001
15 .500 .461 .422 .384 .347 .312 .279 .247 .218 .191 .167 .144 .124 .107 .091 .077 .065 .055 .046 .038 .032 .027 .022 .018 .015 .012 .010 .008 .007 .005 .004 .004 .003 .002 .002 .002 .001 .001 .001 .001 .001
16 .500 .461 .422 .384 .347 .312 .278 .247 .218 .191 .166 .144 .124 .106 .090 .077 .065 .054 .045 .038 .031 .026 .021 .018 .014 .012 .010 .008 .006 .005 .004 .003 .003 .002 .002 .001 .001 .001 .001 .001 .001
17 .500 .461 .422 .384 .347 .312 .278 .247 .217 .190 .166 .143 .123 .105 .090 .076 .064 .054 .045 .037 .031 .025 .021 .017 .014 .011 .009 .008 .006 .005 .004 .003 .003 .002 .002 .001 .001 .001 .001 .001 .000
18 .500 .461 .422 .384 .347 .312 .278 .246 .217 .190 .165 .143 .123 .105 .089 .075 .064 .053 .044 .037 .030 .025 .021 .017 .014 .011 .009 .007 .006 .005 .004 .003 .002 .002 .002 .001 .001 .001 .001 .001 .000
Table A.7 (continued) v t 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0
19 .500 .461 .422 .384 .347 .311 .278 .246 .217 .190 .165 .143 .122 .105 .089 .075 .063 .053 .044 .036 .030 .025 .020 .016 .013 .011 .009 .007 .006 .005 .004 .003 .002 .002 .002 .001 .001 .001 .001 .000 .000
20 .500 .461 .422 .384 .347 .311 .278 .246 .217 .189 .165 .142 .122 .104 .089 .075 .063 .052 .043 .036 .030 .024 .020 .016 .013 .011 .009 .007 .006 .004 .004 .003 .002 .002 .001 .001 .001 .001 .001 .000 .000
21 .500 .461 .422 .384 .347 .311 .278 .246 .216 .189 .164 .142 .122 .104 .088 .074 .062 .052 .043 .036 .029 .024 .020 .016 .013 .010 .008 .007 .005 .004 .003 .003 .002 .002 .001 .001 .001 .001 .001 .000 .000
22 .500 .461 .422 .383 .347 .311 .277 .246 .216 .189 .164 .142 .121 .104 .088 .074 .062 .052 .043 .035 .029 .024 .019 .016 .013 .010 .008 .007 .005 .004 .003 .003 .002 .002 .001 .001 .001 .001 .000 .000 .000
23 .500 .461 .422 .383 .346 .311 .277 .245 .216 .189 .164 .141 .121 .103 .087 .074 .062 .051 .042 .035 .029 .023 .019 .015 .012 .010 .008 .006 .005 .004 .003 .003 .002 .002 .001 .001 .001 .001 .000 .000 .000
24 .500 .461 .422 .383 .346 .311 .277 .245 .216 .189 .164 .141 .121 .103 .087 .073 .061 .051 .042 .035 .028 .023 .019 .015 .012 .010 .008 .006 .005 .004 .003 .002 .002 .001 .001 .001 .001 .001 .000 .000 .000
25 .500 .461 .422 .383 .346 .311 .277 .245 .216 .188 .163 .141 .121 .103 .087 .073 .061 .051 .042 .035 .028 .023 .019 .015 .012 .010 .008 .006 .005 .004 .003 .002 .002 .001 .001 .001 .001 .001 .000 .000 .000
26 .500 .461 .422 .383 .346 .311 .277 .245 .215 .188 .163 .141 .120 .103 .087 .073 .061 .051 .042 .034 .028 .023 .018 .015 .012 .010 .008 .006 .005 .004 .003 .002 .002 .001 .001 .001 .001 .001 .000 .000 .000
27 .500 .461 .421 .383 .346 .311 .277 .245 .215 .188 .163 .141 .120 .102 .086 .073 .061 .050 .042 .034 .028 .023 .018 .015 .012 .009 .007 .006 .005 .004 .003 .002 .002 .001 .001 .001 .001 .000 .000 .000 .000
28 .500 .461 .421 .383 .346 .310 .277 .245 .215 .188 .163 .140 .120 .102 .086 .072 .060 .050 .041 .034 .028 .022 .018 .015 .012 .009 .007 .006 .005 .004 .003 .002 .002 .001 .001 .001 .001 .000 .000 .000 .000
29 .500 .461 .421 .383 .346 .310 .277 .245 .215 .188 .163 .140 .120 .102 .086 .072 .060 .050 .041 .034 .027 .022 .018 .014 .012 .009 .007 .006 .005 .004 .003 .002 .002 .001 .001 .001 .001 .000 .000 .000 .000
30 .500 .461 .421 .383 .346 .310 .277 .245 .215 .188 .163 .140 .120 .102 .086 .072 .060 .050 .041 .034 .027 .022 .018 .014 .011 .009 .007 .006 .004 .003 .003 .002 .002 .001 .001 .001 .001 .000 .000 .000 .000
35 .500 .460 .421 .383 .346 .310 .276 .244 .215 .187 .162 .139 .119 .101 .085 .071 .059 .049 .040 .033 .027 .022 .017 .014 .011 .009 .007 .005 .004 .003 .002 .002 .001 .001 .001 .001 .000 .000 .000 .000 .000
40 .500 .460 .421 .383 .346 .310 .276 .244 .214 .187 .162 .139 .119 .101 .085 .071 .059 .048 .040 .032 .026 .021 .017 .013 .011 .008 .007 .005 .004 .003 .002 .002 .001 .001 .001 .001 .000 .000 .000 .000 .000
60 .500 .460 .421 .383 .345 .309 .275 .243 .213 .186 .161 .138 .117 .099 .083 .069 .057 .047 .038 .031 .025 .020 .016 .012 .010 .008 .006 .004 .003 .003 .002 .001 .001 .001 .001 .000 .000 .000 .000 .000 .000
120 .500 .460 .421 .382 .345 .309 .275 .243 .213 .185 .160 .137 .116 .098 .082 .068 .056 .046 .037 .030 .024 .019 .015 .012 .009 .007 .005 .004 .003 .002 .002 .001 .001 .001 .000 .000 .000 .000 .000 .000 .000
∞ (= z) .500 .460 .421 .382 .345 .309 .274 .242 .212 .184 .159 .136 .115 .097 .081 .067 .055 .045 .036 .029 .023 .018 .014 .011 .008 .006 .005 .003 .003 .002 .001 .001 .001 .000 .000 .000 .000 .000 .000 .000 .000
Table A.8 Critical values for F distributions
ν₁ = numerator df, ν₂ = denominator df, α = upper-tail area
1
2
3
4
5
6
7
8
9
1
.100 .050 .010 .001
39.86 161.45 4052.2 405284
49.50 199.50 4999.5 500000
53.59 215.71 5403.4 540379
55.83 224.58 5624.6 562500
57.24 230.16 5763.6 576405
58.20 233.99 5859.0 585937
58.91 236.77 5928.4 592873
59.44 238.88 5981.1 598144
59.86 240.54 6022.5 602284
2
.100 .050 .010 .001
8.53 18.51 98.50 998.50
9.00 19.00 99.00 999.00
9.16 19.16 99.17 999.17
9.24 19.25 99.25 999.25
9.29 19.30 99.30 999.30
9.33 19.33 99.33 999.33
9.35 19.35 99.36 999.36
9.37 19.37 99.37 999.37
9.38 19.38 99.39 999.39
3
.100 .050 .010 .001
5.54 10.13 34.12 167.03
5.46 9.55 30.82 148.50
5.39 9.28 29.46 141.11
5.34 9.12 28.71 137.10
5.31 9.01 28.24 134.58
5.28 8.94 27.91 132.85
5.27 8.89 27.67 131.58
5.25 8.85 27.49 130.62
5.24 8.81 27.35 129.86
4
.100 .050 .010 .001
4.54 7.71 21.20 74.14
4.32 6.94 18.00 61.25
4.19 6.59 16.69 56.18
4.11 6.39 15.98 53.44
4.05 6.26 15.52 51.71
4.01 6.16 15.21 50.53
3.98 6.09 14.98 49.66
3.95 6.04 14.80 49.00
3.94 6.00 14.66 48.47
5
.100 .050 .010 .001
4.06 6.61 16.26 47.18
3.78 5.79 13.27 37.12
3.62 5.41 12.06 33.20
3.52 5.19 11.39 31.09
3.45 5.05 10.97 29.75
3.40 4.95 10.67 28.83
3.37 4.88 10.46 28.16
3.34 4.82 10.29 27.65
3.32 4.77 10.16 27.24
6
.100 .050 .010 .001
3.78 5.99 13.75 35.51
3.46 5.14 10.92 27.00
3.29 4.76 9.78 23.70
3.18 4.53 9.15 21.92
3.11 4.39 8.75 20.80
3.05 4.28 8.47 20.03
3.01 4.21 8.26 19.46
2.98 4.15 8.10 19.03
2.96 4.10 7.98 18.69
7
.100 .050 .010 .001
3.59 5.59 12.25 29.25
3.26 4.74 9.55 21.69
3.07 4.35 8.45 18.77
2.96 4.12 7.85 17.20
2.88 3.97 7.46 16.21
2.83 3.87 7.19 15.52
2.78 3.79 6.99 15.02
2.75 3.73 6.84 14.63
2.72 3.68 6.72 14.33
8
.100 .050 .010 .001
3.46 5.32 11.26 25.41
3.11 4.46 8.65 18.49
2.92 4.07 7.59 15.83
2.81 3.84 7.01 14.39
2.73 3.69 6.63 13.48
2.67 3.58 6.37 12.86
2.62 3.50 6.18 12.40
2.59 3.44 6.03 12.05
2.56 3.39 5.91 11.77
9
.100 .050 .010 .001
3.36 5.12 10.56 22.86
3.01 4.26 8.02 16.39
2.81 3.86 6.99 13.90
2.69 3.63 6.42 12.56
2.61 3.48 6.06 11.71
2.55 3.37 5.80 11.13
2.51 3.29 5.61 10.70
2.47 3.23 5.47 10.37
2.44 3.18 5.35 10.11
10
.100 .050 .010 .001
3.29 4.96 10.04 21.04
2.92 4.10 7.56 14.91
2.73 3.71 6.55 12.55
2.61 3.48 5.99 11.28
2.52 3.33 5.64 10.48
2.46 3.22 5.39 9.93
2.41 3.14 5.20 9.52
2.38 3.07 5.06 9.20
2.35 3.02 4.94 8.96
11
.100 .050 .010 .001
3.23 4.84 9.65 19.69
2.86 3.98 7.21 13.81
2.66 3.59 6.22 11.56
2.54 3.36 5.67 10.35
2.45 3.20 5.32 9.58
2.39 3.09 5.07 9.05
2.34 3.01 4.89 8.66
2.30 2.95 4.74 8.35
2.27 2.90 4.63 8.12
12
.100 .050 .010 .001
3.18 4.75 9.33 18.64
2.81 3.89 6.93 12.97
2.61 3.49 5.95 10.80
2.48 3.26 5.41 9.63
2.39 3.11 5.06 8.89
2.33 3.00 4.82 8.38
2.28 2.91 4.64 8.00
2.24 2.85 4.50 7.71
2.21 2.80 4.39 7.48
Table A.8 (continued) 10
12
15
20
25
30
40
50
60
60.19 241.88 6055.8 605621
60.71 243.91 6106.3 610668
61.22 245.95 6157.3 615764
61.74 248.01 6208.7 620908
62.05 249.26 6239.8 624017
62.26 250.10 6260.6 626099
62.53 251.14 6286.8 628712
62.69 251.77 6302.5 630285
62.79 252.20 6313.0 631337
120 63.06 253.25 6339.4 633972
1000 63.30 254.19 6362.7 636301
9.39 19.40 99.40 999.40
9.41 19.41 99.42 999.42
9.42 19.43 99.43 999.43
9.44 19.45 99.45 999.45
9.45 19.46 99.46 999.46
9.46 19.46 99.47 999.47
9.47 19.47 99.47 999.47
9.47 19.48 99.48 999.48
9.47 19.48 99.48 999.48
9.48 19.49 99.49 999.49
9.49 19.49 99.50 999.50
5.23 8.79 27.23 129.25
5.22 8.74 27.05 128.32
5.20 8.70 26.87 127.37
5.18 8.66 26.69 126.42
5.17 8.63 26.58 125.84
5.17 8.62 26.50 125.45
5.16 8.59 26.41 124.96
5.15 8.58 26.35 124.66
5.15 8.57 26.32 124.47
5.14 8.55 26.22 123.97
5.13 8.53 26.14 123.53
3.92 5.96 14.55 48.05
3.90 5.91 14.37 47.41
3.87 5.86 14.20 46.76
3.84 5.80 14.02 46.10
3.83 5.77 13.91 45.70
3.82 5.75 13.84 45.43
3.80 5.72 13.75 45.09
3.80 5.70 13.69 44.88
3.79 5.69 13.65 44.75
3.78 5.66 13.56 44.40
3.76 5.63 13.47 44.09
3.30 4.74 10.05 26.92
3.27 4.68 9.89 26.42
3.24 4.62 9.72 25.91
3.21 4.56 9.55 25.39
3.19 4.52 9.45 25.08
3.17 4.50 9.38 24.87
3.16 4.46 9.29 24.60
3.15 4.44 9.24 24.44
3.14 4.43 9.20 24.33
3.12 4.40 9.11 24.06
3.11 4.37 9.03 23.82
2.94 4.06 7.87 18.41
2.90 4.00 7.72 17.99
2.87 3.94 7.56 17.56
2.84 3.87 7.40 17.12
2.81 3.83 7.30 16.85
2.80 3.81 7.23 16.67
2.78 3.77 7.14 16.44
2.77 3.75 7.09 16.31
2.76 3.74 7.06 16.21
2.74 3.70 6.97 15.98
2.72 3.67 6.89 15.77
2.70 3.64 6.62 14.08
2.67 3.57 6.47 13.71
2.63 3.51 6.31 13.32
2.59 3.44 6.16 12.93
2.57 3.40 6.06 12.69
2.56 3.38 5.99 12.53
2.54 3.34 5.91 12.33
2.52 3.32 5.86 12.20
2.51 3.30 5.82 12.12
2.49 3.27 5.74 11.91
2.47 3.23 5.66 11.72
2.54 3.35 5.81 11.54
2.50 3.28 5.67 11.19
2.46 3.22 5.52 10.84
2.42 3.15 5.36 10.48
2.40 3.11 5.26 10.26
2.38 3.08 5.20 10.11
2.36 3.04 5.12 9.92
2.35 3.02 5.07 9.80
2.34 3.01 5.03 9.73
2.32 2.97 4.95 9.53
2.30 2.93 4.87 9.36
2.42 3.14 5.26 9.89
2.38 3.07 5.11 9.57
2.34 3.01 4.96 9.24
2.30 2.94 4.81 8.90
2.27 2.89 4.71 8.69
2.25 2.86 4.65 8.55
2.23 2.83 4.57 8.37
2.22 2.80 4.52 8.26
2.21 2.79 4.48 8.19
2.18 2.75 4.40 8.00
2.16 2.71 4.32 7.84
2.32 2.98 4.85 8.75
2.28 2.91 4.71 8.45
2.24 2.85 4.56 8.13
2.20 2.77 4.41 7.80
2.17 2.73 4.31 7.60
2.16 2.70 4.25 7.47
2.13 2.66 4.17 7.30
2.12 2.64 4.12 7.19
2.11 2.62 4.08 7.12
2.08 2.58 4.00 6.94
2.06 2.54 3.92 6.78
2.25 2.85 4.54 7.92
2.21 2.79 4.40 7.63
2.17 2.72 4.25 7.32
2.12 2.65 4.10 7.01
2.10 2.60 4.01 6.81
2.08 2.57 3.94 6.68
2.05 2.53 3.86 6.52
2.04 2.51 3.81 6.42
2.03 2.49 3.78 6.35
2.00 2.45 3.69 6.18
1.98 2.41 3.61 6.02
2.19 2.75 4.30 7.29
2.15 2.69 4.16 7.00
2.10 2.62 4.01 6.71
2.06 2.54 3.86 6.40
2.03 2.50 3.76 6.22
2.01 2.47 3.70 6.09
1.99 2.43 3.62 5.93
1.97 2.40 3.57 5.83
1.96 2.38 3.54 5.76
1.93 2.34 3.45 5.59
1.91 2.30 3.37 5.44
Table A.8 (continued) ν₁ = numerator df, ν₂ = denominator df, α = upper-tail area
1
2
3
4
5
6
7
8
9
13
.100 .050 .010 .001
3.14 4.67 9.07 17.82
2.76 3.81 6.70 12.31
2.56 3.41 5.74 10.21
2.43 3.18 5.21 9.07
2.35 3.03 4.86 8.35
2.28 2.92 4.62 7.86
2.23 2.83 4.44 7.49
2.20 2.77 4.30 7.21
2.16 2.71 4.19 6.98
14
.100 .050 .010 .001
3.10 4.60 8.86 17.14
2.73 3.74 6.51 11.78
2.52 3.34 5.56 9.73
2.39 3.11 5.04 8.62
2.31 2.96 4.69 7.92
2.24 2.85 4.46 7.44
2.19 2.76 4.28 7.08
2.15 2.70 4.14 6.80
2.12 2.65 4.03 6.58
15
.100 .050 .010 .001
3.07 4.54 8.68 16.59
2.70 3.68 6.36 11.34
2.49 3.29 5.42 9.34
2.36 3.06 4.89 8.25
2.27 2.90 4.56 7.57
2.21 2.79 4.32 7.09
2.16 2.71 4.14 6.74
2.12 2.64 4.00 6.47
2.09 2.59 3.89 6.26
16
.100 .050 .010 .001
3.05 4.49 8.53 16.12
2.67 3.63 6.23 10.97
2.46 3.24 5.29 9.01
2.33 3.01 4.77 7.94
2.24 2.85 4.44 7.27
2.18 2.74 4.20 6.80
2.13 2.66 4.03 6.46
2.09 2.59 3.89 6.19
2.06 2.54 3.78 5.98
17
.100 .050 .010 .001
3.03 4.45 8.40 15.72
2.64 3.59 6.11 10.66
2.44 3.20 5.19 8.73
2.31 2.96 4.67 7.68
2.22 2.81 4.34 7.02
2.15 2.70 4.10 6.56
2.10 2.61 3.93 6.22
2.06 2.55 3.79 5.96
2.03 2.49 3.68 5.75
18
.100 .050 .010 .001
3.01 4.41 8.29 15.38
2.62 3.55 6.01 10.39
2.42 3.16 5.09 8.49
2.29 2.93 4.58 7.46
2.20 2.77 4.25 6.81
2.13 2.66 4.01 6.35
2.08 2.58 3.84 6.02
2.04 2.51 3.71 5.76
2.00 2.46 3.60 5.56
19
.100 .050 .010 .001
2.99 4.38 8.18 15.08
2.61 3.52 5.93 10.16
2.40 3.13 5.01 8.28
2.27 2.90 4.50 7.27
2.18 2.74 4.17 6.62
2.11 2.63 3.94 6.18
2.06 2.54 3.77 5.85
2.02 2.48 3.63 5.59
1.98 2.42 3.52 5.39
20
.100 .050 .010 .001
2.97 4.35 8.10 14.82
2.59 3.49 5.85 9.95
2.38 3.10 4.94 8.10
2.25 2.87 4.43 7.10
2.16 2.71 4.10 6.46
2.09 2.60 3.87 6.02
2.04 2.51 3.70 5.69
2.00 2.45 3.56 5.44
1.96 2.39 3.46 5.24
21
.100 .050 .010 .001
2.96 4.32 8.02 14.59
2.57 3.47 5.78 9.77
2.36 3.07 4.87 7.94
2.23 2.84 4.37 6.95
2.14 2.68 4.04 6.32
2.08 2.57 3.81 5.88
2.02 2.49 3.64 5.56
1.98 2.42 3.51 5.31
1.95 2.37 3.40 5.11
22
.100 .050 .010 .001
2.95 4.30 7.95 14.38
2.56 3.44 5.72 9.61
2.35 3.05 4.82 7.80
2.22 2.82 4.31 6.81
2.13 2.66 3.99 6.19
2.06 2.55 3.76 5.76
2.01 2.46 3.59 5.44
1.97 2.40 3.45 5.19
1.93 2.34 3.35 4.99
23
.100 .050 .010 .001
2.94 4.28 7.88 14.20
2.55 3.42 5.66 9.47
2.34 3.03 4.76 7.67
2.21 2.80 4.26 6.70
2.11 2.64 3.94 6.08
2.05 2.53 3.71 5.65
1.99 2.44 3.54 5.33
1.95 2.37 3.41 5.09
1.92 2.32 3.30 4.89
24
.100 .050 .010 .001
2.93 4.26 7.82 14.03
2.54 3.40 5.61 9.34
2.33 3.01 4.72 7.55
2.19 2.78 4.22 6.59
2.10 2.62 3.90 5.98
2.04 2.51 3.67 5.55
1.98 2.42 3.50 5.23
1.94 2.36 3.36 4.99
1.91 2.30 3.26 4.80
Table A.8 (continued) v1 = numerator df 10
12
15
20
25
30
40
50
60
120
1000
2.14 2.67 4.10 6.80
2.10 2.60 3.96 6.52
2.05 2.53 3.82 6.23
2.01 2.46 3.66 5.93
1.98 2.41 3.57 5.75
1.96 2.38 3.51 5.63
1.93 2.34 3.43 5.47
1.92 2.31 3.38 5.37
1.90 2.30 3.34 5.30
1.88 2.25 3.25 5.14
1.85 2.21 3.18 4.99
2.10 2.60 3.94 6.40
2.05 2.53 3.80 6.13
2.01 2.46 3.66 5.85
1.96 2.39 3.51 5.56
1.93 2.34 3.41 5.38
1.91 2.31 3.35 5.25
1.89 2.27 3.27 5.10
1.87 2.24 3.22 5.00
1.86 2.22 3.18 4.94
1.83 2.18 3.09 4.77
1.80 2.14 3.02 4.62
2.06 2.54 3.80 6.08
2.02 2.48 3.67 5.81
1.97 2.40 3.52 5.54
1.92 2.33 3.37 5.25
1.89 2.28 3.28 5.07
1.87 2.25 3.21 4.95
1.85 2.20 3.13 4.80
1.83 2.18 3.08 4.70
1.82 2.16 3.05 4.64
1.79 2.11 2.96 4.47
1.76 2.07 2.88 4.33
2.03 2.49 3.69 5.81
1.99 2.42 3.55 5.55
1.94 2.35 3.41 5.27
1.89 2.28 3.26 4.99
1.86 2.23 3.16 4.82
1.84 2.19 3.10 4.70
1.81 2.15 3.02 4.54
1.79 2.12 2.97 4.45
1.78 2.11 2.93 4.39
1.75 2.06 2.84 4.23
1.72 2.02 2.76 4.08
2.00 2.45 3.59 5.58
1.96 2.38 3.46 5.32
1.91 2.31 3.31 5.05
1.86 2.23 3.16 4.78
1.83 2.18 3.07 4.60
1.81 2.15 3.00 4.48
1.78 2.10 2.92 4.33
1.76 2.08 2.87 4.24
1.75 2.06 2.83 4.18
1.72 2.01 2.75 4.02
1.69 1.97 2.66 3.87
1.98 2.41 3.51 5.39
1.93 2.34 3.37 5.13
1.89 2.27 3.23 4.87
1.84 2.19 3.08 4.59
1.80 2.14 2.98 4.42
1.78 2.11 2.92 4.30
1.75 2.06 2.84 4.15
1.74 2.04 2.78 4.06
1.72 2.02 2.75 4.00
1.69 1.97 2.66 3.84
1.66 1.92 2.58 3.69
1.96 2.38 3.43 5.22
1.91 2.31 3.30 4.97
1.86 2.23 3.15 4.70
1.81 2.16 3.00 4.43
1.78 2.11 2.91 4.26
1.76 2.07 2.84 4.14
1.73 2.03 2.76 3.99
1.71 2.00 2.71 3.90
1.70 1.98 2.67 3.84
1.67 1.93 2.58 3.68
1.64 1.88 2.50 3.53
1.94 2.35 3.37 5.08
1.89 2.28 3.23 4.82
1.84 2.20 3.09 4.56
1.79 2.12 2.94 4.29
1.76 2.07 2.84 4.12
1.74 2.04 2.78 4.00
1.71 1.99 2.69 3.86
1.69 1.97 2.64 3.77
1.68 1.95 2.61 3.70
1.64 1.90 2.52 3.54
1.61 1.85 2.43 3.40
1.92 2.32 3.31 4.95
1.87 2.25 3.17 4.70
1.83 2.18 3.03 4.44
1.78 2.10 2.88 4.17
1.74 2.05 2.79 4.00
1.72 2.01 2.72 3.88
1.69 1.96 2.64 3.74
1.67 1.94 2.58 3.64
1.66 1.92 2.55 3.58
1.62 1.87 2.46 3.42
1.59 1.82 2.37 3.28
1.90 2.30 3.26 4.83
1.86 2.23 3.12 4.58
1.81 2.15 2.98 4.33
1.76 2.07 2.83 4.06
1.73 2.02 2.73 3.89
1.70 1.98 2.67 3.78
1.67 1.94 2.58 3.63
1.65 1.91 2.53 3.54
1.64 1.89 2.50 3.48
1.60 1.84 2.40 3.32
1.57 1.79 2.32 3.17
1.89 2.27 3.21 4.73
1.84 2.20 3.07 4.48
1.80 2.13 2.93 4.23
1.74 2.05 2.78 3.96
1.71 2.00 2.69 3.79
1.69 1.96 2.62 3.68
1.66 1.91 2.54 3.53
1.64 1.88 2.48 3.44
1.62 1.86 2.45 3.38
1.59 1.81 2.35 3.22
1.55 1.76 2.27 3.08
1.88 2.25 3.17 4.64
1.83 2.18 3.03 4.39
1.78 2.11 2.89 4.14
1.73 2.03 2.74 3.87
1.70 1.97 2.64 3.71
1.67 1.94 2.58 3.59
1.64 1.89 2.49 3.45
1.62 1.86 2.44 3.36
1.61 1.84 2.40 3.29
1.57 1.79 2.31 3.14
1.54 1.74 2.22 2.99
Table A.8 (continued) ν₁ = numerator df, ν₂ = denominator df, α = upper-tail area
1
2
3
4
5
6
7
8
9
25
.100 .050 .010 .001
2.92 4.24 7.77 13.88
2.53 3.39 5.57 9.22
2.32 2.99 4.68 7.45
2.18 2.76 4.18 6.49
2.09 2.60 3.85 5.89
2.02 2.49 3.63 5.46
1.97 2.40 3.46 5.15
1.93 2.34 3.32 4.91
1.89 2.28 3.22 4.71
26
.100 .050 .010 .001
2.91 4.23 7.72 13.74
2.52 3.37 5.53 9.12
2.31 2.98 4.64 7.36
2.17 2.74 4.14 6.41
2.08 2.59 3.82 5.80
2.01 2.47 3.59 5.38
1.96 2.39 3.42 5.07
1.92 2.32 3.29 4.83
1.88 2.27 3.18 4.64
27
.100 .050 .010 .001
2.90 4.21 7.68 13.61
2.51 3.35 5.49 9.02
2.30 2.96 4.60 7.27
2.17 2.73 4.11 6.33
2.07 2.57 3.78 5.73
2.00 2.46 3.56 5.31
1.95 2.37 3.39 5.00
1.91 2.31 3.26 4.76
1.87 2.25 3.15 4.57
28
.100 .050 .010 .001
2.89 4.20 7.64 13.50
2.50 3.34 5.45 8.93
2.29 2.95 4.57 7.19
2.16 2.71 4.07 6.25
2.06 2.56 3.75 5.66
2.00 2.45 3.53 5.24
1.94 2.36 3.36 4.93
1.90 2.29 3.23 4.69
1.87 2.24 3.12 4.50
29
.100 .050 .010 .001
2.89 4.18 7.60 13.39
2.50 3.33 5.42 8.85
2.28 2.93 4.54 7.12
2.15 2.70 4.04 6.19
2.06 2.55 3.73 5.59
1.99 2.43 3.50 5.18
1.93 2.35 3.33 4.87
1.89 2.28 3.20 4.64
1.86 2.22 3.09 4.45
30
.100 .050 .010 .001
2.88 4.17 7.56 13.29
2.49 3.32 5.39 8.77
2.28 2.92 4.51 7.05
2.14 2.69 4.02 6.12
2.05 2.53 3.70 5.53
1.98 2.42 3.47 5.12
1.93 2.33 3.30 4.82
1.88 2.27 3.17 4.58
1.85 2.21 3.07 4.39
40
.100 .050 .010 .001
2.84 4.08 7.31 12.61
2.44 3.23 5.18 8.25
2.23 2.84 4.31 6.59
2.09 2.61 3.83 5.70
2.00 2.45 3.51 5.13
1.93 2.34 3.29 4.73
1.87 2.25 3.12 4.44
1.83 2.18 2.99 4.21
1.79 2.12 2.89 4.02
50
.100 .050 .010 .001
2.81 4.03 7.17 12.22
2.41 3.18 5.06 7.96
2.20 2.79 4.20 6.34
2.06 2.56 3.72 5.46
1.97 2.40 3.41 4.90
1.90 2.29 3.19 4.51
1.84 2.20 3.02 4.22
1.80 2.13 2.89 4.00
1.76 2.07 2.78 3.82
60
.100 .050 .010 .001
2.79 4.00 7.08 11.97
2.39 3.15 4.98 7.77
2.18 2.76 4.13 6.17
2.04 2.53 3.65 5.31
1.95 2.37 3.34 4.76
1.87 2.25 3.12 4.37
1.82 2.17 2.95 4.09
1.77 2.10 2.82 3.86
1.74 2.04 2.72 3.69
100
.100 .050 .010 .001
2.76 3.94 6.90 11.50
2.36 3.09 4.82 7.41
2.14 2.70 3.98 5.86
2.00 2.46 3.51 5.02
1.91 2.31 3.21 4.48
1.83 2.19 2.99 4.11
1.78 2.10 2.82 3.83
1.73 2.03 2.69 3.61
1.69 1.97 2.59 3.44
200
.100 .050 .010 .001
2.73 3.89 6.76 11.15
2.33 3.04 4.71 7.15
2.11 2.65 3.88 5.63
1.97 2.42 3.41 4.81
1.88 2.26 3.11 4.29
1.80 2.14 2.89 3.92
1.75 2.06 2.73 3.65
1.70 1.98 2.60 3.43
1.66 1.93 2.50 3.26
1000
.100 .050 .010 .001
2.71 3.85 6.66 10.89
2.31 3.00 4.63 6.96
2.09 2.61 3.80 5.46
1.95 2.38 3.34 4.65
1.85 2.22 3.04 4.14
1.78 2.11 2.82 3.78
1.72 2.02 2.66 3.51
1.68 1.95 2.53 3.30
1.64 1.89 2.43 3.13
(continued)
Table A.8 (continued)   ν1 = numerator df: 10, 12, 15, 20, 25, 30, 40, 50, 60, 120, 1000
1.87 2.24 3.13 4.56
1.82 2.16 2.99 4.31
1.77 2.09 2.85 4.06
1.72 2.01 2.70 3.79
1.68 1.96 2.60 3.63
1.66 1.92 2.54 3.52
1.63 1.87 2.45 3.37
1.61 1.84 2.40 3.28
1.59 1.82 2.36 3.22
1.56 1.77 2.27 3.06
1.52 1.72 2.18 2.91
1.86 2.22 3.09 4.48
1.81 2.15 2.96 4.24
1.76 2.07 2.81 3.99
1.71 1.99 2.66 3.72
1.67 1.94 2.57 3.56
1.65 1.90 2.50 3.44
1.61 1.85 2.42 3.30
1.59 1.82 2.36 3.21
1.58 1.80 2.33 3.15
1.54 1.75 2.23 2.99
1.51 1.70 2.14 2.84
1.85 2.20 3.06 4.41
1.80 2.13 2.93 4.17
1.75 2.06 2.78 3.92
1.70 1.97 2.63 3.66
1.66 1.92 2.54 3.49
1.64 1.88 2.47 3.38
1.60 1.84 2.38 3.23
1.58 1.81 2.33 3.14
1.57 1.79 2.29 3.08
1.53 1.73 2.20 2.92
1.50 1.68 2.11 2.78
1.84 2.19 3.03 4.35
1.79 2.12 2.90 4.11
1.74 2.04 2.75 3.86
1.69 1.96 2.60 3.60
1.65 1.91 2.51 3.43
1.63 1.87 2.44 3.32
1.59 1.82 2.35 3.18
1.57 1.79 2.30 3.09
1.56 1.77 2.26 3.02
1.52 1.71 2.17 2.86
1.48 1.66 2.08 2.72
1.83 2.18 3.00 4.29
1.78 2.10 2.87 4.05
1.73 2.03 2.73 3.80
1.68 1.94 2.57 3.54
1.64 1.89 2.48 3.38
1.62 1.85 2.41 3.27
1.58 1.81 2.33 3.12
1.56 1.77 2.27 3.03
1.55 1.75 2.23 2.97
1.51 1.70 2.14 2.81
1.47 1.65 2.05 2.66
1.82 2.16 2.98 4.24
1.77 2.09 2.84 4.00
1.72 2.01 2.70 3.75
1.67 1.93 2.55 3.49
1.63 1.88 2.45 3.33
1.61 1.84 2.39 3.22
1.57 1.79 2.30 3.07
1.55 1.76 2.25 2.98
1.54 1.74 2.21 2.92
1.50 1.68 2.11 2.76
1.46 1.63 2.02 2.61
1.76 2.08 2.80 3.87
1.71 2.00 2.66 3.64
1.66 1.92 2.52 3.40
1.61 1.84 2.37 3.14
1.57 1.78 2.27 2.98
1.54 1.74 2.20 2.87
1.51 1.69 2.11 2.73
1.48 1.66 2.06 2.64
1.47 1.64 2.02 2.57
1.42 1.58 1.92 2.41
1.38 1.52 1.82 2.25
1.73 2.03 2.70 3.67
1.68 1.95 2.56 3.44
1.63 1.87 2.42 3.20
1.57 1.78 2.27 2.95
1.53 1.73 2.17 2.79
1.50 1.69 2.10 2.68
1.46 1.63 2.01 2.53
1.44 1.60 1.95 2.44
1.42 1.58 1.91 2.38
1.38 1.51 1.80 2.21
1.33 1.45 1.70 2.05
1.71 1.99 2.63 3.54
1.66 1.92 2.50 3.32
1.60 1.84 2.35 3.08
1.54 1.75 2.20 2.83
1.50 1.69 2.10 2.67
1.48 1.65 2.03 2.55
1.44 1.59 1.94 2.41
1.41 1.56 1.88 2.32
1.40 1.53 1.84 2.25
1.35 1.47 1.73 2.08
1.30 1.40 1.62 1.92
1.66 1.93 2.50 3.30
1.61 1.85 2.37 3.07
1.56 1.77 2.22 2.84
1.49 1.68 2.07 2.59
1.45 1.62 1.97 2.43
1.42 1.57 1.89 2.32
1.38 1.52 1.80 2.17
1.35 1.48 1.74 2.08
1.34 1.45 1.69 2.01
1.28 1.38 1.57 1.83
1.22 1.30 1.45 1.64
1.63 1.88 2.41 3.12
1.58 1.80 2.27 2.90
1.52 1.72 2.13 2.67
1.46 1.62 1.97 2.42
1.41 1.56 1.87 2.26
1.38 1.52 1.79 2.15
1.34 1.46 1.69 2.00
1.31 1.41 1.63 1.90
1.29 1.39 1.58 1.83
1.23 1.30 1.45 1.64
1.16 1.21 1.30 1.43
1.61 1.84 2.34 2.99
1.55 1.76 2.20 2.77
1.49 1.68 2.06 2.54
1.43 1.58 1.90 2.30
1.38 1.52 1.79 2.14
1.35 1.47 1.72 2.02
1.30 1.41 1.61 1.87
1.27 1.36 1.54 1.77
1.25 1.33 1.50 1.69
1.18 1.24 1.35 1.49
1.08 1.11 1.16 1.22
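Entries of Table A.8 can be reproduced with standard statistical software. As a minimal illustration (an addition to the table, not part of the original), the R sketch below uses qf() to recover the ν2 = 22, ν1 = 4 row and the ν2 = 25, ν1 = 10 row; because the table gives upper-tail critical values, the quantile requested is 1 − α.

alpha <- c(.100, .050, .010, .001)
round(qf(1 - alpha, df1 = 4, df2 = 22), 2)    # compare Table A.8 row: 2.22 2.82 4.31 6.81
round(qf(1 - alpha, df1 = 10, df2 = 25), 2)   # compare Table A.8 row: 1.87 2.24 3.13 4.56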
Table A.9 Critical values for studentized range distributions. Rows: m = 2, 3, …, 12; columns: ν = 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 24, 30, 40, 60, 120, ∞. For each (m, ν) cell the two entries are the α = .05 and α = .01 critical values.
2 3.64 5.70 3.46 5.24 3.34 4.95 3.26 4.75 3.20 4.60 3.15 4.48 3.11 4.39 3.08 4.32 3.06 4.26 3.03 4.21 3.01 4.17 3.00 4.13 2.98 4.10 2.97 4.07 2.96 4.05 2.95 4.02 2.92 3.96 2.89 3.89 2.86 3.82 2.83 3.76 2.80 3.70 2.77 3.64
3 4.60 6.98 4.34 6.33 4.16 5.92 4.04 5.64 3.95 5.43 3.88 5.27 3.82 5.15 3.77 5.05 3.73 4.96 3.70 4.89 3.67 4.84 3.65 4.79 3.63 4.74 3.61 4.70 3.59 4.67 3.58 4.64 3.53 4.55 3.49 4.45 3.44 4.37 3.40 4.28 3.36 4.20 3.31 4.12
4 5.22 7.80 4.90 7.03 4.68 6.54 4.53 6.20 4.41 5.96 4.33 5.77 4.26 5.62 4.20 5.50 4.15 5.40 4.11 5.32 4.08 5.25 4.05 5.19 4.02 5.14 4.00 5.09 3.98 5.05 3.96 5.02 3.90 4.91 3.85 4.80 3.79 4.70 3.74 4.59 3.68 4.50 3.63 4.40
5 5.67 8.42 5.30 7.56 5.06 7.01 4.89 6.62 4.76 6.35 4.65 6.14 4.57 5.97 4.51 5.84 4.45 5.73 4.41 5.63 4.37 5.56 4.33 5.49 4.30 5.43 4.28 5.38 4.25 5.33 4.23 5.29 4.17 5.17 4.10 5.05 4.04 4.93 3.98 4.82 3.92 4.71 3.86 4.60
6 6.03 8.91 5.63 7.97 5.36 7.37 5.17 6.96 5.02 6.66 4.91 6.43 4.82 6.25 4.75 6.10 4.69 5.98 4.64 5.88 4.59 5.80 4.56 5.72 4.52 5.66 4.49 5.60 4.47 5.55 4.45 5.51 4.37 5.37 4.30 5.24 4.23 5.11 4.16 4.99 4.10 4.87 4.03 4.76
7 6.33 9.32 5.90 8.32 5.61 7.68 5.40 7.24 5.24 6.91 5.12 6.67 5.03 6.48 4.95 6.32 4.88 6.19 4.83 6.08 4.78 5.99 4.74 5.92 4.70 5.85 4.67 5.79 4.65 5.73 4.62 5.69 4.54 5.54 4.46 5.40 4.39 5.26 4.31 5.13 4.24 5.01 4.17 4.88
8 6.58 9.67 6.12 8.61 5.82 7.94 5.60 7.47 5.43 7.13 5.30 6.87 5.20 6.67 5.12 6.51 5.05 6.37 4.99 6.26 4.94 6.16 4.90 6.08 4.86 6.01 4.82 5.94 4.79 5.89 4.77 5.84 4.68 5.69 4.60 5.54 4.52 5.39 4.44 5.25 4.36 5.12 4.29 4.99
9 6.80 9.97 6.32 8.87 6.00 8.17 5.77 7.68 5.59 7.33 5.46 7.05 5.35 6.84 5.27 6.67 5.19 6.53 5.13 6.41 5.08 6.31 5.03 6.22 4.99 6.15 4.96 6.08 4.92 6.02 4.90 5.97 4.81 5.81 4.72 5.65 4.63 5.50 4.55 5.36 4.47 5.21 4.39 5.08
10 6.99 10.24 6.49 9.10 6.16 8.37 5.92 7.86 5.74 7.49 5.60 7.21 5.49 6.99 5.39 6.81 5.32 6.67 5.25 6.54 5.20 6.44 5.15 6.35 5.11 6.27 5.07 6.20 5.04 6.14 5.01 6.09 4.92 5.92 4.82 5.76 4.73 5.60 4.65 5.45 4.56 5.30 4.47 5.16
11 7.17 10.48 6.65 9.30 6.30 8.55 6.05 8.03 5.87 7.65 5.72 7.36 5.61 7.13 5.51 6.94 5.43 6.79 5.36 6.66 5.31 6.55 5.26 6.46 5.21 6.38 5.17 6.31 5.14 6.25 5.11 6.19 5.01 6.02 4.92 5.85 4.82 5.69 4.73 5.53 4.64 5.37 4.55 5.23
12 7.32 10.70 6.79 9.48 6.43 8.71 6.18 8.18 5.98 7.78 5.83 7.49 5.71 7.25 5.61 7.06 5.53 6.90 5.46 6.77 5.40 6.66 5.35 6.56 5.31 6.48 5.27 6.41 5.23 6.34 5.20 6.28 5.10 6.11 5.00 5.93 4.90 5.76 4.81 5.60 4.71 5.44 4.62 5.29
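The studentized range critical values in Table A.9 are likewise available in R via qtukey(1 − α, nmeans = m, df = ν). A minimal sketch (an addition, not part of the original table) checking a few entries:

round(qtukey(0.95, nmeans = 2, df = 5), 2)   # compare Table A.9: 3.64 (α = .05, m = 2, ν = 5)
round(qtukey(0.99, nmeans = 2, df = 5), 2)   # compare Table A.9: 5.70 (α = .01, m = 2, ν = 5)
round(qtukey(0.95, nmeans = 5, df = 15), 2)  # compare Table A.9: 4.37 (α = .05, m = 5, ν = 15)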
Table A.10 Chi-squared curve tail areas (upper-tail areas from >.100 down through .001, by df; the table entries themselves are not reproduced here)
Answers to Odd-Numbered Exercises
Chapter 2
14/17 131. a. 1/24 b. 15/24 c. 1 − e−1
133. s = 1 137. a. P(B0|survive) = b0/[1 – (b1 + b2)cd] P(B1|survive) = b1(1 – cd)/[1 – (b1 + b2)cd] P(B2|survive) = b2(1 – cd)/[1 – (b1 + b2)cd] b. .712, .058, .231
Chapter 3
1.
S:  FFF  SFF  FSF  FFS  FSS  SFS  SSF  SSS
X:    0    1    1    1    2    2    2    3
3. M = the absolute value of the difference between the outcomes with possible values 0, 1, 2, 3, 4, 5 or 6; W = 1 if the sum of the two resulting numbers is even and W = 0 otherwise, a Bernoulli random variable. 5. No, X can be a Bernoulli random variable where a success is an outcome in B, with B a particular subset of the sample space. 7. a. Possible values are 0, 1, 2, …, 12; discrete b. With N = # on the list, values are 0, 1, 2, …, N; discrete c. Possible values are 1, 2, 3, 4, …; discrete d. {x: 0 < x < 1} if we assume that a rattlesnake can be arbitrarily short or long; not discrete e. With c = amount earned per book sold, possible values are 0, c, 2c, 3c, …, 10,000c; discrete f. {y: 0 < y < 14} since 0 is the smallest possible pH and 14 is the largest possible pH; not discrete g. With m and M denoting the minimum and maximum possible tensions, respectively, possible values are {x: m < x < M}; not discrete h. Possible values are 3, 6, 9, 12, 15, … i.e. 3(1), 3(2), 3(3), 3(4), …giving a first element, etc.; discrete 9. a. X is a discrete random variable with possible values {2, 4, 6, 8, …} b. X is a discrete random variable with possible values {2, 3, 4, 5, …} 11. a. .10 c. .45, .25 13. a. b. c. d. e. f.
.70 .45 .55 .71 .65 .45
15. a. (1, 2) (1, 3) (1, 4) (1, 5) (2, 3) (2, 4) (2, 5) (3, 4) (3, 5) (4, 5) b. p(0) = .3, p(1) = .6, p(2) = .1, p(x) = 0 otherwise c. F(0) = .30, F(1) = .90, F(2) = 1. The cdf is F(x) = 0 for x < 0; .30 for 0 ≤ x < 1; .90 for 1 ≤ x < 2; and 1 for x ≥ 2.
17. a. .81 b. .162 c. The fifth battery must be an A, and one of the first four must also be an A, so p(5) = P(AUUUA or UAUUA or UUAUA or UUUAA) = .00324 d. P(Y = y) = (y – 1)(.1)y−2(.9)2, y = 2, 3, 4, 5,… 19. b. p(1) = .301, p(2) = .176, p(3) = .125, p(4) = .097, p(5) = .079, p(6) = .067, p(7) = .058, p(8) = .051, p(9) = .046. Lower digits (such as 1 and 2) are much more likely to be the lead digit of a number than higher digits (such as 8 and 9). c. F(1) = .301, F(2) = .477, F(3) = .602, F(4) = .699, F(5) = .778, F(6) = .845, F(7) = .903, F(8) = .954, F(9) = 1. So, F(x) = 0 for x < 1; F(x) = .301 for 1 x < 2; F(x) = .477 for 2 x < 3; etc. d. .602, .301 21. F(x) = 0, x < 0; .10, 0 x < 1; .25, 1 x < 2; .45, 2 x < 3; .70, 3 x < 4; .90, 4 x < 5; .96, 5 x < 6; 1.00, 6 x 23. a. p(1) = .30, p(3) = .10, p(4) = .05, p(6) = .15, p(12) = .40 b. .30, .60 25. a. p(x) = (1/3)(2/3)x−1, x = 1, 2, 3, … b. p(y) = (1/3)(2/3)y−2, y = 2, 3, 4, … c. p(0) = 1/6, p(z) = (25/54)(4/9)z−1, z = 1, 2, 3, 4, …
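Referring back to 19(b) above: the listed leading-digit probabilities agree with the Benford-law form p(x) = log10(1 + 1/x); the formula is stated here as an assumption for the check, since the answer itself only lists the values. A minimal R sketch:

round(log10(1 + 1/(1:9)), 3)
# 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046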
29. a. .60 b. $110
31. a. 16.38, 272.298, 3.9936 b. $458.46 c. $33.97 d. 13.66
33. Yes, because Σ(1/x²) is finite.
35. $700
37. Since $142.92 > $100, you expect to win more if you gamble.
39. a. –$1/19, –$1/19 b. The expected return for a $1 wager on roulette is the same no matter how you bet. c. $5.76, $2.76, $1.00 d. Low-risk/low-reward bets (such as a color) have smaller standard deviation than high-risk/high-reward bets (such as a single number).
43. a. 32.5 b. 7.5 c. V(X) = E[X(X − 1)] + E(X) – [E(X)]²
45. a. 1/4, 1/9, 1/16, 1/25, 1/100 b. μ = 2.64, σ = 1.54, P(|X – μ| ≥ 2σ) = .04 < .25, P(|X – μ| ≥ 3σ) = 0 < 1/9. The actual probability can be far below the Chebyshev bound, so the bound is conservative. c. 1/9, equal to the Chebyshev bound d. p(–1) = .02, p(0) = .96, p(1) = .02
47. MX(t) = .5e^t/(1 − .5e^t), E(X) = 2, V(X) = 2
49. a. .01e^{9t} + .05e^{10t} + .16e^{11t} + .78e^{12t} b. 11.71, .3659 (see the quick check following answer 65 below)
51. p(0) = .2, p(1) = .3, p(3) = .5, E(X) = 1.8, V(X) = 1.56
55. a. 5, 4 b. 5, 4
57. p(y) = (.25)^{y−1}(.75) for y = 1, 2, 3, …
59. MY(t) = e^{t²/2}, E(X) = 0, V(X) = 1
63. a. .124 b. .279 c. .635 d. .718
65. a. .873 b. .007 c. .716 d. .277 e. 1.25, 1.09
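A quick numerical check of 49(b) above (an addition; it assumes the pmf read off the mgf in 49(a), namely p(9) = .01, p(10) = .05, p(11) = .16, p(12) = .78):

x <- c(9, 10, 11, 12)
p <- c(.01, .05, .16, .78)
mu <- sum(x * p)                 # 11.71
sigma2 <- sum(x^2 * p) - mu^2    # 0.3659
c(mu, sigma2)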
67. a: :786
b: :169
c: :382
69. a: :403
b: :787
c: :774
115. a. b. c. d.
71. .1478
nb(x; 2, .5) = (x+1).5x+2, x = 0, 1, 2, 3, … 3/16 11/16 4, 2
73. .407, assuming batteries’ voltage levels are independent
117. 2 + 2 + 2 = 6
75. a. .0104 c. .00197 d. 1500, 260
121. a: 160; 21:9
77. a: :017
125. mean 0.5968, sd 0.8548 (answers will vary)
b: :811; :425
c: :006; :902; :586
79. For p = .9 the probability is higher for B (.9963 versus .99 for A) For p = .5 the probability is higher for A (.75 versus .6875 for B) 81. a. 20, 16 (binomial, n = 100, p = .2) b. 70, 21 83. a: p ¼ 0 or 1 b: p ¼ :5 85. When p = .5, the true probability for k = 2 is .0414, compared to the bound of .25. When p = .5, the true probability for k = 3 is .0026, compared to the bound of .1111. When p = .75, the true probability for k = 2 is .0652, compared to the bound of .25. When p = .75, the true probability for k = 3 is .0039, compared to the bound of .1111.
119. a: :2817
b:7513
c: :4912; :9123
b: :6756
127. .9090 (answers will vary) 129. a. mean 13.5888, sd 2.9381 b. .1562 (answers will vary) 131. mean 3.4152, variance 5.97 (answers will vary) 133. b. 142 tickets 135. a. .2291 b. $8696 c. $7811 d. .2342, $7767, $7571 (answers will vary) 137. b. .9196 (answers will vary) 139. b. 3.114, .405, .636
89. a: :932 b: :065 c: :068 d: :491 e: :251
141. a. b(x; 15, .75) b. .6865 c. .313 d. 45/4, 45/16 e. .309
91. a: :011 b: :441 c: :554; :459
143. a. .013 b. 19 c. .266 d. Poisson(500)
d: :944
93. a: :219 b: :558
145. a: pðx; 2:5Þ
95. .857
c: :109
147. 1.813, 3.05
97. a: :122; :808; :283 b: 12; 3:464 c: :530; :011 99. a: :099 b: :135 c: 2 101. a: 4 b: :215 c: 1:15 years 103. a: :221
b: :067
b: 6; 800; 000
c: pðx; 1608:5Þ
109. a: :114 b: :879 c: :121 d: use Binð15; :1Þ 111. a: hðx; 15; 10; 20Þ
b: :0325
c: :6966
113. a. h(x; 10, 10, 20) b. .0325 c. h(x; n, n, 2n), E(X) = n/2, V(X) = n2/[4(2n–1)]
149. p(2) = p2, p(3) = (1 − p)p2, p(4) = (1 − p)p2, p(x) = [1 − p(2) − ⋯ − p(x − 3)](1 − p)p2, x = 5, 6, 7, … . Alternatively, p(x) = (1 − p)p(x − 1) + p(1 − p)p(x − 2), x = 5, 6, 7, … .99950841 151. a: 0029
b: 0767; :9702
153. a: :135
b: :00144
155. 3.590 157. a: No
b: :0273
c:
1 P x¼0
½pðx; 2Þ5
937
159. b. .5l1 + .5l2 c. .5l1 + .5l2 + .25(l1 – l2)2 d. p(x; l1, l2) = .6 p(x; l1) + .4 p(x; l2)
29. 1/4, 1/16
161. .5
33. MY ðtÞ ¼ ðe5t e5t Þ=10t, Y * Unif[–5, 5]
165. X * b(x; 25, p), E(h(X)) = 500p + 750, pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi rh(X) = 100 pð1 pÞ. Independence and constant probability might not be valid because of the effect that customers can have on each other. Also, store employees might affect customer decisions.
35. a. MY ðtÞ ¼ :04e10t =ð:04 tÞ for t < .04, mean = 35, variance = 625 b. MðtÞ ¼ :04=ð:04 tÞ for t < .04, mean = 25, variance = 625 c. MY ðtÞ ¼ :04=ð:04 tÞ; Y is a shifted exponential rv
31. a. v/20, v/800 b. 100.2p versus 100p c. 80p2
167. pð0Þ ¼ :07776; pð1Þ ¼ :10368; pð2Þ ¼ :19008; 39. a. .4850 b. .3413 c. .4938 d. .9876 e. .9147 f. .9599 g. .9104 h. .0791 i. .0668 j. .9876 pð3Þ ¼ :20736; pð4Þ ¼ :17280; pð5Þ ¼ :13824;
pð6Þ ¼ :06912; pð7Þ ¼ :03072; pð8Þ ¼ :01024
Chapter 4
41. a. 1.34 b. –1.34 c. .674 d. –.674 e. –1.555 43. a. .9772 b. .5 c. .9104 d. .8413 e. .2417 f. .6826
1. a. .25 b. .5 c. 7/16 3. b. .5 c. 11/16 d. .6328
45. a. .7977 b. .0004 c. The top 5% are the values above .3987.
5. a. 3/8 b. 1/8 c. .2969 d. .5781
47. The second machine
7. a. f(x) = :10 for 25 x 35 and = 0 otherwise b. .2 c. .4 d. .2
49. a. .2514, *0 b. 39.985 ksi
9. a. .699 b. .301, .301 c. .166 pffiffiffi 11. a. 1/4 b. 3/16 c. 15/16 d. 2 e. f(x) = x/2 for 0 x < 2, and f(x) = 0 otherwise
53. a. .8664 b. .0124 c. .2718
13. a. 3 b. 0 for x 1, 1 – 1/x for x > 1 c. 1/8, .088 3
15. a. F(x) = 0 for x 0, F(x) = x /8 for 0 < x < 2, F(x) = 1 for x 2 b. 1/64 c. .0137, .0137 d. 1.817 3
51. .0510
55. a. .794 b. 5.88 c. 7.94 d. .265 57. a. U(1.72) – U(.55) b. U(.55) – [1 – U(1.72)]; No, due to symmetry. 59. a. .4584 b. 135.8 kph c. .9265 d. .3173 e. .6844 61. a. .7286 b. .8643, .8159
17. b. 90th percentile of Y = 1.8(90th percentile of X) + 32 c. 100pth percentile of Y = a(100pth percentile of X) + b for a > 0.
63. a. .9932 b. .9875 c. .8064
19. a. 35, 25 b. .865
69. a. .15872 b. .0013495 c. .999936655 d. .00000028669
21. a. .8182, .1113 b. .314 23. a. A + (B − A)p b. (A + B)/2 c. (Bn+1 – An+1)/[(n+1)(B – A)] 25. 314.79 27. 248, 3.6
65. a. .0392 b. *1 actual actual actual actual
.15866 .0013499 .999936658 .00000028665
71. a. 120 b. 1.329 c. .371 d. .735 e. 0 73. a. 5, 4 b. .715 c. .411 75. a. 1 b. 1 c. .982 d. .129
77. a. .449, .699, .148 b. .050, .018
119. Y = X2/16
79. a. \ Ai b. Exponential with k = .05 c. Exponential with parameter nk
2 ey =2 for y > 0 121. fY(y) = p2ffiffiffiffi 2p
81. a. Gamma, a = 3, b = 1/k b. .8165 85. a. .275, .599, .126 c. 15.84
b. 20.418, 17.365
89. a. .9295
c. 98.184
b. .2974
pffiffiffi 125. a. F(x) = x2/4, x ¼ 2 u c. sample mean and sd = 1.331 and 0.471 (answers will vary), l = 4/3 and pffiffiffi r ¼ 2=3 129. b. sample mean = 15.9188, close to 16 (answers will vary)
91. a. 7.53, 9.966 b. .7823, .1469 c. .6925; lognormal is not symmetric
131. $1480
93. a. 149.157, 223.595 b. .957 c. .0416 d. 148.41 e. 9.57 f. 125.90
133. b. F(x) = 1 – 16/(x + 4)2, x 0; F(x) = 0, x < 0 c. .247 d. 4 e. 16.67
95. a = b
135. a. .6563 b. 41.55 c. .3197
97. b. C(a + b)C(m + b)/[C(a + b + m)C(b)], b/(a + b)
137. a. .0003 b. .0888
99. Yes, since the pattern in the plot is quite linear.
139. a. F(x) =1.5(1 – 1/x), 1 x 3; F(x) = 0, x < 1; F(x) = 1, x > 3 b. .9, .4 c. 1.6479 d. .5333 e. .2662
101. Yes
141. a. 1.075, 1.075 b. .0614, .3331 c. 2.476
103. Yes, because the plot is reasonably straight
143. b. $95,600, .3300
105. Form a new variable, the logarithms of a TN value, and then construct a normal plot for its values. Because of the linearity of this plot, normality is plausible. 107. The pattern in the normal probability plot is curved downward, consistent with a right-skewed distribution. It is not plausible that shower flow rate has a normal population distribution. 109. The plot deviates from linearity, especially at the low end, where the smallest three observations are too small relative to the others. The plot works for any k because k is a scale parameter. 111. fY(y) = 2/y3, y > 1 113. fY(y) = yey
2
=2
,y>0
115. fY(y) = 1/16, 0 < y < 16 117. fY(y) = 1/[p(1 + y2)]
145. b. F(x) = .5e2x, x 0; F(x) = 1 – .5e−.2x, x > 0 c. .5, .6648, .2555, .6703 147. a. k ¼ ða 1Þ5a1 for a > 1 b. F(x) = 0, x 5; F(x) = 1 – (5/x)a–1, x > 5 c. 5(a − 1)/(a − 2) 149. b. .4602, .3636 c. .5950 d. 140.178 151. a. Weibull b. .542 153. a. k
b. axa–1/ba
c. FðxÞ ¼ 1 eaðxx =ð2bÞÞ , 0 x b; F(x) = 0, x < 0; F(x) = 1 – e−ab/2, 2 x [ b; f ðxÞ ¼ að1 x=bÞ eaðxx =ð2bÞÞ , 2
0 x b; f(x) = 0, x < 0, f(x) = 0, x > b This gives total probability less than 1, so some probability is located at infinity (for items that last forever).
157. F(q*) = .818
y
Chapter 5
pY(y)
1. a. .20 b. .42 c. The probability of at least one hose being in use at each pump is .70. d. x 0 1 2 pX ðxÞ :16 :34 :50 y pY ðyÞ
0 :24
1 :38
2 :38
c. no d. 0.35, 0.32 e. 95.72 25. .15 27. L2
33. –.1082, –.0131
3. a. .15 b. .40 c. .22 = P(A) = P(|X1 – X2| 2) d. .17, .46
35. .238, .51 37. VðhðX; YÞÞ ¼ Eðh2 ðX; YÞÞ ½EðhðX; YÞÞ2 , 13.34 41. q = 1 when a > 0
5. a. .0305 b. .1829 c. probability = .1073, marginal evidence
43. a. 87,850, 4370.37 b. yes, no c. .0027
7. a. .054 b. .00018
45. .0336, .2310
9. a. .030 b. .120 c. .10, .30 d. .38 e. yes, p(x,y) = pX(x) pY(y)
47. .0314 49. a. 45 min 68.33
11. a. 3/380,000 b. .3024 c. .3593 d. 10kx2 + .05, 20 x 30 e. no 13. a. pðx; yÞ ¼ el1 l2 lx1 ly2 =x!y! b. el1 l2 ½1 þ l1 þ l2 l1 l2 ðl1 þ l2 Þm , Poisson with parameter c. e m! l1 þ l2 15. a. e−x−y, x 0, y 0 d. .3298
b. .3996
c. .5940
b. 68.33
c. –1, 13.67
d. –5,
51. a. 50, 10.308 b. .0075 c. 50 d. 111.5625 e. 131.25 53. a. .9616
b. .0623
55. a. .5, n(n + 1)/4 b. .25, n(n + 1)(2n + 1)/24 57. 10:52.76 61. a. Bin(10, 18/38) b. Bin(15, 18/38) c. Bin(25, 18/38) f. no
−2ky + e−3ky for y 0, 17. a. F(y) = 1 – 2e F(y) = 0 for y < 0; f(y) = 4ke−2ky – 3ke−3ky for y 0 b. 2/(3k)
65. c. Gamma(n, 1/k)
19. a. .25 b. 1/p c. 2/p d. fX ð xÞ ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffi. 2 r 2 x2 ðpr 2 Þ for –r x r, fY ð yÞ ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffi. 2 r 2 y2 ðpr 2 Þ for –r y r, no 21. 1/3
pX(x)
2 .09
31. –2/3
e. dependent, .30 = P(X = 2 and Y = 2) 6¼ P(X = 2) P(Y = 2) = (.50)(.38)
b. x
1 .14
29. 1/4 h
P(X 1) = .50
23. a. .11
0 .77
0
1
2
3
.78
.12
.07
.03
67. a. 2 c. 0, 2n, 1/(1 – t2)n d. 1/(1 – t2/2n)n 69. a: fX ðxÞ ¼ 2x; 0\x\1 b: fYjX ðyjxÞ ¼ 1=x; 0\y\x\1 c: :6 d: no; the domain is not a rectangle e: EðYjX ¼ xÞ ¼ x=2 f: VðYjX ¼ xÞ ¼ x2 =12 71. a: fX ðxÞ ¼ 2e2x ; 0\x\1 b: fYjX ðyjxÞ ¼ eyþx ; 0\x\y\1 c: PðY [ 2jx ¼ 1Þ ¼ 1=e
d: no; the domain is not rectangular e: EðYjX ¼ xÞ ¼ x þ 1 f: VðYjX ¼ xÞ ¼ 1 73. a. x/2, x2/12 b. 1/x, 0 < y < x < 1 c. –ln(y), 0 < y < 1 d. 1/4, 7/144 e. 1/4, 7/144 75. a. pY|X(0|1) = 4/17, pY|X(1|1) = 10/17, pY|X(2|1) = 3/17 b. pY|X(0|2) = .12, pY|X(1|2) = .28, pY|X(2|2) = .60 c. .40 d. pX|Y(0|2) = 1/19, pX|Y(1|2) = 3/19, pX|Y(2|2) = 15/19 77. a. x2/2, x4/12 b. 1/x2, 0 < y < x2 < 1 pffiffiffi c. ð1= yÞ 1, 0 < y < 1 79. a. p(1,1) = p(2,2) = p(3,3) = 1/9, p(2,1) = p(3,1) = p(3,2) = 2/9 b. pX(1) = 1/9, pX(2) = 3/9, pX(3) = 5/9 c. pY|X(1|1) = 1, pY|X(1|2) = 2/3, pY|X(2|2) = 1/3, pY|X(1|3) = .4, pY|X(2|3) = .4, pY|X(3|3) = .2 d. E(Y|X=1) = 1, E(Y|X=2) = 4/3, E(Y|X=3) = 1.8, no e. V(Y|X=1) = 0, V(Y|X=2) = 2/9, V(Y|X=3) = .56 81. a. pX|Y(1|1) = .2, pX|Y(2|1) = .4, pX|Y(3|1) = .4, pX|Y(2|2) = 1/3, pX|Y(3|2) = 2/3, pX|Y(3|3) = 1 b. E(X|Y=1) = 2.2, E(X|Y=2) = 8/3, E(X|Y=3) = 3 c. V(X|Y=1) = .56, V(X|Y=2) = 2/9, V(X|Y=3) = 0 83. a. pX(x) = .1, x = 0, 1, 2, …, 9; pY|X(y|x) = 1/9, y = 0, 1, 2, …, 9, y 6¼ x; pX,Y(x, y) = 1/90, x, y = 0, 1, 2, …, 9, y 6¼ x b. E(Y|X=x) = 5 – x/9, x = 0, 1, 2, …, 9 85. a. .6x, .24x b. 60 c. 60 87. 176, 12.68 89. a. 1 + 4p, 4p(1 – p) b. 2598, 16,518,196 pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi c. 2598(1 + 4p), 16518196 þ 93071200p 26998416p2 d. 2598 and 4064, 7794 and 7504, 12,990 and 9088 91. a. normal, mean = 984, variance = 38,988 b. .1379 c. 1237 93. a. N(158, 8.72) b. N(170, 8.72) c. .4090
95. a. .8875x + 5.2125 b. 111.5775 c. 10.563 d. .0951 97. a. 2x – 10
b. 9
c. 3
d. .0228
99. a. .1410 b. .1165 With positive correlation, the deviations from their means of X and Y are likely to have the same sign. 101. a.
1 ðy21 þy22 Þ=4 4p e
2 b. p1ffiffiffiffi ey1 =4 4p
c. Yes
103. a. f(y) = y(2 – y), 0 y 1 b. f(w) = 2(1 – w), 0 w 1 2
105. 4y3[ln(y3)] for 0 < y3 < 1 5
109. a. g5(y) = 5x4/10 , 25/3 b. 20/3 c. 5 d. 1.409 111. gY5 jY1 ðy5 j4Þ ¼ ½2=3½ðy5 4Þ=63 , 4 < y5 < 10; 8.8 113. 1/(n+1), 2/(n+1), 3/(n+1), …, n/(n+1) Cðnþ1ÞCðiþ1=hÞ
115. CðiÞCðnþ1þ1=hÞ, Cðnþ1ÞCðiþ2=hÞ CðiÞCðnþ1þ2=hÞ
h
Cðnþ1ÞCðiþ1=hÞ CðiÞCðnþ1þ1=hÞ
i2
117. a. .0238 b. $2025 121. a. nðn 1Þ½Fðyn Þ Fðy1 Þn2 f ðy1 Þf ðyn Þ for y1 \yn R b. fW ðwÞ ¼ nðn 1Þ½Fðw þ w1 Þ Fðw1 Þn2 f ðw1 Þf ðw þ w1 Þdw1 c. nðn 1Þwn2 ð1 wÞ for 0 w 1 123. fT ðtÞ ¼ et=2 et for t > 0 125. a. 3/81,250 R 30x b. fX ðxÞ ¼ 20x kxydy ¼ kð250x 10x2 Þ 0 x 20 R 30x kxydy ¼ kð450x 30x2 þ 12x3 Þ: 0 20 x 30 fY(y) = fX(y) dependent c. .3548 d. 25.969 e. −32.19, −.894 f. 7.651 127. 7/6 131. c. If p(0) = .3, p(1) = .5, p(2) = .2, then 1 is the smaller of the two roots, so extinction is certain in this case with l < 1. If p(0) = .2, p(1) = .5, p(2) = .3,
then 2/3 is the smaller of the two roots, so extinction is not certain with l > 1. 133. a. P((X, Y) 2 A) = F(b, d) – F(b, c) − F(a, d) + F(a, b) b. P((X, Y) 2 A) = F(10, 6) − F(10, 1) − F(4, 6) + F(4, 1) P((X, Y) 2 A) = F(b, d) − F(b, c − 1) − F(a − 1, d) + F(a − 1, b − 1) c. At each (x*, y*), F(x*, y*) is the sum of the probabilities at points (x, y) such that x x* and y y* x Fðx; yÞ 100 y
200 100
:50 :30
1 :50
0
:20
:25
135. a. 2x, x b. 40 c. 100 2 , 1500 hours ð1 1000tÞð2 1000tÞ
141. a. 2360, 73.7021 b. .9713 143. .8340 145. a.
r2W r2W þr2E
147. 26, 1.64
b. .9999
Chapter 6 1. a.
x
25
32.5
40
45
52.5
65
pðxÞ
.04
.20
.25
.12
.30
.09
EðXÞ ¼ 44:5 ¼ l b.
s2 p(s2)
0
112.5
312.5
800
.38
.20
.30
.12
EðS2 Þ ¼ 212:25 ¼ r2 3.
x/n p(x/n)
250
d. F(x, y) = .6x2y + .4xy3, 0 x 1; 0 y 1; F(x, y) = 0, x 0; F(x, y) = 0, y 0; F(x, y) = .6x2 + .4x, 0 x 1, y > 1; F(x, y) = .6y + .4y3, x > 1, 0 y 1; F(x, y) = 1, x > 1, y > 1 P(.25 x .75, .25 y .75) = .23125 e. F(x, y) = 6x2y2, x + y 1, 0 x 1; 0 y 1, x 0, y 0 F(x, y) = 3x4 − 8x3 + 6x2 + 3y4 – 8y3 + 6y2 – 1, x + y > 1, x 1, y 1 F(x, y) = 0, x 0; F(x, y) = 0, y 0; F(x, y) = 3x4 − 8x3 + 6x2, 0 x 1, y>1 F(x, y) = 3y4 – 8y3 + 6y2, 0 y 1, x>1 F(x, y) = 1, x > 1, y > 1
137.
941
0
.1
.2
.3
.4
0.0000
0.0000
0.0001
0.0008
0.0055
.5
.6
.7
.8
.9
1.0
0.0264
0.0881
0.2013
0.3020
0.2684
0.1074
x
5. a.
pðxÞ
1
1.5
2
2.5
3
3.5
4
.16
.24
.25
.20
.10
.04
.01
b.
PðX 2:5Þ = .85
c.
r
0
1
2
3
p(r) .30 .40 .22 .08
d. 7.
.24 x
pðxÞ
x
pðxÞ
x
pðxÞ
0.0 0.2 0.4 0.6 0.8 1.0 1.2
0.000045 0.000454 0.002270 0.007567 0.018917 0.037833 0.063055
1.4 1.6 1.8 2.0 2.2 2.4 2.6
0.090079 0.112599 0.125110 0.125110 0.113736 0.094780 0.072908
2.8 3.0 3.2 3.4 3.6 3.8 4.0
0.052077 0.034718 0.021699 0.012764 0.007091 0.003732 0.001866
11. a. 12, .01 b. 12, .005 c. With less variability, the second sample is more closely concentrated near 12. 13. a. No, the distribution is clearly not symmetric. A positively skewed distribution— perhaps Weibull, lognormal, or gamma. b. .0746 c. .00000092. No, 82 is not a reasonable value for l.
942
Answers to Odd-Numbered Exercises
15. a. .8366
d. The sample proportion of students exceeding 100 in IQ is 30/33 = .91 e. .112, S/X
b. no
17. 43.29 19. a. .9772, .4772
b. 10
21. a. .9838 b. .8926 c. .9862 and .8934, both quite close 27. 1=X 29. Because v2m is the sum of m independent random variables, each distributed as v21 , the Central Limit Theorem applies. 35. a. 3.2
b. 10.04, the square of (a)
39. a. 4.32 41. a. m2/(m2 − 2), m2 > 2 b. 2m22 ðm1 þ m2 2Þ =½m1 ðm2 2Þ2 ðm2 4Þ, m2 > 4 49. a. The approximate value, .0228, is smaller because of skewness in the chi-squared distribution b. This approximation gives the answer .03237, agreeing with the software answer to this number of decimals. 53. a. .9686
b. .90
c. .87174
55. .048 57. a. .9544 for all n b. .8839, .9234, .9347; increases with n toward (a) 59. a. 2.6, 1.2
b. 390, 14.7
c. *1
61. .9686
3. a. 1.3481, X b. .0846 c. 1.3481, X d. 1.78, X + 1.282S e. .6736 ^1 ¼ NX, ^ 5. h h2 ¼ T ND, ^ h3 ¼ T X=Y; 1,703,000, 1,591,300, 1;601;438:281 7. a. 120.6 b. 1,206,000, 10,000X c. .8 ~ d. 120, X pffiffiffiffiffiffiffiffi 9. a. X, 2.113 b. l=n, .119 pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 11. b. p1 ð1 p1 Þ=n1 þ p2 ð1 p2 Þ=n2 c. In part (b) replace p1 with X1/n1 and replace p2 with X2/n2 d. −.245 e. .0411 qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 13. c. n2 =½ðn 1Þ2 ðn 2Þx2 P 17. a. ^ h ¼ Xi2 =ð2nÞ
b. 74.505
19. 4/9 21. a. p ^ ¼ 2^ k :30 ¼ :20 b. ^ p ¼ ð100^ k 9Þ=70 25. a. .15 b. yes c. .4437 ^ ¼ ð2x 1Þ=ð1 xÞ = 3 27. a. h b. ^ h ¼ ½n=R lnðxi Þ 1 = 3.12 ^ ¼ r=x ¼ :15 This is the number of suc29. p cesses over the number of trials, the same as the result in Exercise 25. It is not the same as the estimate of Exercise 19.
63. .0722
31. a. ð2phÞn=2 eRxi =2h
65. a. .5774, .8165, .9045 b. 1.312, 4.303, 18.216
b. n2 lnð2pÞ n2 lnðhÞ 2hi P 2 c. xi =n P d. n= x2i P 2 33. a. xi =2n=74.505, the same as Exercise 17 qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi b. 2^ h lnð2Þ=10.16
67. a. .049 b. .09 Chapter 7 1.
~ a. 113.73, X b. 113, X c. 12.74, S, an estimator for the population standard deviation
2
Rx2
Answers to Odd-Numbered Exercises
35. ^k ¼ lnð^pÞ=24 ¼ :0120 37. a. X
~ b. X
943
Chapter 8 1. a. 99.5% b. 85% c. 2.97 d. 1.15
39. No, statistician A does not have more information. Qn Pn 41. i¼1 xi ; i¼1 xi P 43. xi
3. a. A narrower interval has a lower probability b. No, l is not random c. No, the interval refers to l, not individual observations d. No, a probability of .95 does not guarantee 95 successes in 100 trials
45. (min{Xi}, max{Xi}) 5. a. (4.52, 5.18)
47. 2X(n − X)/[n(n − 1)] 49. X, by the Rao-Blackwell Theorem, because X is sufficient for l 51. a. 1=p2 ð1 pÞ b. n=p2 ð1 pÞ c. p2 ð1 pÞ=n
7. Increase n by a factor of 4. Decrease the width by a factor of 5. pffiffiffi 9. a. x za r= n b. 4.418 c. 59.70
13. a. 1.341 b. 1.753 c. 1.708 d. 1.684 e. 2.704 15. a. 2.228 b. 2.131 c. 2.947 d. 4.604 e. 2.492 f. 2.715 17. a. Yes b. (4.89, 5.79) c. (.5868, .6948) fl oz 19. a. (63.12, 66.66) b. No, data indicates the population is not normal
b. Yes
59. a. 1/p, ð1 pÞ=np2
c. 55 d. 94
11. 950; .8724 (normal approximation), .8731 (binomial)
53. a. If we ignore the boundary, 1/h2 b. h2/n c. Both are less than h2/n; Cramér-Rao does not apply because the boundaries of the uniform variable X include h itself pffiffiffi 55. a. x b. Nðl; r= nÞ c. Yes d. They agree 57. a. 2/r2
b. (4.12, 5.00)
b. ð1 pÞ=np2 c. Yes
61. k^ ¼ 6=ð6t6 t1 t5 Þ ¼ 6=ðx1 þ 2x2 þ þ 6x6 Þ = .0436, where x1 = t1, x2 = t2 – t1, …, x6 = t6 – t5 63. 2.912, 2.242 67. 5.93, 11.66 r2 is unbiased 69. b. no, Eð^ r2 Þ ¼ r2 =2, so 2^ 73. .448, .4364 75. d(X) = (–1)X, d(200) = 1, d(199) = –1 P 2 P xi = 30.040, the estimated 77. b^ ¼ xi yi P ^ i Þ2 = ^2 ¼ 1n ðyi bx minutes per item; r 16.912; 25b^ = 751
21. a. (29.26, 40.78) b. (–3.61, 73.65); times are normally distributed; no 23. a. (18.94, 24.86) b. narrower d. (12.09, 31.71)
c. narrower
25. a. Assuming normality, a 95% lower confidence bound is 8.11. When the bound is calculated from repeated independent samples, roughly 95% of such bounds should be below the population mean. b. A 95% lower prediction bound is 7.03. When the bound is calculated from repeated independent samples, roughly 95% of such bounds should be below the value of an independent observation. 27. a. 378.85
b. 413.14
c. (340.16, 401.22)
29. (–8228.0, 116,042.4); yes (note that negative values in PI make its validity suspect)
944
Answers to Odd-Numbered Exercises
31. a. (169.36, 179.37) b. (134.30, 214.43), which includes 152 c. The second interval is much wider, because it allows for the variability of a single observation. d. The normal probability plot gives no reason to doubt normality. This is especially important for part (b), but the large sample size implies that normality is not so critical for (a). 33. a. (18.413, 20.102) b. 18.852 c. data indicates the population distribution is not normal 35. (18.3, 19.7) days 37. 0.056; yes, though potentially not by much 39. 97 41. a. 80% b. 98% c. 75% 43. (.798, 845) 45. a. .504 b. Yes, since p > .504 > .5 47. .584 49. (.513, .615) 51. .441; yes 53. a. 381 b. 339 55. (.028, .167) 57. a. 22.307 b. 34.381 c. 44.313 d. 49.925 e. 11.523 f. 10.519 59. (0.4318, 1.4866), (.657, 1.219) 61. b. (2.34, 5.60) c. False 63. a. (7.91, 12.00) b. Yes c. The accompanying R code assumes the data has been read in as the vector x. N=5000 xbar=rep(0,N) for (i in 1:N){ resample = sample(x,length(x), replace = T) xbar[i]=mean(resample) }
d. (7.905, 12.005) for one simulation; bootstrap distribution is somewhat skewed, so validity is questionable e. (8.204, 12.091) for one simulation f. Bootstrap percentile interval; population and bootstrap distributions are both skewed 65. a. (26.61, 32.94) b. Because of outliers, weight gain does not seem normally distributed. However, with n = 68, the effects of the CLT might be enough to validate use of t procedures anyway. d. (26.66, 32.90) for one simulation; yes, because histogram is bell-shaped e. (26.69, 32.88) for one simulation f. All three are close, so one-sample t should be considered valid 67. a. (38.46, 38.84) b. Although a normal probability plot is not perfectly straight, there is not enough deviation to reject normality. d. (38.47, 38.83) for one simulation; possibly invalid because bootstrap distribution is somewhat skewed e. (38.51, 38.81) for one simulation f. All three intervals are surprisingly similar. g. Yes: all CIs are well above normal body temperature of 37°C 69. a. (170.75, 183.57) b. Plot is reasonably linear, CI in part (a) is legitimate. d. (170.76, 183.46) for one simulation; very similar to (a) e. (171.04, 183.00) for one simulation f. All three are valid, so use the shortest: (171.04, 183.00). None of the CIs capture the true l value 71. 246 73. a. .614 b. 4727.8, no c. Yes: .66(7700) = 5082 > 5000 75. a. (.163, .174) 77. (.1295, .2986)
b. (.089, .326)
Answers to Odd-Numbered Exercises
79. a. At 95% confidence, average TV viewing time for the population of all 0–11 months old children is between 0.8 and 1.0 hours per day. (Similar interpretation for others.) b. Samples with larger standard deviations and/or smaller sample sizes will result in wider intervals. c. Yes: none of the intervals overlap pffiffiffiffiffiffiffiffi 81. c. r2 =Rx2i , r= Rx2i d. Spread out: variance is inversely proportional to sum of squares of x values pffiffiffiffiffiffiffiffi ^ t:025;n1 s= Rx2 ; (29.93, 30.15) e. b i 85. a. v2:95;2n =2RXi , .0098 b. expðt v2:05;2n =2RXi Þ = .058
945
c. A type I error involves saying that the two companies are not equally favored when they are. A type II error involves saying that the two companies are equally favored when they are not. d. Bin(25, .5); .0433 e. bð:3Þ ¼ bð:7Þ ¼ :488; bð:4Þ ¼ bð:6Þ ¼ :845; power = .512 for .3 and .7, .155 for .4 and .6 11. a. b. c. d. e. f.
H0: l = 10 versus Ha: l 6¼ 10 .01 .5319. .0078 c = 2.58 c = 1.96 x =10.02, so do not reject H0
pffiffiffi pffiffiffi 87. a. ðx t:025;n1;d s= n; x t:975;n1;d s= nÞ b. (3.01, 4.46)
13. c. .0004, *0, P(type I error) a = .01 when l < l0
89. a. 1/2n b. n/2n c. (n+1)/2n, 1 − (n + 1)/2n−1, (29.9, 39.3) with confidence level .9785
15. a. .0301 b. .003 c. .004
91. a. P(A1 \ A2) = .95 = .9025 b. P(A1 \ A2) .90 c. P(A1 \ A2) 1 − 2a; P(A1 \ A2 \ ⋯ \ Ak) 1 − ka
17. Test H0: l = .5 versus Ha: l 6¼ .5
2
Chapter 9 1. a. yes b. no c. no d. yes e. no f. yes 5. H0: r = .05 versus Ha: r < .05. Type I error: Conclude that the standard deviation is less than .05 mm when it is really equal to .05 mm. Type II error: Conclude that the standard deviation is .05 mm when it is really less than .05.
a. Do not reject H0 because t.025,12 = 2.179 > |1.6| b. Do not reject H0 because t.025,12 = 2.179 > |−1.6| c. Do not reject H0 because t.005,24 = 2.797 > |−2.6| d. Reject H0 because t.005,24 = 2.797 < |−3.9| 19. a. Do not reject H0 because |–2.27| < 2.576 b. .2266 c. 22 21. z = –2.14, so reject H0 at .05 level but not at .01 level 23. Because t = 2.24 > 1.708 = t.05,25, reject H0: l = 360. Yes, this suggests contradiction of prior belief.
7. A type I error here involves saying that the plant is not in compliance when in fact it is. A type II error occurs when we conclude that the plant is in compliance when in fact it isn’t. A government regulator might regard the type II error as being more serious.
25. a. Because |–1.40| < 2.064, H0 is not rejected at the .05 level. b. 600 lies in the CI for l
9. a. R1 b. Reject H0
27. a. no, t = −.02 b. .58 c. n = 20 total observations
946
Answers to Odd-Numbered Exercises
29. Since 1.04 < 2.132, we do not reject H0 at the .05 significance level. 31. a. Because t = .50 < 1.89 = t.05,7 do not reject H0. b. .73 33. Because t = −1.24 > −1.40 = −t.10,8, we do not have evidence to question the prior belief. 0 37. a. 1 F ta;n1 ; n 1; l plffiffiffi0 r= n l0 l0 pffiffiffi b. F ta=2;n1 ; n 1; r= n l0 l0 pffiffiffi þ 1 F ta=2;n1 ; n 1; r= n 39. Since |–2.469| 1.96, reject H0. 41. a. Do not reject H0: p = .10 in favor of Ha: p > .10 because 16 or more blistered plates would be required for rejection at the .05 level. Because H0 is not rejected, there could be a type II error. b. bð:15Þ ¼ :4920 when n ¼ 100; bð:15Þ ¼ :2743 when n ¼ 200 c. 362 43. a. Do not reject H0: p = .02 in favor of Ha: p < .02 because z = −1.01 is not in the rejection region at the .05 level. There is no strong evidence suggesting that the inventory be postponed. b. bð:01Þ ¼ :195 c. 1 bð:05Þ 0 45. a. Test H0: p = .05 versus Ha: p 6¼ .05. Since z = 3.07 > 2.58, H0 is rejected. The company’s premise is not correct. b. bð:10Þ ¼ :033 47. Using n = 25, the probability of 5 or more leaky faucets is .0980 if p = .10, and the probability of 4 or fewer leaky faucets is .0905 if p = .3. Thus, the rejection region is 5 or more, a = .0980, and b = .0905. 49. a. reject
b. reject
c. do not reject
d. reject e. do not reject 51. a. .0778 b. .1841 e. .5438
c. .0250
d. .0066
53. a. P = .0403 b. P = .0176 c. P = .1304 d. P = .6532 e. P =.0021 f. P = .00022 55. Based on the given data, there is no reason to believe that pregnant women differ from others in terms of serum receptor concentration. 57. a. Because the P-value is .166, no modification is indicated. b. .9974 59. Because t = −1.759 and the P-value = .082, which is less than .10, reject H0: l = 3.0 against a two-tailed alternative at the 10% level. However, the P-value exceeds .05, so do not reject H0 at the 5% level. There is just a weak indication that the percentage is not equal to 3% (lower than 3%). 61. a. Test H0: l = 10 versus Ha: l < 10 b. Because the P-value is .017 < .05, reject H0, suggesting that the pens do not meet specifications. c. Because the P-value is .045 > .01, do not reject H0, suggesting there is no reason to say the lifetime is inadequate. d. Because the P-value is .0011, reject H0. There is good evidence showing that the pens do not meet specifications. 63. Do not reject H0 at .01 or .05, reject H0 at .10 65. b. 36.614 c. yes 67. a. Rxi > c b. yes 69. Yes, the test is UMP for the alternative Ha: h > .5 since the tests for H0: h = .5 versus Ha: h = p0 all have the same form for p0 > .5. 71. b. .0502 c. .04345, .05826, no not most powerful 73. –2ln(K) = 3.041, P-value = .081
d. .05114;
Answers to Odd-Numbered Exercises
75. a. .98, .85, .43, .004, .0000002 b. .40, .11, .0062, .0000003 c. Because the null hypothesis will be rejected with high probability, even with only slight departure from the null hypothesis, it is not very useful to do a .01 level test. 2
S r20 2r40 =ðn1Þ
77. a. pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi b. P-value = P(Z –3.59) .0002, so reject H0. 79. a. b. c. d.
16.803 reject H0 because 15 is not > 16.803 no reject H0 at .10, uncertain at .01
81. The following R code performs the bootstrap simulation described in this section. mu0 = 113; N = 5000 x = c(117.6,109.5,111.6,109.2,119.1,110.8) w = x - mean(x) + mu0 wbar = rep(0,N) # allocating space for bootstrap means for (i in 1:N){ resample = sample(w,length(w),replace=T) wbar[i] = mean(resample) }
The P-value is estimated by the proportion i values that are at or below the of these w observed x value of 112.9667. In one run of this code, that proportion was .5018, so do not reject H0. 83. a. H0: the reformulated drug is no safer than the original, recalled drug. Ha: the reformulated drug is safer than the recalled drug. b. Type I error: The FDA rejects H0 and concludes the new drug is safer, when in fact it isn’t. Type II error: The FDA fails to recognize that Ha is true, yet the new drug is indeed safer. c. Type I (arguably); lower a 85. Yes, only 25 required 87. t = 6.4, P-value 0, reject H0
947
89. a. no b. t = .44, P-value = .33, do not reject H0 91. Assuming normality, calculate t = 1.70, which gives a two-tailed P-value of .102. Do not reject the null hypothesis H0: l = 1.75. 93. The P-value for a lower tail test is .0014, so it is reasonable to reject the idea that p = .75 and conclude that fewer than 75% of mechanics can identify the problem. 95. Because the P-value is .013 > .01, do not reject the null hypothesis at the .01 level. 97. a. For testing H0: l = l0 versus Ha: l > l0 at level a, reject H0 if 2Rxi/l0 > v2a;2n For testing H0: l = l0 versus Ha: l < l0 at level a, reject H0 if 2Rxi/l0 < v21a;2n For testing H0: l = l0 versus Ha: l 6¼ l0 at level a, reject H0 if 2Rxi/l0 > v2a=2;2n or if 2Rxi/l0 < v21a=2;2n b. Because Rxi = 737, the test statistic value is 2Rxi/l0 = 19.65, which gives a P-value of .52. There is no reason to reject the null hypothesis.
99. a. yes Chapter 10 1.
a. −.4 b. .0724, .269 c. Although the CLT implies that the distribution will be approximately normal when the sample sizes are each 100, the distribution will not necessarily be normal when the sample sizes are each 10.
3.
a. z = 4.84 1.96, reject H0 b. (1251, 2949)
5.
a. Ha says that the average calorie output for sufferers is more than 1 cal/cm2/min below that for non-sufferers. Reject H0 in favor of Ha because z = −2.90 < −2.33 b. .0019 c. .82, .18 d. 66
948
Answers to Odd-Numbered Exercises
7. a. We must assume here that the population elapsed time distributions are both normal. b. z = 1.32 < 2.576, do not reject H0 9. 22, no 11. b. It decreases. 13. a. 17
b. 21
c. 18
d. 26
41. a. ðx yÞ ta=2;mþn2 sp b. (–455, 45) c. (–448, 38)
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1=m þ 1=n
43. t = 3.88 3.365, so reject H0 45. a. (.000046, .000446); yes, because 0 does not fall in the CI b. t = 2.68, P-value = .01, reject H0
15. H0: l1 l2 = 0 versus Ha: l1 l2 < 0; t = –15.83, reject H0
47. a. yes b. $10,524 c. |t| = |–1.21| < 1.729, so do not reject H0; yes
17. a. t = –18.64 –1.686, strongly reject H0 b. t = 15.66 1.680, again strongly reject H0 c. at most .10
49. a. two-sample t b. t = 2.47 1.681, reject H0 c. paired t d. t = –4.34 –1.717, reject H0
19. a. ð219:6; 538:4Þ b. t = 2.20, P-value = .014, reject H0 21. a. No, mean < sd so positively skewed; no b. ($115, $375) 23. Because t = −3.35 < −3.30 = t.001,42, yes, there is evidence that experts do hit harder. 25. b. No c. Because |t| = |−.38| < 2.23 = t.025,10, no, there is no evidence of a difference.
51. b. (12.67, 25.16) 53. t = –2.2, P-value = .028, reject H0 57. a. Because |z| = |−4.84| > 1.96, conclude that there is a difference. Rural residents are more favorable to the increase. b. .9967 59. (.016, .171)
27. Because the one-tailed P-value is .0004 < .01, conclude at the .01 level that the difference is as stated. This could result in a type I error.
61. a. (–.294, –.207)
29. Yes, because t = 2.08 with P-value = .046.
65. a. p1 = the proportion of all students who would agree to be surveyed by Melissa, p2 = the proportion of all students who would agree to be surveyed by Kristine; z = 3.00, P-value = .003, reject H0 b. No
31. b. (127.6, 202.0) c. 131.75 33. Because t = 1.82 with P-value .046 < .05, conclude at the .05 level that the difference exceeds 1. 35. a. The slender distribution appears to have a lower mean and lower variance. b. With t = 1.88 and a P-value of .097, there is no significant difference at the .05 level. 37. With t = 2.19 and a two-tailed P-value of .031, there is a significant difference at the .05 level but not the .01 level.
63. H0: p1 p2 = 0 versus Ha:p1 p2 < 0, z = –2.01, P-value = .022, reject H0
67. 769 69. a. b. c. d.
H0: p3 ¼ p2 versus Ha: p3 [ p2 ^ p3 ^ p2 ¼ ðX3 X2 Þ=n pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ðX3 X2 Þ= X2 þ X3 z = 2.68, P-value = .0037, reject H0 at .01 but not at .001.
71. a. 3.69 b. 4.82 c. .207 e. 4.30 f. .212 g. .95
d. .271 h. .94
Answers to Odd-Numbered Exercises
73. f = 1.814 < F.10,9,7 = 2.72, so P-value > .10 > .01, do not reject H0 75. f = 4.38 < F.01,11,9 = 5.18, do not reject H0 77. (0.87, 2.41) 79. a. (.158, .735) b. Bootstrap distribution of differences looks quite normal. c. (.171, .723) for one simulation d. (.156, .740) for one simulation e. All three intervals are quite similar. f. Students on lifestyle floors appear to have a higher mean GPA, somewhere between *.16 higher and *.73 higher. 81. a. (0.593, 1.246); normal probability plots show departures from normality, CI might not be valid. b. The R code below assumes two vectors, L and N, contain the original data. ratio = rep(0,5000) for (i in 1:5000){ L.resamp = sample(L,length(L), replace=T) N.resamp = sample(N,length(N), replace=T) ratio[i] = sd(L.resamp)/sd(N.resamp) }
CI = (0.568, 1.289) for one simulation 83. a. The bootstrap distribution of differences of medians is definitely not normal. b. (0.38, 10.44) for one simulation c. (0.4706, 10.0294) for one simulation 85. a. t = 2.62, df = 17, P-value = .018, reject H0 at the .05 level b. In the R code below, the data is read is as a data frame called df with two columns, Time and Group. The first lists the times for each rat, while the second has B and C labels. N = 5000 diff = rep(0,N) for (i in 1:N){
949 resample = sample(df$Time, length (df$Time), replace=T) C.resamp = resample[df $Group==''C''] B.resamp = resample[df $Group==''B''] diff[i] = mean(C.resamp) - mean(B. resamp) }
P-value = 2(proportion of simulated differences > 10.59 – 5.71) = .02 for one simulation c. Results of (a) and (b) are similar 87. a. f = 4.46; F.95,6,5 = 0.228 < 4.46 < F.05,6,5 = 4.95, so do not reject H0 at .10 level. b. Use code similar to Exercise 85, but change the last line to calculate the ratio of resampled variances. Observed ratio = 4.48, proportion of ratio values 4.48 was .086 for one simulation, so P-value = 2(.086) = .172, and H0 is again not rejected. 89. a. Use the code from Exercise 85. Observed difference = 3.47, proportion of simulated differences 3.47 was .019 for one simulation, so P-value = 2(.019) = .038. Reject H0 at the .05 level. b. Results are similar; not surprising since both methods are valid. 91. a. ($6.40, $11.85) b. Use code provided in Chapter 8; bootstrap distribution of d is not normal. c. ($6.44, $11.81) for one simulation d. ($6.23, $11.51) for one simulation e. (a) and (c) are similar, while (d) is shifted to the left; (d) is most trustworthy f. On average, books cost between $6.23 and $11.51 more with Amazon than at the campus bookstore! 95. The difference is significant at the .05, .01, and .001 levels. 99. b. No, given that the 95% CI includes 0, the test at the .05 level does not reject equality of means. 101. (−299.2, 1517.8)
950
103. (1020.2, 1339.9) Because 0 is not in the CI, we would reject equality of means at the .01 level. 105. Because t = 2.61 and the one-tailed P-value is .007, the difference is significant at the .05 level using either a one-tailed or a two-tailed test. 107. a. l1 = true mean AEDI score improvement for all 2001 students; H0: l1 = 0 versus Ha: l1 > 0; t = 2.41, df = 36, P-value = .011, reject H0. b. Similarly, t = 2.19, P-value = .020, reject H0. c. H0: l1 – l2 = 0 versus H0: l1 – l2 < 0, t = –0.23, P-value = .411, do not reject H0. The data does not suggest an “Enron effect.”
Answers to Odd-Numbered Exercises
Chapter 11 1. a. Reject H0: l1 = l2 = l3 = l4 = l5 in favor of Ha: l1, l2, l3, l4, l5 not all the same, because f = 5.57 > 2.69 = F.05,4,30. b. Using Table A.8, .001 < P-value < .01. (The P-value is .0018) 3. SSTr = 2304, SSE = 4200, f = 5.76 F.05,2,21 = 3.47, reject H0 5. a. SSTr = 9982.4, MSTr = 1109.16 b. Source df SS MS Brand Error Total
113. Because the paired t = 3.88 and the twotailed P-value is .008, the difference is significant at the .05 and .01 levels, but not at the .001 level. 115. a. t = 11.86 > 2.33, reject H0 at .01 level. b. t = 8.99, again clearly reject H0. c. Yes, because students were randomly assigned to experimental groups. 117. .902, .826, .029, .00000003 119. Because z = 4.25 and the one-tailed P-value is .00001, the difference is highly significant and companies do discriminate. pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 121. With Z ¼ ðX YÞ= X=n þ Y=m, the result is z = −5.33, two-tailed P-value = .0000001, so one should conclude that there is a significant difference in parameters
9982.4 3900.0
1109.16 130.00
8.53
8.53 F.01,9,30 = 3.07, so reject H0. 7.
109. Because t = 7.50 and the one-tailed P-value is .0000001, the difference is highly significant, assuming normality. 111. The two-sample t test is inappropriate for paired data. The paired t gives a mean difference .3, t = 2.67, and the two-tailed P-value is .045, so the means are significantly different at the .05 level. We are concluding tentatively that the label understates the alcohol percentage.
9 30 39
f
Source Type Error Total
df 3 20 23
SS 127375 33839 161214
MS 42458 1692
f 25.09
P-value .000, so reject H0. 9. a. SSTr = 270, MSTr = 90, SSE = 17446, MSE = 167.75 b. f = 0.54 < F.05,3,104 2.69, so H0 is not rejected at the .05 level. 11. b. H0: l1 = l2 = l3 = l4 = l5 vs Ha: not all l’s are equal, f = [18797.5/4] / [251.5/15] = 4699.38/16.77 = 280.28, strongly reject H0. pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 15. Q:05;5;15 ¼ 4:37, and 4:37 272:8=4 ¼ 36:09 3 437.5
17.
3 427.5
19.
4 562.02
1 462.0 1 462.0
3 698.07
21. (6.401, 10.589)
4 469.3
2 512.8
4 469.3
1 713.00
2 502.8
5 532.1 5 532.1
2 756.93
Answers to Odd-Numbered Exercises
23. a.
SOO BSP CW SWE BMIPS ST 117 122 127 129 141 142
951 BSF N GMIPS GS 144 147 148 175
b. (–16.79, –0.73)
b. For the transformed data, y:: = 2.46, SSTr = 26.104, SSE = 7.748, f = 70.752, so H0 is clearly rejected. c. PW
25. 422.16 < SSE < 431.88 27. SSTr = 465.5, SSE = 124.5, f = 17.12 F.05,3,14 = 3.34, so reject H0 at .05 level. 29. a. With large sample sizes, normality is less important. 11.32/9.13 < 2 indicates equal variances is plausible. b. SSTr = 2445.7, SSE = 118,632.6, f = 20.92 F.05,4,1015 = 2.38, so H0 is rejected. c. Q.05,5,1015 Graduate 45.55
rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 1 3.86, dij ¼ 3:86 116:9 2 Ji þ Jj
Freshman Sophomore 48.95 51.45
MK 2.28
1.30
Junior 52.89
43. hðxÞ ¼ arcsinð
NW 2.62
50K 2.75
FR 2.82
pffiffiffiffiffiffiffi x=nÞ
45. a. MSA ¼ 7:65, MSE ¼ 4:93, fA ¼ 1:55. Since 1.55 < F.05,4,12 = 3.26, don’t reject H0A. b. MSB = 14.70, fB = 2.98 < F.05,2,12 = 3.49, don’t reject H0B. 47. a.
Source Method Block Error Total
df 5 16 80 101
SS 596,748 529,100 987,380 2,113,228
MS 119,349.6 3306.9 12342.3
f 9.67 0.27
Senior 52.92
31. li = true mean impact of social media, as a percentage of sales, for the ith category; SSTr = 3804, SSE = 76973, f = 8.85 F.01,2,358 4.66, so H0 is rejected at the .01 significance level. 33. a. The distributions of the polyunsaturated fat percentages for each of the four regimens must be normal with equal variances. b. SSTr = 8.334, SSE = 77.79, f = 1.714 < F.10,3,50 = 2.20, so P-value > .10 and H0 is not rejected. 35. li = true mean change in CMS under the ith treatment; SSTr = 19.84, SSE = 16867.6, f = 0.1129 < F.05,2,96 3.09, so H0 is not rejected at the .05 level. 37. When H0 is true, all the ai’s are 0, and E(MSTr) = r2. Otherwise, E(MSTr) > r2. 39. k = 10, F.05,3,14 = 3.344. From R, b = pf(3.344,df1=3,df2=14,ncp=10)= .372, and so power = 1 – b = 1 – .372 = .628. 41. a. The sample standard deviations are very different.
b. H0: a1 ¼ ¼ a5 = 0 versus Ha: not all a’s are 0. Since 9.67 F.01,5,80 = 3.255, we reject H0 at the .01 level. c. I = 6, J = 17, MSE = 12342.3, Q.01,6,80 pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 4.93, HSD = 4:93 12342:3=17 = 132.8. Wrist acc. 449
49. a.
Hip acc. 466
Pedometer 557
Wrist + LFE 579
Hip + LFE 606
Hand tally 668
Source
df
SS
MS
f
Spindle speed Feed rate Error Total
2
16106
8052.8
10.47
0.026
P-value
2 4 8
2156 3078 21339
1077.8 769.4
1.40
0.346
b. The test statistic value and P-value for H0A: a1 ¼ a2 ¼ a3 = 0 versus HaA: not all a’s = 0 are f = 10.47 and P = .026. Since .026 .05, we reject H0A at the .05 level and conclude that mean temperature varies with spindle speed. c. The test statistic and P-value for H0B: b1 ¼ b2 ¼ b3 = 0 versus HaB: not all b’s = 0 are f = 1.40 and P = .346. Since .346 > .05, do not reject H0B at the .05 level; conclude that feed rate has no statistically significant effect on mean temperature.
952
Answers to Odd-Numbered Exercises
51. c.
Source
df
SS
MS
f
Flavor Block (Subject) Error Total
2 53
20,797 135,833
10,398.5 2,562.9
35.0 8.6
106 161
31,506 188,136
297.2
With f = 35.0 F.01,2,106 4.81, H0A: a1 ¼ a2 ¼ a3 = 0 is rejected at the .01 level. d. Yes 53.
Source
df
SS
MS
f
Current Voltage Error Total
2 2 4 8
106.78 56.05 1115.75 1278.58
53.39 28.03 278.94
0.19 0.10
P-value 0.833 0.907
According to the ANOVA table, neither factor has a statistically significant effect at the .10 level: both P-values are > .10. 55. With f = 8.69 > 6.01 = F.01,2,18, there are significant differences among the three treatment means.The normal plot of residuals shows no reason to doubt normality, and the plot of residuals against the fitted values shows no reason to doubt constant variance. There is no significant difference between treatments B and C, but Treatment A differs (it is lower) significantly from the others at the .01 level. 57. Because f = 8.87 > 7.01 = F01,4,8, reject the hypothesis that the variance for B is 0.
d. Because 2.81 < 3.01 = F..05,6,24, there is no significant difference among the B means at the .05 level. e. Using d = 64.93,
SS 30763 34185.6 43581.2 97436.8 205966.6
MS 15381.5 11395.2 7263.5 4059.9
2
3960:2
4010:88
4029:10
65. a. Factor #1 = firing distance, levels = 25 yd, 50 yd. Factor #2 = bullet brand, levels = Federal, Remington, Winchester. Treatments: (25, Fed), (25, Rem), (25, Win), (50, Fed), (50, Rem), and (50, Win). b. The interaction plot suggests a huge distance effect. There appears to be very little bullet manufacturer effect. The nonparallel pattern suggests perhaps a slight interaction effect. c. Source
df
SS
Distance 1 568.97 Bullet 2 2.97 Distance 2 2.48 * bullet Error 444 1041.49 Total 449 1615.92
F 3.79 2.81 1.79
b. Because 1.79 < 2.51 = F.05,6,24, there is no significant interaction. c. Because 3.79 > 3.40 = F.05,2,24, there is a significant difference among the A means at the .05 level.
1
63. a. With f = 1.55 < 2.81 = F.10,2,12, there is no significant interaction at the .10 level. b. With f = 376.27 > 18.64 = F.001,2,12, there is a significant difference between the formulation means at the .001 level. With f = 19.27 > 12.97 = F.001,1,12, there is a significant difference among the speed means at the .001 level. c. Main effects Formulation: (1) 11.19, (2) −11.19 Speed: (60) 1.99, (70) −5.03, (80) 3.04
61. a. Source df A 2 B 3 Interaction 6 Error 24 Total 35
3
67.
Source DF pen 3 surface 2 Interaction 6 Error 12 Total 23
MS
f
P-value
568.969 242.56 0.000 1.487 0.63 0.531 1.242 0.53 0.589 2.346
SS MS F 1387.5 462.50 0.34 2888.1 1444.04 1.07 8100.3 1350.04 1.97 8216.0 684.67 20591.8
Answers to Odd-Numbered Exercises
953
With f = 1.97 < 2.33 = F.10,6,12, there is no significant interaction at the .10 level. With f = .34 < 3.29 = F.10,3,6, there is no significant difference among the pen means at the .10 level. With f = 1.07 < 3.46 = F.10,2,6, there is no significant difference among the surface means at the .10 level. 69.
Source                  df   Adj SS   Adj MS      f       P-value
Distance                 2   562424   281212      360.70   0.000
Temperature              2    11757     5879        7.54   0.004
Distance*Temperature     4    21715     5429        6.96   0.001
Error                   18    14033      780
Total                   26   609930

The ANOVA table indicates a highly statistically significant interaction effect (f = 6.96, P-value = .001). The interaction by itself indicates that both nozzle-bed distance and temperature play a significant role in determining strut width. Apply Tukey's method here to the nine (distance, temperature) pairs to identify honestly significant differences.

Distance*Temperature    N    Mean      Grouping
0.2, 220                3    935.000   A
0.2, 180                3    860.000   A B
0.2, 200                3    806.667     B
0.3, 200                3    676.667       C
0.3, 220                3    643.333       C
0.3, 180                3    610.000       C D
0.4, 220                3    538.333         D E
0.4, 200                3    511.667           E
0.4, 180                3    505.000           E

73. a. MSAB/MSE b. MSA/MSAB, MSB/MSAB
75.
Source       df    SS         MS         f
Treatment     3     81.1944    27.0648   22.36
Block         8     66.5000     8.3125    6.87
Error        24     29.0556     1.2106
Total        35    176.7500

Since 22.36 > F.05,3,24 = 3.01, reject H0A. There is an effect due to treatments. Next, Q.05,4,24 = 3.90, so Tukey's HSD is 3.90·√(1.2106/9) = 1.43. The treatment sample means, in increasing order, are 8.56 (treatment 1), 9.22 (treatment 4), 10.78 (treatment 3), and 12.44 (treatment 2); treatments 1 and 4 do not differ significantly, while treatment 3 differs from both of them and treatment 2 differs from all the others.
77. a. H0: μ1 = μ2 = μ3 = μ4 vs. Ha: at least two of the μi's are different; f = 3.68 < F.01,3,20 = 4.94, thus fail to reject H0. The means do not appear to differ. b. We reject H0 when the P-value ≤ α. Since .029 > .01, we still fail to reject H0.
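The HSD margins quoted in answers 69 and 75 come from the Studentized range distribution. As an illustration only (not part of the original solutions), the following Python sketch reproduces the margin 3.90·√(1.2106/9) ≈ 1.43 from answer 75; the variable names here are our own.

from math import sqrt
from scipy.stats import studentized_range

# Quantities from the randomized block ANOVA of answer 75:
# 4 treatment means, 24 error df, MSE = 1.2106, 9 observations per mean.
num_means, error_df, mse, n_per_mean = 4, 24, 1.2106, 9

q_crit = studentized_range.ppf(0.95, num_means, error_df)  # Q.05,4,24, about 3.90
hsd = q_crit * sqrt(mse / n_per_mean)                      # about 1.43
print(round(q_crit, 2), round(hsd, 2))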
79. SSTr = 6.172, SSE = 1045.75, f = 3.086/18.674 = 0.165 < F.05,2,56 = 3.16, so do not reject H0.
81.
Source   DF    SS      MS     F
Diet      4    .929    .232   2.15
Error    25   2.690    .108
Total    29   3.619

Because f = 2.15 < 2.76 = F.05,4,25, there is no significant difference among the diet means at the .05 level. b. (−.144, .474) Yes, the interval includes 0. c. .53
83. a. SSTr = 19812.6, SSE = 1083126, f = 9906.3/7125.8 = 1.30 < F.05,2,152 = 3.06, and H0 is not rejected. b. Q.05,3,152 ≈ 3.347 and dij = 3.347·√[(7125.8/2)(1/Ji + 1/Jj)] for each pair. Then d12 = 42.1, d13 = 37.2, and d23 = 40.7. None of the sample means are nearly this far apart, so Tukey's method provides no statistically significant differences. This is consistent with the results in part (a).
Chapter 12

1.
a. Both the BMI and peak foot pressure distributions appear positively skewed with some gaps and possible high outliers.
[Stem-and-leaf displays of BMI (leaf unit = 0.1) and peak foot pressure (leaf unit = 10); both displays show positively skewed distributions.]
b. No c. The scatterplot suggests some positive association between BMI and peak foot pressure, but the relationship does not appear to be very strong, and there are many outliers from the overall pattern.
3. Yes.
5. b. Yes c. The relationship of y to x is roughly quadratic.
7. a. 48.75 mpg b. −.0085 mpg c. −4.25 mpg d. 4.25 mpg
9. a. .095 m³/min b. −.475 m³/min c. .83 m³/min, 1.305 m³/min d. .4207, .3446 e. .0036
11. a. −.01 h, −.10 h b. 3.0 h, 2.5 h c. .3653 d. .4624
13. a. y = .63 + .652x b. 23.46, −2.46 c. 392, 5.72 d. 95.6% e. y = 2.29 + .564x, R² = 68.8%
15. a. y = −14.6497 + .09092x b. 1.8997 c. −.9977, −.0877, .0423, .7823 d. 42%
17. a. Yes b. slope, .827; intercept, −1.13 c. 40.22 d. 5.24 e. 97.5%
19. a. y = 75.212 − .20939x, 54.274 b. 79.1% c. 2.56
21. b. y = −0.398 + 3.080x c. A 1-cm increase in palpebral fissure width corresponds to an estimated 3.080 cm² increase in average/expected OSA. d. 3.452 cm² e. 3.452 cm²
25. New slope = 1.8β̂1; new intercept = 1.8β̂0 + 32.
29. β̂0 = Ȳ and β̂1 = β̂1.
31. a. .0756 b. .813 c. The n = 7 sample is preferable (larger Sxx).
33. H0: β1 = 0 versus Ha: β1 ≠ 0, t = 22.64, P-value ≈ 0, so there is a useful linear relationship. CI = (.748, .906)
35. a. β̂1 = 1.536, and a 95% CI is (.632, 2.440) b. Yes, for the test of H0: β1 = 0 versus Ha: β1 ≠ 0, we find t = 3.62, with P-value .0025. At the .01 level conclude that there is a useful linear relationship. c. Because 5 is beyond the range of the data, predicting at a dose of 5 might involve too much extrapolation. d. The observation does not seem to be exerting undue influence.
37. a. Yes, for the test of H0: β1 = 0 versus Ha: β1 ≠ 0, we find t = −6.73, with
P-value < 10⁻⁹. At the .01 level conclude that there is a useful linear relationship. b. (−2.77, −1.42)
43. a. 600 is closer to x̄ = 613.5 than is 750 b. (2.258, 3.188) c. (1.336, 4.110) d. at least 90%
45. a. y = −1.5846 + 2.58494x, 83.73% b. (2.16, 3.01) c. (−0.125, 0.058) d. (−0.559, 0.491) e. H0: μY|.7 = 0 versus Ha: μY|.7 ≠ 0; reject H0 because 0 is not in the confidence interval (0.125, 0.325) for μY|.7
47. (86.3, 123.5)
49. a. t = 4.88, P-value ≈ 0, so reject H0. A useful relationship exists. b. (64.2, 161.3) c. (10228, 11362) d. (8215, 13379) e. Wider, because 85 is farther from the mean x-value of 68.65 than is 70. f. No, extrapolation g. (8707, 10633); (10020, 11574); (10845, 13004)
51. a. Yes b. t = 6.45, P-value = .003, reject H0. A useful relationship exists. c. (.05191, .11544) d. (.00048, .16687)
55. a. For the test of H0: ρ = 0 versus Ha: ρ > 0, we find r = .7482, t = 3.91, with P-value < .05. At the .05 level conclude that there is a positive correlation. b. R² = .56; it is the same no matter which variable is the predictor.
57. a. t = 1.74 < 2.179, so do not reject H0: ρ = 0. b. R² = 20%
59. a. (.829, .914) b. z = 2.412, P-value = .008, so reject H0: ρ = 0
c. R² = 77.1% d. Still 77.1%
61. a. Reject the null hypothesis in favor of the alternative. b. No, with a large sample size a small r can be significant. c. Because t = 2.200 > 1.96 = t.025,9998, the correlation is statistically (but not necessarily practically) significant at the .05 level.
65. a. .184, −.238, −.426 b. The mean that is subtracted is not the mean x̄1,n−1 of x1, x2, …, xn−1, or the mean x̄2,n of x2, x3, …, xn. Also, the denominator of r1 is not √[Σ(xi − x̄1,n−1)²]·√[Σ(xi − x̄2,n)²], where the first sum runs over i = 1, …, n − 1 and the second over i = 2, …, n. However, if n is large then r1 is approximately the same as the correlation. A similar relationship applies to r2. c. No d. After performing one test at the .05 level, doing more tests raises the probability of at least one type I error to more than .05.
67. The plot suggests that the regression model assumptions of linearity/model adequacy and constant error variance are both plausible.
69. a. The plot does not show curvature, but equal variance is not satisfied. b. The standardized residual plot is similar to (a). The normality plot suggests normality of the true errors is plausible.
71. a. For testing H0: β1 = 0 versus Ha: β1 ≠ 0, t = 10.97, with P-value .0004. At the .001 level conclude that there is a useful linear relationship. b. The residual plot shows curvature, so the linear relationship of part (a) is questionable. c. There are no extreme standardized residuals, and the plot of standardized residuals is similar to the plot of ordinary residuals.
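The answers in this stretch report fitted lines, t ratios for H0: β1 = 0, and values of r and R², all of which come from standard least squares output. As an illustrative sketch only (the exercise data are not reproduced here, so the numbers below are hypothetical), such output can be obtained in Python:

from scipy import stats

# Hypothetical (x, y) data; replace with the data from a given exercise.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 3.8, 5.2, 5.9, 7.1]

fit = stats.linregress(x, y)
print("slope:", round(fit.slope, 4), "intercept:", round(fit.intercept, 4))
print("r:", round(fit.rvalue, 4), "R^2:", round(fit.rvalue ** 2, 4))
# Two-sided P-value for H0: beta1 = 0 (equivalently rho = 0) and se of the slope:
print("P-value:", round(fit.pvalue, 4), "se(slope):", round(fit.stderr, 4))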
73. a. The plot indicates there are no outliers, but there appears to be higher variance for middle values of filtration rate. b. The ratios ei/ei* range between .57 and .65, which are close to se. c. Similar to the plot in (a).
75. The first data set seems appropriate for a straight-line model. The second data set shows a quadratic relationship, so the straight-line relationship is inappropriate. The third data set is linear except for an outlier, and removal of the outlier will allow a line to be fitted. The fourth data set has only two values of x, so there is no way to tell if the relationship is linear.
77. a. $24,000 b. $16,300
79. b. 9.193 c. f = 82.75, P-value ≈ 0, so at least one of the four predictors is useful, but not necessarily all four. d. Compare each to t.00125,95 = 3.106. Predictors 1, 2, and 4 are useful.
81. a. y = −77 + 4.397x1 + 165x2 c. $2,781,500
83. a. R² = 34.05%, se = 0.967 b. H0: β1 = β2 = β3 = 0 versus Ha: not all three β's are 0; f = 2.065 < 3.49, so do not reject H0 at the .05 level c. Yes
85. a. y = 148 − 133x1 + 128.5x2 + 0.0351x3 b. For x1: t = −0.26, P = .798. For x2: t = 9.43, P < .0001. For x3: t = 1.42, P = .171. Only x2 is a statistically significant predictor of y. c. R² = 86.33%, R²a = 84.17% d. R² = 86.28%, R²a = 84.91%
87. a. f = 87.6, P-value ≈ 0, strongly reject H0 b. R²a = 93.5% c. (9.095, 11.087)
89. a. Always decreases b. 61.432 GPa c. f = 227.88 > 5.568, so H0 is rejected.
d. (55.458, 67.406) e. (53.717, 69.147)
91. a. Both plots exhibit curvature. b. No c. The plots suggest that all model assumptions are satisfied. d. All second-order terms should be retained.
93. a. H0: β1 = β2 = 0 versus Ha: not both β1 and β2 are 0; f = 22.91 ≥ F.05,2,9 = 4.26, so reject H0. Yes, there is a useful relationship. b. H0: β2 = 0 versus Ha: β2 ≠ 0, t = 4.01, P-value < .005, reject H0. Yes. c. (.5443, 1.9557) d. (2.91, 6.29)
95. a. The quadratic terms are important in providing a good fit to the data. b. A 95% PI is (.560, .771).
97. a. rRI = .843 (P-value = .000), rRA = .621 (.001), rIA = .843 (.000) b. Rating = 2.24 + 0.0419 IBU − 0.166 ABV. Because the predictors are highly correlated, one is redundant. c. Linearity is an issue. e. The regression is quite effective, with R² = 87.2%. The ABV coefficient is not significant, so ABV is not needed. The highly significant positive coefficient for IBU and negative coefficient for its square show that Rating increases with IBU, but the rate of increase is lower at higher IBU.
99. a. X = [1 −1 −1; 1 −1 1; 1 1 −1; 1 1 1], y = (1, 1, 0, 4)′, X′X = [4 0 0; 0 4 0; 0 0 4], X′y = (6, 2, 4)′ b. β̂ = (X′X)⁻¹X′y = (1.5, .5, 1)′ c. ŷ = (0, 2, 1, 3)′, y − ŷ = (1, −1, −1, 1)′, SSE = 4, MSE = 4 d. (−12.2, 13.2) e. For the test of H0: β1 = 0 versus Ha: β1 ≠ 0, we find |t| = .5 < t.025,1 = 12.7, so do not reject H0 at the .05 level. The x1 term does not play a significant role. f.
Source       DF   SS   MS    F
Regression    2    5   2.5   0.625
Error         1    4   4.0
Total         3    9

With f = .625 < 199.5 = F.05,2,1, there is no significant relationship at the .05 level.
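As a check on the matrix arithmetic in answer 99, and on the general formula β̂ = (X′X)⁻¹X′y appearing in answer 101 below, the following brief Python sketch simply evaluates the normal equations for the small design reconstructed above; it is not part of the original solution.

import numpy as np

# Intercept column plus two +/-1 coded predictors, and the response vector.
X = np.array([[1, -1, -1],
              [1, -1,  1],
              [1,  1, -1],
              [1,  1,  1]], dtype=float)
y = np.array([1.0, 1.0, 0.0, 4.0])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (X'X)^{-1} X'y = (1.5, 0.5, 1.0)
y_hat = X @ beta_hat                          # fitted values (0, 2, 1, 3)
sse = float(np.sum((y - y_hat) ** 2))         # SSE = 4; MSE = SSE/(n - k - 1) = 4
print(beta_hat, y_hat, sse)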
101. a. X′X = [n  Σxi; Σxi  Σxi²], (X′X)⁻¹ = [1/(nΣxi² − (Σxi)²)]·[Σxi²  −Σxi; −Σxi  n] b. X′y = (Σyi, Σxiyi)′, β̂ = (ȳ − (Sxy/Sxx)x̄, Sxy/Sxx)′
103. β̂0 = ȳ; se = √[Σ(yi − ȳ)²/(n − 1)]; ȳ ± t.025,n−1·se/√n
105. a. β̂0 = [1/(m + n)]·Σyi (sum over all m + n observations) = ȳ; β̂1 = (1/m)·Σyi (first sample) − (1/n)·Σyi (second sample) = ȳ1 − ȳ2 b. ŷ = (ȳ1, …, ȳ1, ȳ2, …, ȳ2)′; SSE = Σ(yi − ȳ1)² (first sample) + Σ(yi − ȳ2)² (second sample); se = √[SSE/(m + n − 2)]; sβ̂1 = se·√(1/m + 1/n) d. β̂0 = 128.166, β̂1 = −14.333; ŷ = (121, 121, 121, 135.33, 135.33, 135.33)′, SSE = 116.666, se = 5.4, 95% CI for β1: (−26.58, −2.09)
109. a. With f = 12.04 > 9.55 = F.01,2,7, there is a significant relationship at the .01 level. To test H0: β1 = 0 versus Ha: β1 ≠ 0, we find |t| = 2.96 > t.025,7 = 2.36, so reject H0 at the .05 level; the foot term is needed. To test H0: β2 = 0 versus Ha: β2 ≠ 0, we find |t| = 0.02 < t.025,7 = 2.36, so do not reject H0 at the .05 level; the height term is not needed. b. The highest leverage is .88 for the fifth point. The height for this student is given as 54 inches, too low to be correct for this group of students. Also this value differs by 8″ from the wingspan, an extreme difference. c. Point 1 has leverage .55, and this student has height 75, foot length 13, both quite high. Point 2 has leverage .31, and this student has height 66 and foot length 8.5, at the low end. Point 7 has leverage .31, and this student has both height and foot length at the high end. d. Point 2 has the most extreme residual. This student has a height of 66″ and a wingspan of 56″, differing by 10″, so the extremely low wingspan is probably wrong. e. For this data set it would make sense to eliminate points 2 and 5 because they seem to be wrong. However, outliers are not always mistakes and one needs to be careful about eliminating them.
111. a. p(10) = .060, p(50) = .777 b. odds(10) = .0639, odds(50) = 3.49 d. $37.50
113. a. H0: β1 = 0 versus Ha: β1 ≠ 0, z = −2.026, H0 is rejected at α = .05 b. (.675, .993)
115. a. β̂0 = .0573 and β̂1 = .00430 c. H0: β1 = 0 versus Ha: β1 ≠ 0, z = 0.74, H0 is not rejected.
117. a. .912 b. .794
119. c. .484 d. z1 = −0.38, z2 = 1.75, so do not reject H0: β1 = 0 but do reject H0: β2 = 0 in favor of Ha: β2 ≠ 0.
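Answers 111–119 summarize logistic regression output: estimated probabilities p(x), odds, and Wald z statistics for the coefficients. The sketch below shows how such quantities are typically computed in Python with statsmodels; the data are entirely hypothetical and only illustrate the mechanics.

import numpy as np
import statsmodels.api as sm

# Hypothetical predictor values and 0/1 responses (not the exercise data).
x = np.array([10, 20, 25, 30, 35, 40, 45, 50, 55, 60], dtype=float)
y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])

X = sm.add_constant(x)                # prepend the intercept column
fit = sm.Logit(y, X).fit(disp=False)  # maximum likelihood estimates

b0, b1 = fit.params
print("Wald z statistics:", fit.tvalues)  # coefficient / standard error
print("estimated p at x = 50:", 1 / (1 + np.exp(-(b0 + b1 * 50))))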
121. a. Flood damage increases with flood level, but there are two "jumps" at 2–3 ft and 5–6 ft. b. No
123. a. 50.73% b. .7122 c. To test H0: β1 = 0 versus Ha: β1 ≠ 0, we have t = 3.93, with P-value .0013. At the .01 level conclude that there is a useful linear relationship. d. (1.056, 1.275) e. ŷ = 1.014, y − ŷ = −.214
125. No, if the relationship of y to x is linear, then the relationship of y² to x is quadratic.
127. a. Yes b. ŷ = 98.293, y − ŷ = .117 c. se = .155 d. R² = .794 e. 95% CI for β1: (.0613, .0901) f. The new observation is an outlier, and has a major impact: the equation of the line changes from y = 97.50 + .0757x to y = 97.28 + .1603x; se changes from .155 to .291; R² changes from .794 to .616.
129. a. The paired t procedure gives t = 3.54 with a two-tailed P-value of .002, so at the .01 level we reject the hypothesis of equal means. b. The regression line is y = 4.79 + .743x, and the test of H0: β1 = 0 vs Ha: β1 ≠ 0 gives t = 7.41 with a P-value ≈ 0, so H0 is rejected.

Chapter 13

11. a. (−∞, −.967), [−.967, −.431), [−.431, 0), [0, .431), [.431, .967), [.967, ∞) b. (−∞, .49806), [.49806, .49914), [.49914, .50), [.50, .50086), [.50086, .50194), [.50194, ∞) c. Because χ² = 5.53 with P-value > .10, do not reject H0.
13. a. With θ = P(male), θ̂ = .504, χ² = 3.45, df = 4 − 1 − 1 = 2. Do not reject H0 because χ² ≤ χ².05,2 = 5.992. b. No, because the expected count for the last category is too small.
15. μ̂ = 3.167, which gives χ² = 103.9 with P-value < .001, so reject the assumption of a Poisson model.
17. The observed test statistic value is χ² = 6.668 < 10.645 (df = 9 − 1 − 2 = 6), so H0 is not rejected at the .10 level.
19. χ² = 2.788 = z²; P-values are the same. Reject H0 at the .10 level but not at the .05 level.
21. a. χ² = 4.504 < χ².05,4 = 9.488, so H0 is not rejected.
23. χ² = 9.858 < χ².05,6 = 12.592, so H0 is not rejected at the .05 level.
25. a. Reject H0 because χ² = 11.954 at 2 df and P-value = .003. b. Very large sample sizes make the test capable of detecting even slight deviations from H0.
27. êijk = ni·· n·j· n··k /n²; df = IJK − (I + J + K) + 2 = 28.
29. a. Because .6806 < χ².10,2 = 4.605, H0 is not rejected. b. Now χ² = 6.806 ≥ 4.605, and H0 is rejected. c. 677
31. a. With χ² = 6.45 and P-value .040, reject independence at the .05 level. b. With z = −2.29 and P-value .022, reject independence at the .05 level. c. Because the logistic regression takes into account the order in the professorial ranks, it should be more sensitive, so it should give a lower P-value. d. There are few female professors but many assistant professors, and the assistant professors will be the professors of the future.
33. χ² = 5.934, df = 2, P-value = .059. So, at the .05 level, we (barely) fail to reject H0.
35. χ² = 29.775 ≥ χ².05,(4−1)(3−1) = 12.592. So, H0 is rejected at the .05 level.
37. a. H0: The population proportion of Late Game Leader Wins is the same for all four sports; Ha: The proportion of Late Game Leader Wins is not the same for all four sports. With χ² = 10.518 > 7.815 = χ².05,3, reject the null hypothesis at the .05 level. Sports differ in terms of coming from behind late in the game. b. Yes (baseball)
39. χ² = 881.36, df = 16, P-value is effectively zero. USA respondents were more amenable to torture than the Europeans, while South Korean respondents were vastly more likely than anyone else to say it's "sometimes" okay to torture terror suspects.
41. a. No, χ² = 9.02 > 7.815 = χ².05,3. b. With χ² = .157 < 6.251 = χ².10,3, there is no reason to say the model does not fit.
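The chi-squared statistics in these answers are straightforward to reproduce in software. The following Python sketch, using made-up counts rather than the exercise data, shows a goodness-of-fit test against fully specified cell probabilities and a test of independence/homogeneity for a two-way table.

import numpy as np
from scipy.stats import chisquare, chi2_contingency

# Goodness of fit with fully specified null probabilities (hypothetical counts).
observed = np.array([28, 32, 45, 15])
null_probs = np.array([0.25, 0.25, 0.40, 0.10])
stat, p = chisquare(observed, f_exp=null_probs * observed.sum())
print("GOF chi-square:", round(stat, 3), "P-value:", round(p, 4))

# Two-way table (hypothetical): statistic, P-value, df, and expected counts.
table = np.array([[22, 18, 10],
                  [14, 26, 30]])
stat2, p2, df, expected = chi2_contingency(table)
print("table chi-square:", round(stat2, 3), "df:", df, "P-value:", round(p2, 4))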
43. a. H0: p0 = p1 = ··· = p9 = .10 versus Ha: at least one pi ≠ .10, with df = 9. b. H0: pij = .01 for i and j = 0, 1, 2, …, 9 versus Ha: at least one pij ≠ .01, with df = 99. c. No, there must be more observations than cells to do a valid chi-square test. d. The results give no reason to reject randomness.

Chapter 14

1. (y(4), y(15)) = ($55,000, $61,000)
3. (Y(14), Y(27))
5. P-value = 1 − B(18; 25, .5) = .007, so reject H0 at the .05 level.
7. P-value = 1 − B(16; 24, .5) = .032, so reject H0 at the .05 level.
9. Assuming the distribution of differences is symmetric, let p = the true proportion of individuals who would perceive a longer time for the shorter exam (positive difference) in this experiment. The hypotheses are equivalent to H0: p = .5 versus Ha: p > .5. P-value ≈ 0, so H0 is strongly rejected.
11. s+ = 27, and since 27 is neither ≥ 64 nor ≤ 14, we do not reject H0.
13. s+ = 22 < 24, so H0 is not rejected at the .05 level.
15. Test H0: μD = 0 versus Ha: μD ≠ 0. s+ = 72 ≥ 64, so H0 is rejected at level .05.
17. a. Test H0: μD = 0 versus Ha: μD < 0. s+ = 2 ≤ 6, and so H0 is rejected at the .055 level. b. Test H0: μD = 0 versus Ha: μD > 0. Because s+ = 28 < 30, H0 cannot be rejected at this level.
19. a. Assume that the population distribution of differences is at least symmetric. With s+ = 3 ≤ 6, H0 is rejected. P-value = .021. b. Assume that the population distribution of differences is normal. t = −2.54, df = 7, P-value = .019, so H0 is rejected. c. The P-values were .035 (sign test), .021 (Wilcoxon signed-rank test), and .019 (paired t test). As is typical, the P-value decreases with more powerful tests. But all three tests agree that H0 is rejected at the .05 level, and the sign test has the fewest assumptions.
21. (7.22, 7.73)
23. (−.1745, −.0110)
25. With w = 38, reject H0 at the .05 level because the rejection region is {w ≥ 36}.
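The sign-test P-values of the form 1 − B(c; n, .5) and the signed-rank statistics s+ in the preceding answers can be computed directly. A short Python sketch with hypothetical paired differences (the exercise data are not shown here):

from scipy.stats import binom, wilcoxon

# Sign test: upper-tail P-value 1 - B(18; 25, .5), as in answer 5 (about .007).
print("sign test P-value:", 1 - binom.cdf(18, 25, 0.5))

# Wilcoxon signed-rank test on hypothetical differences; with alternative='greater',
# scipy reports the sum of the positive-difference ranks (s+ in the text's notation).
diffs = [2.1, -0.4, 1.7, 3.0, 0.9, -1.2, 2.6, 1.1]
s_plus, p = wilcoxon(diffs, alternative="greater")
print("s+ =", s_plus, " P-value =", round(p, 4))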
27. Test H0: μ1 − μ2 = 1 versus Ha: μ1 − μ2 > 1. After subtracting 1 from the original process measurements, we get w = 65. Do not reject H0 because w < 84.
29. b. Test H0: μ1 − μ2 = 0 vs Ha: μ1 − μ2 < 0. With a P-value of .0027 we reject H0 at the .01 level.
31. Test H0: μ1 − μ2 = 0 vs Ha: μ1 − μ2 > 0. W has mean m(m + n + 1)/2 = 59.5 and variance mn(m + n + 1)/12 = 89.25. z = 2.33, P-value = .01, so H0 is rejected at the .05 level.
33. Pain: z = −1.40, P-value = .0808. Depression: z = −2.93, P-value = .0017. Anxiety: z = −4.32, P-value < .0001. Fail to reject first H0, reject last two. Chance of at least one type I error is no more than .03.
35. (16, 87)
37. h = 21.43, df = 3, P-value < .0001, so H0 is strongly rejected.
39. h = 9.85, df = 2, P-value = .007 < .01, so H0 is rejected.
43. a. Rank averages of the three positions/rows are r̄1 = 12/6 = 2, r̄2 = 13/6 ≈ 2.17, r̄3 = 11/6 ≈ 1.83; Fr = 0.333, df = 2, P-value ≈ .85, so H0 is certainly not rejected. b. Fr = 6.34, df = 2, P-value = .042, so reject H0 at the .05 level. c. Fr = 1.85, df = 2, P-value = .40, so H0 is not rejected.
45. H0: μ1 = ··· = μ10 versus Ha: not all μi's are equal. Fr = 78.67, df = 9, P-value ≈ 0, so H0 is resoundingly rejected. The four algorithms inspired by quantum computing (Q's in the name) have much lower rank means, suggesting they are far better at minimizing entropy.
49. Test H0: μ1 − μ2 = 0 vs Ha: μ1 − μ2 > 0. Rank sum for first sample = 26, P-value = .014. Reject H0 at .05 but not .01.
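The Kruskal-Wallis statistic h and the Friedman statistic Fr used in answers 37–45 are available in scipy; the sketch below uses small hypothetical samples purely to show the calls.

from scipy.stats import kruskal, friedmanchisquare

# Kruskal-Wallis test for three independent samples (hypothetical data).
g1 = [14.2, 15.1, 13.8, 16.0]
g2 = [17.5, 18.2, 16.9, 19.1]
g3 = [12.1, 11.8, 13.0, 12.6]
h, p = kruskal(g1, g2, g3)
print("h =", round(h, 2), " P-value =", round(p, 4))

# Friedman test: one list per treatment, entries matched by block (hypothetical data).
t1 = [7.1, 6.8, 7.4, 7.0, 6.9]
t2 = [7.9, 7.5, 8.1, 7.8, 7.7]
t3 = [6.5, 6.7, 6.6, 6.9, 6.4]
fr, p_fr = friedmanchisquare(t1, t2, t3)
print("Fr =", round(fr, 2), " P-value =", round(p_fr, 4))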
51. mean = 637.5, variance = 10731.25 a. z = −0.21, P-value = .83. The data do not contradict a claim that the sensory test is reliable. b. z = 1.70, P-value = .09. Reject H0 at the .10 level. This might indicate a lack of reliability of the sensory test for the population of healthy patients.
53. Test H0: μ1 = ··· = μ5 versus Ha: not all μi's are equal. h = 20.21 ≥ 13.277, so H0 is rejected at the .01 level.
55. μi = population mean skin potential (mV) with the ith emotion (i = 1 for fear, etc.). The hypotheses are H0: μ1 = ··· = μ4 versus Ha: not all μi's are equal. Fr = 6.45 < χ².05,3 = 7.815, so we fail to reject H0 at the .05 level.
57. Because w′ = 26 < 27, do not reject the null hypothesis at the 5% level.
Chapter 15

1. a. p(.50) = .80 and p(.75) = .20 b. p(.50 | HHHTH) = .6124 and p(.75 | HHHTH) = .3876
3. a. Gamma(9, 5/3) b. Gamma(145, 5/53)
5. α1 = α0 + Σxi, β1 = 1/(n + 1/β0)
7. α1 = α0 + nr, β1 = β0 + Σxi − nr
9. Normal, μ1 = (s0μ0 + Σ ln xi)/(s0 + n), s1 = s0 + n
11. a. 13.68 b. (11.54, 15.99)
15. Normal, mean = 116.77, variance = 10.227, same as previous
17. (.485, .535)
19. a. α0β0 b. (α0 + ΣXi)/(n + 1/β0)
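The gamma-Poisson updating rule stated in answer 5 (a gamma prior in shape/scale form combined with Poisson data) is simple to apply numerically. The sketch below uses a Gamma(9, 5/3) prior together with hypothetical data summaries (the values of n and Σxi are illustrative choices, not taken from the exercise) to produce the posterior distribution and a 95% credibility interval.

from scipy.stats import gamma

a0, b0 = 9.0, 5.0 / 3.0    # prior shape and scale
n, sum_x = 10, 136         # hypothetical sample size and total count

# Posterior: shape a1 = a0 + sum(x_i), scale b1 = 1/(n + 1/b0), per answer 5.
a1 = a0 + sum_x
b1 = 1.0 / (n + 1.0 / b0)

posterior = gamma(a1, scale=b1)
print("posterior mean:", round(posterior.mean(), 3))
print("95% credibility interval:", tuple(round(q, 3) for q in posterior.interval(0.95)))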
References
Overview and Descriptive Statistics Chambers, John, William Cleveland, Beat Kleiner, and Paul Tukey, Graphical Methods for Data Analysis, Brooks/Cole, Pacific Grove, CA, 1983. A highly recommended presentation of graphical and pictorial methodology in statistics. Cleveland, William S., The Elements of Graphing Data (2nd ed.), Hobart Press, Summit, NJ, 1994. A very nice survey of graphical methods for displaying and summarizing data. Freedman, David, Robert Pisani, and Roger Purves, Statistics (4th ed.), Norton, New York, 2007. An excellent, very nonmathematical survey of basic statistical reasoning and concepts. Hoaglin, David, Frederick Mosteller, and John Tukey, Understanding Robust and Exploratory Data Analysis, Wiley-Interscience, New York, 2000. Discusses why, as well as how, exploratory methods should be employed; it is good on details of stem-and-leaf displays and boxplots. Moore, David S. and William I. Notz, Statistics: Concepts and Controversies (9th ed.), Freeman, San Francisco, 2016. An extremely readable and entertaining paperback that contains an intuitive discussion of problems connected with sampling and designed experiments. Peck, Roxy, et al. (eds.), Statistics: A Guide to the Unknown (4th ed.), Thomson-Brooks/Cole, Belmont, CA, 2005. Contains many short, nontechnical articles describing various applications of statistics. Utts, Jessica, Seeing Through Statistics (4th ed), Cengage Learning, Boston, 2014. The focus is on statistical literacy and critical thinking; a wonderful exposition.
Probability and Probability Distributions Balakrishnan, N., Norman L. Johnson, and Samuel Kotz, Continuous Univariate Distributions, Vol 1 (3rd ed.),
Wiley, Hoboken, NJ, 2016. Encyclopedic, not for bedtime reading. Carlton, Matthew A. and Jay L. Devore, Probability with STEM Applications,Wiley, Hoboken, NJ, 2021. An expansion of the material in Chapters 2–6 of this book (Modern Mathematical Statistics with Applications) plus additional material. Gorroochurn, Prakash, Classic Problems of Probability, Wiley, Hoboken, NJ, 2012. An entertaining excursion through 33 famous probability problems. Johnson, Norman L, Adrienne W. Kemp, and Samuel Kotz, Univariate Discrete Distributions (3rd ed.), Wiley, Hoboken, NJ, 2005. Encyclopedic, not for bedtime reading. Olofsson, Peter, Probabilities: The Little Numbers That Rule Our Lives (2nd ed.), Wiley, Hoboken, NJ, 2015. A very non-technical and thoroughly charming introduction to the quantitative assessment of uncertainty. Ross, Sheldon, A First Course in Probability (10th ed.), Prentice Hall, Upper Saddle River, NJ, 2018. Rather tightly written and more mathematically sophisticated than this text but contains a wealth of interesting examples and exercises. Ross, Sheldon, Introduction to Probability Models (12th ed.), Academic Press, Cambridge, MA, 2019. Another tightly written exposition of somewhat more advanced topics, again with a wealth of interesting examples and exercises. Winkler, Robert, Introduction to Bayesian Inference and Decision (2nd ed.), Probabilistic Publishing, Sugar Land, Texas, 2003. A very good introduction to subjective probability.
Basic Inferential Statistics Casella, George and Roger L. Berger, Statistical Inference (2nd ed.), Cengage Learning, Boston, 2001. The focus is on the theory of mathematical statistics; the exposition is appropriate for advanced undergraduates and MS-level students. Hopefully there will be a new edition soon.
964 Daniel, Cuthbert and Fred S. Wood, Fitting Equations to Data: Computer Analysis of Multifactor Data (2nd ed.), Wiley-Interscience, New York, 1999. Contains many insights and methods that evolved from the authors’ extensive consulting experience. DeGroot, Morris, and Mark Schervish, Probability and Statistics (4th ed.), Pearson, Englewood Cliffs, NJ, 2011. Includes an excellent discussion of both general properties and methods of point estimation; of particular interest are examples showing how general principles and methods can yield unsatisfactory estimators in particular situations. Davison, A.C. and D.V. Hinkley, Bootstrap Methods and Their Application, Cambridge University Press, Cambridge, UK, 1997. General principles and methods interspersed with many examples. Efron, Bradley, and Robert Tibshirani, An Introduction to the Bootstrap, Chapman and Hall, New York, 1993. The first general accessible exposition, and still informative and authoritative. Good, Philip, A Practitioner’s Guide to Resampling for Data Analysis, Data Mining, and Modeling, Chapman and Hall, New York, 2019. Obviously brand new, and hopefully informative about recent bootstrap methodology. Meeker, William Q., Gerald J. Hahn, and Luis A. Escobar, Statistical Intervals: A Guide for Practitioners and Researchers (2nd ed.), Wiley, Hoboken, NJ, 2016. Everything you ever wanted to know about statistical intervals (confidence, prediction, tolerance, and others). Rice, John, Mathematical Statistics and Data Analysis (3rd ed.), Cengage Learning, Boston, 2013. A nice blending of statistical theory and data analysis. Wilcox, Rand, Introduction to Robust Estimation and Hypothesis Testing (4th ed.), Academic Press, Cambridge, MA, 2016. Presents alternatives to the inferential methods based on t and F distributions.
Specific Topics in Inferential Statistics Agresti, Alan, An Introduction to Categorical Data Analysis (3rd ed.), Wiley, New York, 2018. An excellent treatment of various aspects of categorical data analysis by one of the most prominent researchers in this area. Chatterjee, Samprit and Ali Hadi, Regression Analysis by Example (5th ed.), Wiley, Hoboken, NJ, 2012. A relatively brief but informative discussion of selected topics. Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin, Bayesian Data Analysis (3rd ed.), Chapman and Hall, New York, 2013. A comprehensive survey of theoretical,
References practical, and computational issues in Bayesian inference; its authors have made many contributions to Bayesian methodology. Hollander, Myles, and Douglas Wolfe, Nonparametric Statistical Methods (3rd ed.), Wiley, Hoboken, NJ, 2013. A very good reference on distribution-free (i.e. non-parametric) methods with an excellent collection of tables. Lehmann, Erich, Nonparametrics: Statistical Methods Based on Ranks (revised ed.), Springer, New York, 2006. An excellent development of the most important “classical” (i.e. pre-bootstrap) methods, presented with a great deal of insightful commentary. Miller, Rupert, Beyond ANOVA: The Basics of Applied Statistics, Wiley, Hoboken, NJ, 1986. An excellent source of information about assumption checking and alternative methods of analysis. Montgomery, Douglas, Design and Analysis of Experiments (9th ed.), Wiley, New York, 2019. An up-todate presentation of ANOVA models and methodology. Kutner, Michael, Christopher Nachtsheim, John Neter, and William Li, Applied Linear Statistical Models (5th ed.), McGraw-Hill, New York, NY, 2005. The first 14 chapters constitute an extremely readable and informative survey of regression analysis. The second half of the book contains a well-presented survey of ANOVA. The level throughout is comparable to that of the present text (generally without proofs); the comprehensive discussion makes the book an excellent reference. Ott, R. Lyman, and Michael Longnecker, An Introduction to Statistical Methods and Data Analysis (7th ed.), Cengage Learning, Boston, 2015. Includes several chapters on ANOVA and regression methodology that can profitably be read by students desiring a nonmathematical exposition; there is a good chapter on various multiple comparison methods. Rencher, Alvin C. and William F. Christensen, Methods of Multivariate Analysis (3rd ed.), Wiley, Hoboken, NJ, 2012. Arguably the definitive text on multivariate analysis, including extensive information on the multivariate normal distribution.
Statistical Computing and Simulation Datar, Radhika and Harish Garg, Hands On Exploratory Data Analysis with R, Packt Publishing, Birmingham, UK, 2019. Explains how the R software package can be used to explore various types of data. Law, Averill M., Simulation Modeling and Analysis (5th ed.), McGraw-Hill, New York, 2014. An authoritative survey of various aspects and methods of simulation.
Index
A Additive model, 674 for ANOVA, 674–676 for linear regression analysis, 712 for multiple regression analysis, 767 Additive model equation, 704 Adjusted coefficient of multiple determination, 772 Adjusted R2, 772 Alternative hypothesis, 502 Analysis of covariance, 790 Analysis of variance (ANOVA), 639 additive model for, 674–676, 687 data transformation for, 667 definition of, 639 expected value in, 646, 662, 679, 687 fixed vs. random effects, 667 Friedman test, 888 fundamental identity of, 644, 653, 677, 690, 722 interaction model for, 687–695 Kruskal–Wallis test, 887 Levene test, 649–650 linear regression and, 746, 749, 764, 798, 805 mean in, 640, 642, 643 mixed effects model for, 682, 692 multiple comparisons in, 653–660, 666, 679–680, 691 noncentrality parameter for, 663, 671 notation for, 642, 707 power curves for, 663–664 randomized block experiments and, 680–682 regression identity of, 722–723 sample sizes in, 663–664 single-factor, 640–672 two-factor, 672–695 type I error in, 645–646 type II error in, 662 Anderson-Darling test, 835
ANOVA table, 647 Approximate 100(1–a)% confidence interval for p, 476 Ansari–Bradley test, 888 Association, causation and, 301, 754 Asymptotic normal distribution, 372, 436, 443, 445, 754 Asymptotic relative efficiency, 867, 876 Autocorrelation coefficient, 757 Average definition of, 26 deviation, 33 pairwise, 446, 868–869, 876 rank, 887 weighted, (see Weighted average)
B Balanced study design, 642 Bar graph, 9, 18 Bartlett's test, 649 Bayes estimator, 897 Bayesian approach to inference, 855, 896–897 Bayes' Theorem, 80–83, 896–897 Bernoulli distribution, 118, 139, 150, 375, 441–443, 442, 898 Bernoulli random variable, 112 binomial random variable and, 146, 376 Cramer–Rao inequality for, 443 definition of, 112 expected value, 126 Fisher information on, 438–439, 442 Laplace's rule of succession and, 900 mean of, 127 mle for, 443 moment generating function for, 139, 140, 143
pmf of, 118 score function for, 440 in Wilcoxon's signed-rank statistic, 314 Beta distribution, 244–245, 897 Beta functions, incomplete, 244 Bias, 400, 487 Bias-corrected and accelerated interval, 490, 617 Bimodal histogram, 17 Binomial distribution basics of, 144–151 Bayesian approach to, 897–900 multinomial distribution and, 286 normal distribution and, 223–224, 375 Poisson distribution and, 157–159 Binomial experiment, 144–147, 150, 158, 286, 375, 823 Binomial random variable X, 146 Bernoulli random variables and, 150, 375 cdf for, 148 definition of, 146 distribution of, 148 expected value of, 150, 151 in hypergeometric experiment, 167 in hypothesis testing, 504–505, 526–529 mean of, 150–151 moment generating function for, 151 multinomial distribution of, 286 in negative binomial experiment, 168 normal approximation of, 223–224, 375 pmf for, 148 and Poisson distribution, 157–159 standard deviation of, 150 unbiased estimation, 406, 434
966 variance of, 150, 151 Binomial theorem, 151, 168–170 Bioequivalence tests, 637 Birth process, pure, 445 Bivariate, 2 Bivariate data, 2, 706, 710, 720, 777, 820 Bivariate normal distribution, 330–334, 550, 749–750 Blocking, 680–682 Bonferroni confidence intervals, 499, 740–742, 776 Bootstrap, 484 Bootstrap distribution, 485 Bootstrap procedure for confidence intervals, 484–492, 617–619 for paired data, 622–624 Bootstrap P-value, 557 Bound on the error of estimation, 457 Box–Muller transformation, 342 Boxplot, 37–39 comparative, 39–42 Branching process, 352 Bootstrap sample, 485 Bootstrap standard error, 486 Bootstrap t confidence interval, 489
C Categorical characteristic, 1 Categorical data classification of, 18 graphs for, 18 in multiple regression analysis, 787–790 Pareto diagram, 25 sample proportion in, 18 Cauchy distribution mean of, 384, 408 median of, 408 minimal sufficiency for, 432 reciprocal property, 274 standard normal distribution and, 342 uniform distribution and, 263 variance of sample mean for, 414 Causation, association and, 301, 754 cdf. See Cumulative distribution function Cell counts/frequencies, 824–826, 829–834, 840–847 Cell probabilities, 828, 829, 834 Censored experiments, 31, 409–410 Censoring, 409 Census, 1 Central Limit Theorem (CLT) basics of, 371–377, 395–396
Index Law of Large Numbers and, 376 proof of, 395–396 sample proportion distribution and, 224 Wilcoxon rank-sum test and, 877 Wilcoxon signed-rank test and, 864 Central t distribution, 383–384, 497 Chebyshev's inequality, 137, 156, 187, 228, 337, 380 Chi-squared critical value, 481 Chi-squared distribution censored experiment and, 496 in confidence intervals, 457–458, 482 critical values for, 382, 458, 481–482, 550, 825, 834–835 definition of, 236 degrees of freedom for, 380, 381 exponential distribution and, 389 F distribution and, 385–386 gamma distribution and, 231, 388 in goodness-of-fit tests, 823–839 Rayleigh distribution and, 263 standard normal distribution and, 263, 391–392, 395 of sum of squares, 333, 643 t distribution and, 383, 385 in transformation, 259 Weibull distribution and, 274 Chi-squared random variable in ANOVA, 645 cdf for, 380 expected value of, 313 in hypothesis testing, 565 in likelihood ratio tests, 549, 553 mean of, 388 moment generating function of, 380 pdf of, 381 standard normal random variables and, 381–382 in Tukey’s procedure, 659 variance of, 390 Chi-squared test degrees of freedom in, 825, 831, 833, 841, 845 for goodness of fit, 823–829 for homogeneity, 841–843 for independence, 844–846 P-value for, 836–837 for specified distribution, 828–829 z test and, 836 Classes, 14 Class intervals, 14–16, 346, 364, 835, 837 Coefficient of determination, 721
definition of, 720–722 F ratio and, 774 in multiple regression, 772 sample correlation coefficient and, 746 Coefficient of (multiple) determination, 771 Coefficient of skewness, 138, 144, 211 Coefficient of variation, 44, 272, 423 Cohort, 351 Combination, 70–72 Comparative boxplot, 42–43, 588, 589, 642 Complement of an event, 52, 59 Complete second-order model, 786 Composite, 542 Compound event, 51, 61 Concentration parameter, 895 Conceptual population, 6, 128, 359, 569 Conditional expectation, 320 Conditional density, 319 Conditional distribution, 317–327, 428, 435, 749, 833, 889 Conditional mean, 320–327 Conditional probability, 75–83, 87–88, 236–238, 428, 431–432 Conditional probability density function, 317 Conditional probability mass function, 317 Conditional variance, 320–322, 433 Confidence bound, 460–461, 464, 567, 575, 593 Confidence interval adjustment of, 480 in ANOVA, 654, 659–660, 666, 678, 680, 692 based on t distribution, 470–473, 575–577, 592–594, 646, 659–661, 730–733 Bonferroni, 499, 740–742 bootstrap procedure for, 484–491, 622, 617–619, 624 for a contrast, 660 for a correlation coefficient, 753 vs. credibility interval, 899–903 definition of, 451 derivation of, 457 for difference of means, 579–580, 581–583, 592–595, 617–619, 632–633, 647–651, 666, 679, 693 for difference of proportions, 609 distribution-free, 855–860 for exponential distribution parameter, 458
Index in linear regression, 704–707, 739–741 for mean, 452–456, 458, 464–465, 484–488, 490–491 for median, 419–421 in multiple regression, 773, 822 one-sided, 460, 567, 593 for paired data, 593–595, 623 for ratio of variances, 615–616, 621 sample size and, 456 Scheffé method for, 702 sign, 860 for slope coefficient, 729 for standard deviation, 481–482 for variance, 481–482 width of, 453, 456–457, 467, 478, 490, 568 Wilcoxon rank-sum, 873–875 Wilcoxon signed-rank, 867–869 Confidence level definition of, 451, 454–456 simultaneous, 654–659, 666, 672, 741 in Tukey's procedure, 654–659, 666, 679, 680 Confidence set, 867 Conjugate prior, 893 Consistent, 408 Consistency, 377, 424, 443–444 Consistent estimator, 377, 424, 443–444 Contingency tables, two-way, 840–848 Continuity correction, 223–224 Continuous random variable(s) conditional pdf for, 318, 903 cumulative distribution function of, 195–200 definition of, 114, 190 vs. discrete random variable, 192 expected value of, 203–204 joint pdf of (see Joint probability density functions) marginal pdf of, 281–283 mean of, 203, 204 moment generating of, 208–210 pdf of (see Probability density function) percentiles of, 198–200 standard deviation of, 205–207 transformation of, 258–262, 336–341 variance of, 205–207 Contrast, 660 Contrast of means, 659–660 Convenience samples, 6 Convergence in distribution, 164, 258
967 in mean square, 377 in probability, 377 Convex function, 275 Convolution, 307 Correction factor, 167 Correlation coefficient, 299 autocorrelation coefficient and, 757 in bivariate normal distribution, 330–334, 749 confidence interval for, 753 covariance and, 298 Cramér–Rao inequality and, 441–442 definition of, 299, 746 estimator for, 749 Fisher transformation, 751 for independent random variables, 299 in linear regression, 746, 750, 751 measurement error and, 355 paired data and, 596–597 sample (see Sample correlation coefficient) Covariance, 296 correlation coefficient and, 299 Cramér–Rao inequality and, 441–442 definition of, 296 of independent random variables, 300–301 of linear functions, 298 matrix format for, 799 Covariance matrix, 799 Covariate, 790 Cramér–Rao inequality, 441–442 Credibility interval, 899 Critical values chi-squared, 381 F, 386 standard normal (z), 217 studentized range, 654 t, 384, 481 tolerance, 469 Wilcoxon rank-sum interval, 876, 925 Wilcoxon rank-sum test, 871–879, 888, 924 Wilcoxon signed-rank interval, 867, 923 Wilcoxon signed-rank test, 864, 865, 886 Cumulative distribution function, 119, 195 Cumulative distribution function for a continuous random variable, 195
for a discrete random variable, 119 joint, 352 of order statistics, 343, 344 pdf and, 194 percentiles and, 199 pmf and, 119, 121, 122 transformation and, 258 Cumulative frequency, 25 Cumulative relative frequency, 25
D Danger of extrapolation, 716 Data, 1 bivariate, 2, 706, 720, 783 categorical (see Categorical data) censoring of, 31, 409–410 characteristics of, 1 collection of, 5–7 definition of, 1 multivariate, 2, 19 qualitative, 18 univariate, 2 Deductive reasoning, 4 Degrees of freedom (df), 34 in ANOVA, 644–647, 676, 688 for chi-squared distribution, 380–382 in chi-squared tests, 825, 831, 833, 842 for F distribution, 385 in regression, 718, 771 sample variance and, 35 for Studentized range distribution, 654 for t distribution, 383, 458, 575, 581 type II error and, 663 Delta method, 207 De Morgan’s laws, 55 Density, 16 conditional, 317–319 curve, 191 function (pdf), 191 joint, 279 marginal, 281 scale, 17 Density curve, 191 Density scale, 16 Dependence, 87–91, 283–288, 300, 319, 844 Dependent, 88, 284, 704 Dependent events, 87–91 Dependent variable, 704 Descriptive statistics, 1–39 Design matrix, 796 Deviations from the mean, 33 Dichotomous trials, 144
968 Difference statistic, 411 Discrete, 113 Discrete random variable(s) conditional pmf for, 317 cumulative distribution function of, 119–122 definition of, 113 expected value of, 126 joint pmf of (see Joint probability mass function) marginal pmf of, 279 mean of, 127 moment generating of, 139 pmf of (see Probability mass function) standard deviation of, 132 transformation of, 261 variance of, 132 Disjoint events, 53 Dotplots, 11 Double-blind experiment, 605 Dummy variable, 787 Dunnett’s method, 660
E Effect, 662 Efficiency, 442 Efficiency, asymptotic relative, 867, 876 Efficient estimator, 442 Empirical rule, 221 Erlang distribution, 238, 272 Error(s) estimated standard, 400, 731, 801 estimation, 402 family vs. individual, 659 measurement, 213, 249, 406, 550 prediction, 468, 741, 768 rounding, 35 standard, 400, 801 type I, 504 type II, 504 Error Sum of Squares (SSE), 643, 718 Estimated regression function, 760, 768, 772 Estimated regression line, 713, 714 Estimated standard error, 99, 400, 731, 801 Estimator, 398, 582 Event(s), 51 complement of, 52 compound, 51, 61 definition of, 51 dependent, 87–91 disjoint, 53 exhaustive, 80 independent, 87–91
Index indicator function for, 430 intersection of, 52 mutually exclusive, 53 mutually independent, 90 simple, 51 union of, 52 Venn diagrams for, 53 Expected counts, 824 Expected mean squares in ANOVA, 662, 666, 690, 704 F test and, 679, 682, 690, 693 in mixed effects model, 682, 692 in random effects model, 668, 682–683 in regression, 765 Expected or mean value, 203 Expected value, 127 conditional, 320 of a continuous random variable, 203 covariance and, 296 of a discrete random variable, 126 of a function, 129, 294 heavy-tailed distribution and, 129, 135 of jointly distributed random variables, 294 Law of Large Numbers and, 376 of a linear combination, 303 of mean squares (see Expected mean squares) moment generating function and, 139, 208 moments and, 137 in order statistics, 342–343, 348 of sample mean, 346, 368 of sample standard deviation, 405, 446 of sample total, 368 of sample variance, 405 Experiment, 49 binomial, 144, 285, 823 definition of, 50 double-blind, 605 observational studies in, 571 paired data, 596 paired vs. independent samples, 603 randomized block, 680–682 randomized controlled, 571 repeated measures designs, 681 simulation, 363–366 Explanatory variable, 704 Exponential distribution, 234 censored experiments and, 409 chi-squared distribution and, 381 confidence interval for parameter, 457
double, 550 estimators for parameter, 409, 416 goodness-of-fit test for, 835 mixed, 272 in pure birth process, 445 shifted, 427, 552 skew in, 346 standard gamma distribution and, 234 Weibull distribution and, 239 Exponential random variable(s) Box–Muller transformation and, 342 cdf of, 235 expected value of, 234 independence of, 288 mean of, 234 in order statistics, 343, 347 pdf of, 234 transformation of, 258, 338, 340 variance of, 234 Exponential regression model, 820 Exponential smoothing, 48 Extreme outliers, 36–39 Extreme value distribution, 253
F Factorial notation, 69 Factorization theorem, 429 Factors, 639 Failure rate function, 274 Family of probability distributions, 118, 250 F distribution chi-squared distribution and, 385 definition of, 385 expected value of, 387 for model utility test, 735, 772, 799 noncentral, 663 pdf of, 386 Finite population correction factor, 167 First quartile, 36 Fisher information, 436 Fisher information I(h), 438 Fisher–Irwin test, 607 Fisher transformation, 751 Fitted (or predicted) values, 678, 688, 717, 719, 760, 770 Fixed effects, 667 Fixed effects model, 667, 682, 693 Fourth spread, 36, 357 Frequency, 12 Frequency distribution, 12 Friedman’s test, 882, 886, 888 F test
Index in ANOVA, 647, 683, 690 Bartlett’s test and, 649 coefficient of determination and, 772 critical values for, 386, 612, 646 distribution and, 385, 612, 646 for equality of variances, 612, 621 expected mean squares and, 663, 679, 682, 690, 693 Levene test and, 649 power curves and, 509, 521 P-value for, 613, 614, 622, 647 in regression, 772 sample sizes for, 663 single-factor, 670 vs. t test, 664 two-factor, 682 type II error in, 663
G Galton, 333–334, 724, 749 Galton–Watson branching process, 352 Gamma distribution, 231 chi-squared distribution and, 380 definition of, 231 density function for, 232 Erlang distribution and, 238 estimation of parameters, 416, 421, 424 exponential distribution and, 234–236 Poisson distribution and, 901 standard, 231 Weibull distribution and, 239 Gamma function, 231 incomplete, 233, 253 properties of, 231 Gamma random variables, 232 Geometric distribution, 169, 188 Geometric random variables, 169 Global F test, 774 Goodness-of-fit test for composite hypotheses, 829 definition of, 823 for homogeneity, 841–843 for independence, 844–847 simple, 823 Gossett, 654 Grand mean, 642
H Half-normal plot, 258 Hat matrix, 798 Histogram bimodal, 17
969 class intervals in, 14–16 construction of, 2 density, 16, 17, 928 multimodal, 17 Pareto diagram, 25 for pmf, 117 symmetric, 17 unimodal, 17 Hodges–Lehmann estimator, 446 Homogeneity, 841–843 Honestly Significant Difference (HSD), 656 Hyperexponential distribution, 272 Hypergeometric distribution, 165–167 and binomial distribution, 167 Hypergeometric random variable, 165–167 Hyperparameters, 894 Hypothesis alternative, 502 composite, 829–836 definition of, 502 errors in testing of, 504–509 notation for, 501 null, 502 research, 502 simple, 542 Hypothetical population, 5
I Ideal power function, 546 Inclusion-exclusion principle, 61 Inclusive inequalities, 152 Incomplete beta function, 244 Incomplete gamma function, 233, 253 Independence chi-squared test for, 844 conditional distribution and, 319 correlation coefficient and, 300 covariance and, 299, 302 of events, 87–90 of jointly distributed random variables, 283–284, 287 in linear combinations, 303–304 mutual, 90 pairwise, 92, 107 in simple random sample, 359 Independent, 88, 284, 287, 704 Independent and identically distributed (iid), 359 Independent variable, 704 Indicator (or dummy) variable, 787 Indicator function, 430 Inductive reasoning, 4 Inferential statistics, 4–5 Inflection point, 213
Intensity function, 187 Interaction, 675, 687, 689–691, 692–695, 785–790 Interaction effect, 786 Interaction parameters, 687 Interaction plots, 675 Interaction sum of squares, 690 Interaction term, 786 Intercept, 250, 706, 715 Intercept coefficient, 706 Interpolation, 717 Interquartile range (iqr), 36 Intersection of events definition of, 52 multiplication rule for probability of, 78–80, 88 Invariance principle, 423 Inverse cdf method, 175
J Jacobian, 337, 340 Jensen's inequality, 275 Joint cumulative distribution function, 352 Jointly distributed random variables bivariate normal distribution of, 330–334 conditional distribution of, 317–326 correlation coefficients for, 299 covariance between, 296 expected value of function of, 294 independence of, 283–284 linear combination of, 303–309 in order statistics, 346–348 pdf of, (see Joint probability density functions) pmf of (see Joint probability mass functions) transformation of, 336–341 variance of function of, 302, 305 Jointly sufficient statistics, 431 Joint marginal density function, 293 Joint pdf, 285 Joint pmf, 285 Joint probability density function, 279 Joint probability mass function, 277–279 Joint probability table, 278
K k-out-of-n system, 153 Kruskal—Wallis test, 880–881 Kth central moment, 137 Kth moment, 137
970 Kth moment about the mean, 137 Kth moment of the distribution, 416 Kth population moment, 416 Kth sample moment, 416 k-tuple, 67–68
L Lag 1 autocorrelation coefficient, 757 Laplace distribution, 550 Laplace's rule of succession, 900 Largest extreme value distribution, 271 Law of Large Numbers, 376–377, 384–385, 443 Law of total probability, 80 Least squares estimates, 714, 731, 763, 768–769 Least Squares Regression Line (LSRL), 714 Level a test, 508 Level of a factor, 639, 673, 682 Levels, 639 Levene test, 649–650 Leverages, 802 Likelihood function, 420, 543, 547 Likelihood ratio chi-squared statistic for, 550 definition of, 543 mle and, 548 model utility test and, 820 in Neyman—Pearson theorem, 543 significance level and, 543, 544 sufficiency and, 447 tests, 548 Likelihood ratio test, 548 Likelihood ratio test statistic, 548 Limiting relative frequency, 57, 58 Linear combination, 303 distribution of, 310 expected value of, 303 independence in, 303 variance of, 304 Linear probabilistic model, 703, 715 Linear regression additive model for, 704, 705, 767 ANOVA in, 734, 790 confidence intervals in, 730, 739 correlation coefficient in, 745–754 definition of, 706 degrees of freedom in, 718, 771, 798 least squares estimates in, 713–723, 764 likelihood ratio test in, 820 mles in, 718
Index model utility test in, 648, 772, 798 parameters in, 706, 713–723, 767 percentage of explained variation in, 720–721 prediction interval in, 737, 741, 775 residuals in, 717, 758, 771 summary statistics in, 715 sums of squares in, 718–723, 771 t ratio in, 732, 751 Line graph, 116–117 Location parameter, 253 Logistic distribution, 349 Logistic regression model contingency tables for, 847–848 definition of, 807–808 fit of, 808–809 mles in, 809 in multiple regression analysis, 790 Logit function, 807, 808 Log-likelihood function, 420 Lognormal distribution, 242–244, 376 Lognormal random variables, 242–243 Long-run (or limiting) relative frequency, 58 Lower confidence bound, 460 Lower confidence bound for l, 464 Lower quartile, 36
M Main effects for factor A, 687 Main effects for factor B, 687 Mann—Whitney test, 871–878 Marginal distribution, 279, 281, 317 Marginal probability density functions, 281 Marginal probability mass functions, 279 Margin of error, 457 Matrices in regression analysis, 795–804 Maximum a posteriori (MAP), 897 Maximum likelihood estimator, 420 for Bernoulli parameter, 443 for binomial parameter, 443 Cramér—Rao inequality and, 442 data sufficiency for, 434 Fisher information and, 436 for geometric distribution parameter, 631 in goodness-of-fit testing, 830 in homogeneity test, 841 in independence test, 844
in likelihood ratio tests, 547 in linear regression, 720, 726 in logistic regression, 808 sample size and, 424 score function and, 443 McNemar's test, 611, 636 Mean conditional, 320–321 correction for the, 644 deviations from the, 33, 244, 650, 718, 830 of a function, 129, 294 vs. median, 28 moments about, 137 outliers and, 27, 28 population, 27 regression to the, 334, 749 sample, 26 Mean square expected, 663, 679, 682, 683, 693 lack of fit, 766 pure error, 766 Mean square error definition of, 403 of an estimator, 403 MVUE and, 407 sample size and, 406 Mean Square for Error (MSE), 403, 645 Mean Square for Treatments (MSTr), 645 Mean-square value, 133 Mean value, 127 Mean vector, 799 Measurement error, 406 Median, 198 in boxplot, 37–38 of a distribution, 27–28 as estimator, 444 vs. mean, 28 outliers and, 27, 28 population, 28 sample, 26 statistic, 342 Memoryless property, 236 Mendel's law of inheritance, 826, 827 M-estimator, 425, 448 Method of Moments Estimators (MMEs), 416 Midfourth, 47 Midrange, 399 Mild outlier, 38 Minimal sufficient statistic, 432, 434 Minimize absolute deviations principle, 763 Minimum variance unbiased estimator, 407
Index Mixed effects model, 682 Mixed exponential distribution, 283 mle. See Maximum likelihood estimate Mode, 46 of a continuous distribution, 271 of a data set, 46 of a discrete distribution, 186 Model equation, 662 Model utility test, 732 Moment generating function, 139, 208 of a Bernoulli rv, 139 of a binomial rv, 151 of a chi-squared rv, 316 of a continuous rv, 208 definition of, 139, 208 of a discrete rv, 139 of an exponential rv, 258 of a gamma rv, 232 of a linear combination, 309 and moments, 141, 208 of a negative binomial rv, 169 of a normal rv, 221 of a Poisson rv, 160 of a sample mean, 395 uniqueness property of, 140, 208 Moments definition of, 137 method of, 397, 416, 418, 424 and moment generating function, 208 Monotonic, 259 Multimodal histogram, 17 Multinomial distribution, 286 Multinomial experiment, 286, 823, 824 Multiple comparisons procedure, 653 Multiple logit function, 812 Multiple regression additive model, 767, 795 categorical variables in, 787, 790 coefficient of multiple determination, 772 confidence intervals in, 776 covariance matrices in, 799 degrees of freedom in, 771 diagnostic plots, 777 fitted values in, 769 F ratio in, 774 interaction in models for, 785, 788, 790 leverages in, 802, 803 model utility test in, 772 normal equations in, 768, 796 parameters for, 767 and polynomial regression, 783 prediction interval in, 776
971 principle of least squares in, 768, 796 residuals in, 775, 777 squared multiple correlation in, 721 sums of squares in, 771 Multiplication rule, 78, 79, 82, 88, 101 Multiplicative exponential regression model, 820 Multiplicative power regression model, 820 Multivariate, 2 Multivariate hypergeometric distribution, 291 Mutually exclusive events, 53, 92 Mutually independent events, 90 MVUE. See Minimum variance unbiased estimator
N Negative binomial distribution, 168–170 Negative binomial random variable, 168 estimation of parameters, 417 Negatively skewed, 17 Newton’s binomial theorem, 169 Neyman factorization theorem, 429 Neyman-Pearson theorem, 543–545, 547 Noncentrality parameter, 497, 663, 671 Noncentral F distribution, 663 Noncentral t distribution, 497 Nonhomogeneous Poisson Process, 187 Nonparametric methods, 855–888 Nonstandard normal distribution, 218 Normal distribution, 213 asymptotic, 372, 436, 443 binomial distribution and, 223–224, 375 bivariate, 330–333, 389, 550, 754 confidence interval for mean of, 452–454, 456, 460, 463 continuity correction and, 223–224 density curves for, 213 and discrete random variables, 222, 223 of linear combination, 389 lognormal distribution and, 242, 376 nonstandard, 218 pdf for, 212
percentiles for, 215–217, 220, 221, 248 probability plot, 247 standard, 214 t distribution and, 383–386 z table, 214–216 Normal equations, 714, 768, 805 Normal probability plot, 247 Normal random variable, 214 Null distribution, 517, 862 Null hypothesis, 502 Null hypothesis of homogeneity, 841 Null hypothesis of independence, 845 Null set, 53, 56 Null value, 503, 512
O Observational study, 571 Observed counts, 824 Odds, 808 Odds ratio, 808, 847–848 One-sample t CI, 464 One-sided confidence interval, 460 One-way ANOVA, 639 Operating characteristic curve, 154 Ordered categories, 847–848 Ordered pairs, 66–67 Order statistics, 342–347, 402, 431–432, 552, 856 sufficiency and, 431–432 Outliers, 37 in a boxplot, 37–39 extreme, 37 leverage and, 802 mean and, 27, 29, 490–491 median and, 27, 29, 37, 490, 491 mild, 37 in regression analysis, 763
P Paired data in before/after experiments, 594, 611 bootstrap procedure for, 622–624 confidence interval for, 592–594 definition of, 591 vs. independent samples, 597 in McNemar′s test, 611 permutation test for, 624 t test for, 594–596 in Wilcoxon signed-rank test, 864–866 Pairwise average, 868, 869, 876 Pairwise independence, 107 Parallel connection, 54, 90, 92, 93, 343, 344
972 Parameter(s), 118 confidence interval for, 457, 458 estimator for a, 397–410 Fisher information on, 436–444 goodness-of-fit tests for, 827–829, 830–833 hypothesis testing for, 503, 526 location, 253, 432 maximum likelihood estimate of, 420–425, 434 moment estimators for, 416–418 MVUE of, 407–409, 424, 434, 442 noncentrality, 663 null value of, 503 of a probability distribution, 118–119 in regression, 706–707, 713–723, 741, 749, 767, 808 scale, 232, 240, 253–255, 431 shape, 254–255, 431–434 sufficient estimation of, 428 Parameter space, 547 Pareto diagram, 25 Pareto distribution, 201, 211, 262 Partial F test, 793 pdf. See Probability density function Pearson’s chi-squared, 825 Percentiles for continuous random variables, 198–200 in hypothesis testing, 534 in probability plots, 247–253 sample, 29, 248, 253 of standard normal distribution, 215–217, 247–253 Permutation, 68 Permutation test, 619–625 PERT analysis, 245 Pie chart, 18 Pivotal quantity, 457 Plot probability, 247–256, 435, 577, 750, 777 scatter, 704–706, 720–721, 746, 750 pmf. See Probability mass function Point estimate/estimator, 398 biased, 406–408 bias of, 403–406 bootstrap techniques for, 410, 484–492 bound on the error of estimation, 457 censoring and, 409–410 consistency, 424, 443–444 for correlation coefficient, 749–750
Index and Cramér–Rao inequality, 441–444 definition of, 27, 359, 398 efficiency of, 442 Fisher information on, 436–444 least squares, 714–718 maximum likelihood (mle), 418–425 of a mean, 27, 359, 398, 407 mean squared error of, 403 moments method, 416–418, 424 MVUE of, 407, 424, 434, 442 notation for, 397, 399 of a standard deviation and, 399, 405 standard error of, 410 of a variance, 399, 405 Point prediction, 468, 716 Poisson distribution, 156 Erlang distribution and, 238 expected value, 160, 164 exponential distribution and, 235 gamma distribution and, 896 goodness-of-fit tests for, 833–835 in hypothesis testing, 542–544 mode of, 186 moment generating function for, 160 parameter of, 160 and Poisson process, 160–161, 235 variance, 160, 164 Poisson process, 160–161 nonhomogeneous, 187 Polynomial regression model, 783–784 Pooled, 582 Pooled t procedures and ANOVA, 549, 581–582, 664 vs. Wilcoxon rank-sum procedures, 875 Population, 1 Population mean, 27 Population median, 28 Population (or true) regression line, 707 Population standard deviation, 34 Positively skewed, 17 Posterior distribution, 890 Posterior probability, 80–83, 899, 900 Power, 509 Power curves, 509, 663–664 Power function, 545 Power function of a test, 545–547, 663–664 Power model for regression, 820
Power of a test Neyman–Pearson theorem and, 545–547 type II error and, 522, 544–548, 582 Precision, 400, 451, 456, 468, 478, 594, 682, 895 Predicted values, 678, 717, 770 Prediction interval, 469, 741 Bonferroni, 742 vs. confidence interval, 469, 741–742, 776 in linear regression, 738, 741–742 in multiple regression, 775 for normal distribution, 467–469 Prediction level, 469, 742, 776 Predictor, 704 Predictor variable, 704, 767, 783–786 Principle of least squares, 713–723, 763, 768, 777 Prior distribution, 889 Prior probability, 80, 889 Probability, 49 conditional, 75–83, 86–88, 236, 317–319, 428, 431–432 continuous random variables and, 114, 189–262, 279–288, 317–319 counting techniques for, 66–72 definition of, 49 density function (see Probability density function) of equally likely outcomes, 61–62 histogram, 118, 190–191, 222–224, 361–362 inferential statistics and, 4, 9, 357 Law of Large Numbers and, 376–377, 384–385 law of total, 80 mass function (see Probability mass function) of null event, 56 plots, 247–256, 435, 569, 750, 760, 777 posterior/prior, 80–82, 889, 897, 900 properties of, 55–62 relative frequency and, 57–58, 363–364 sample space and, 49–53, 55–56, 62, 66, 109 and Venn diagrams, 53, 60, 76 Probability density function (pdf), 191 conditional, 318–320
definition of, 191 joint, 277–348, 383, 420, 429–431, 434, 542, 547 marginal, 281–283, 339–340 vs. pmf, 192 Probability distribution, 116, 191 Bernoulli, 112, 116–119, 128, 139–140, 143, 150, 375, 441, 442, 443, 892 beta, 244–245 binomial, 144–151, 157–159, 223–224, 375, 417–418, 475–476, 503–507 bivariate normal, 330–334, 550, 752 Cauchy, 263, 274, 342 chi-squared, 261, 380–382, 408 conditional, 317–326 continuous, 114, 189–276 discrete, 111–188 exponential, 234–236, 239, 409 extreme value, 253–254 F, 385–386 family, 118, 250, 253–256, 646 gamma, 230–237, 253–254 geometric, 121–122, 129, 169, 261 hyperexponential, 272 hypergeometric, 165–167, 305–306 joint, 277–356, 749–750, 830 Laplace, 317, 550–551 of a linear combination, 303–311, 331 logistic, 349 lognormal, 242–244, 376 multinomial, 286, 823 negative binomial, 168–170 normal, 213–225, 242–253, 330–334, 368–376, 388, 828 parameter of a, 118–119 Pareto, 201, 211, 262 Poisson, 156–161, 235 Rayleigh, 200, 263, 414, 427 of a sample mean, 357–366, 368–377 standard normal, 214–217 of a statistic, 357–377 Studentized range, 654 symmetric, 17, 29, 138, 200, 203, 213 t, 383–385, 386, 536, 594 uniform, 192–193, 195 Weibull, 239–241 Probability generating function, 187 Probability histogram, 118 Probability mass function, 116 conditional, 317–318 definition of, 116–123
joint, 277–281 marginal, 279 Probability of the event, 55 Probability plot, 247 Product rules, 66–67 Proportion population, 475, 526–529, 602–607 sample, 225, 375, 413, 602, 845 trimming, 29, 399, 406, 408–409 Pure birth process, 445 P-value, 532 for chi-squared test, 826–827 definition of, 532 for F tests, 613–615 for t tests, 536–539 type I error and, 534 for z tests, 534–536
Q Quadratic regression model, 783 Qualitative data, 18 Quantile, 198, 225, 237, 365, 382, 387, 458, 490, 618, 829, 855–857, 899 Quantitative characteristic, 1 Quartiles, 36
R Random effects, 667 Random effects model, 667–668, 682–683, 692–695 Random interval, 453–455 Random number generator, 6, 74, 95, 174, 265, 854 Randomized block experiment, 680–682 Randomized controlled experiment, 571 Randomized response technique, 415 Random sample, 343, 359 Random variable, 111, 112 continuous, 189–275 definition of, 112 discrete, 111–188 jointly distributed, 277–356 standardizing of, 218 types of, 113 Range, 32 definition of, 32 in order statistics, 342–345 population, 458 sample, 32, 342–345 Studentized, 654–655 Rank average, 886
Rao-Blackwell theorem, 433–434, 443, 444 Ratio statistic, 550 Rayleigh distribution, 263, 414, 427 Regression coefficient, 727–735, 767–770, 795–797, 799–800 effect, 334, 749 function, 704, 760, 766, 771, 783, 788 line, 707–709, 713–723, 727–732 linear, 706–709, 713–723, 727–734, 737–742 logistic, 806–809 matrices for, 795–804 to the mean, 334 multiple, 767–776 multiplicative exponential model, 820 multiplicative power model for, 820 plots for, 760–764 polynomial, 783–784 quadratic, 783–784 through the origin, 448–496 Regression effect, 749 Regression sum of squares, 723 Regression to the mean, 749 Rejection region, 504 cutoff value for, 504–508 definition of, 504 lower-tailed, 505, 513–514 in Neyman–Pearson theorem, 543–547 two-tailed, 513 type I error and, 504 in union-intersection test, 637 upper-tailed, 506, 513–514 Relative frequency, 12–18, 57–58 Repeated-measures, 681 Repeated measures designs, 681 Replications, 57, 363–365, 455, 672 Resample, 485 Research hypothesis, 502 Residual plots, 678, 688, 760–763 Residuals in ANOVA, 648, 677, 691, 717 definition of, 643 leverages and, 801–802 in linear regression, 717, 758–762 in multiple regression, 771 standard error, 757 standardizing of, 758, 777 variance of, 758, 801 Residual standard deviation, 718, 771 Residual sum of squares, 718
Residual vector, 798 Response, 704 Response variable, 6, 704, 807 Response vector, 796 Robust estimator, 409 Ryan-Joiner test, 256
S Sample, 1 convenience, 6 definition of, 1 outliers in, 37–38 simple random, 6, 359 size of (see Sample size) stratified, 6 Sample coefficient of variation, 44 Sample correlation coefficient, 746 in linear regression, 745–746, 751 vs. population correlation coefficient, 749, 751–754 properties of, 746–747 strength of relationship, 747 Sample mean, 26 definition of, 26 population mean and, 368–377 sampling distribution of, 368–377 Sample median, 27 definition of, 27 in order statistics, 342–343 vs. population median, 491 Sample moments, 416 Sample percentiles, 248 Sample proportion, 225 Sample size, 9 in ANOVA, 663–664 asymptotic relative efficiency and, 867–875 Central Limit Theorem and, 375 confidence intervals and, 456–457, 458, 464 definition of, 9 in finite population correction factor, 167 for F test, 663–664 for Levene test, 649–650 mle and, 424, 443 noncentrality parameter and, 663–664, 671 Poisson distribution and, 156 for population proportion, 475–478 probability plots and, 252 in simple random sample, 359 type I error and, 508, 516, 517, 570, 605
type II error and, 508, 516, 517, 527, 569, 582 variance and, 377 z test and, 515–517, 527–528 determination, 457, 466–467, 516, 522, 528, 570–571, 605 Sample space, 50 definition of, 49 probability of, 55–62 Venn diagrams for, 53 Sample standard deviation, 33 in bootstrap procedure, 487 confidence bounds and, 460 confidence intervals and, 464 definition of, 33 as estimator, 406, 446 expected value of, 405, 446 independence of, 389, 390 mle and, 423 population standard deviation and, 359, 405, 446 sample mean and, 33, 389 sampling distribution of, 360, 361, 383, 406, 446 variance of, 561 Sample total, 368 Sample variance, 33 in ANOVA, 642 calculation of, 35 definition of, 33 distribution of, 359–362, 383 expected value of, 405 population variance and, 34, 388–389, 402 Sampling distribution, 359 bootstrap procedure and, 485, 617, 855 definition of, 357, 359 derivation of, 360–363 of intercept coefficient, 818 of mean, 360–362, 481–484 permutation tests and, 855 simulation experiments for, 363–366 of slope coefficient, 727–734 Scale parameter, 231, 239–240, 253–254, 431 Scatter plot, 704–705 Scheffé method, 702 Score function, 439–441 Segmented bar chart, 843 Series connection, 343–344 Set theory, 51–53 Shape parameters, 254–255, 432 Shapiro-Wilk test, 256 Siegel–Tukey test, 888 Signed-rank interval, 867 Signed ranks, 862 Significance
practical, 553–554, 836 statistical, 554, 571, 836 Significance level, 508 definition of, 508 joint distribution and, 552 likelihood ratio and, 542 observed, 533 Sign interval, 860 Sign test, 858, 860 Simple events, 51, 61, 66 Simple hypothesis, 542, 829 Simple random sample, 6 definition of, 6, 359 independence in, 359 sample size in, 359 Simulation experiment, 359, 363–366, 491, 538 Single-classification, 639 Single-factor, 639 Skewed data coefficient of skewness, 138, 211 definition of, 17 in histograms, 17, 487 mean vs. median in, 28 measure of, 138 probability plot of, 253, 486–487 Skewness coefficient, 138 Slope, 706–707, 715, 728, 730, 808 Slope coefficient, 706 confidence interval for, 730 definition of, 706–707 hypothesis tests for, 732 least squares estimate of, 714 in logistic regression model, 808 Standard beta distribution, 244 Standard deviation, 132, 205 normal distribution and, 213 of point estimator, 400–402 population, 133, 205 of a random variable, 133, 205 sample, 32 z table and, 218 Standard error, 150, 400–402 Standard error of the mean, 369 Standard gamma distribution, 231 Standardized residuals, 758 Standardized variable, 218 Standard normal distribution, 214 Cauchy distribution and, 342 chi-squared distribution and, 381 critical values of, 217 definition of, 214 density curve properties for, 214–217 F distribution and, 385, 387 percentiles of, 215–217 t distribution and, 387, 391 Standard normal random variable, 214, 387
Statistic, 359 Statistical hypothesis, 501 Stem-and-leaf display, 9–11 Step function, 121 Stratified samples, 6 Studentized range distribution, 654 Student t distribution, 383 Sufficient, 429 Sufficient statistic(s), 429, 430, 432–436, 443, 444, 446, 447 Summary statistics, 715, 719, 731, 755 Sum of squares error, 643, 718, 722 interaction, 690 lack of fit, 766 pure error, 766 regression, 723, 797 total, 677, 734, 773 treatment, 644–648 Support, 116, 191 Symmetric, 17, 200 Symmetric distribution, 17, 138, 200
T Taylor series, 207, 667 t confidence interval heavy tails and, 867, 869, 876 in linear regression, 730, 738 in multiple regression, 776 one-sample, 463–465 paired, 594–596 pooled, 582 two-sample, 575, 596 t critical value, 463 t distribution central, 497 chi-squared distribution and, 391, 579, 590 critical values of, 384, 463, 524, 536 definition of, 390 degrees of freedom in, 390, 475 density curve properties for, 384, 463 F distribution and, 663 noncentral, 497 standard normal distribution and, 383, 384, 464 Student, 383–385 Test of hypotheses, 502 Test statistic, 503, 504 Third quartile, 36 Time series, 48, 757 Tolerance interval, 469 Total sum of squares, 644, 720 Transformation, 167, 258–262, 336–341
Treatment, 640, 642–643, 672, 673 Treatment sum of squares SSTr, 643 Tree diagram, 67–68, 79, 82, 89 Trial, 144–147 Trimmed mean, 29 definition of in order statistics, 342–343 outliers and, 29 as point estimator, 398 population mean and, 406, 409 Trimming proportion, 29, 409 True (or population) regression coefficients, 767 True (or population) regression function, 767 True regression line, 707–709, 713, 727–728 t test vs. F test, 664 heavy tails and, 867, 869, 876 likelihood ratio and, 547, 548 in linear regression, 734 in multiple regression, 775–777, 822 one-sample, 463–465, 536, 547–549, 592, 875 paired, 592 pooled, 581–582, 664 P-value for, 536–537 two-sample, 575–578, 596, 664 type I error and, 517–519, 576 type II error and, 517–520, 582 vs. Wilcoxon rank-sum test, 875 vs. Wilcoxon signed-rank test, 866–867 Tukey's procedure, 654–659, 666, 679–680, 691 Two one-sided tests, 637 Two-proportion z interval, 606 Two-sample t confidence interval for µ1 − µ2, 575 Two-sample t test, 575 Two-way contingency table, 840 Type I error, 504 definition of, 544 Neyman–Pearson theorem and, 543 power function of the test and, 545 P-value and, 532–533 sample size and, 516 significance level and, 508 vs. type II error, 508 Type II error, 504 definition of, 504 vs. type I error, 508 Type II error probability in ANOVA, 663–665, 699 degrees of freedom and, 597
for F test, 663–665, 686 in linear regression, 736 Neyman–Pearson theorem and, 542–545 power of the test and, 546 sample size and, 515, 549–550, 554, 580, 582 in tests concerning means, 515 in tests concerning proportions, 527–528, 605–606 t test and, 582 vs. type I error probability, 508 in Wilcoxon rank-sum test, 875 in Wilcoxon signed-rank test, 866–867
U Unbiased estimator, 400–410 minimum variance, 406–408 Unbiased tests, 547 Uncorrelated, 300 Uncorrelated random variables, 300, 304 Uniform distribution, 192 beta distribution and, 893 Box–Muller transformation and, 342 definition of, 192 discrete, 135 transformation and, 260–261 Uniformly most powerful (UMP) level α test, 547 Uniformly most powerful test, 545–546 Unimodal histogram, 17–18 Union-intersection test, 637 Union of events, 51 Univariate data, 2 Upper confidence bound, 460 Upper confidence bound for μ, 464 Upper quartile, 36
V Variable(s), 2 covariate, 790 in a data set, 9 definition of, 1 dependent, 704 dummy, 787–789 explanatory, 704 independent, 704 indicator, 787–789 predictor, 704 random, 110 response, 704 Variable utility test, 775 Variance, 132, 205
conditional, 319–321 confidence interval, 481–484 of a function, 134–135, 207–208, 355 of a linear function, 134–137, 305 population, 133, 205 precision and, 895 of a random variable, 132, 205 sample, 32–36 Variances, comparing two, 611–616 Venn diagram, 53, 60, 76, 77
W Walsh averages, 868 Weibull distribution, 239 basics of, 239–243 chi-squared distribution and, 274 estimation of parameters, 422, 425–426 extreme value distribution and, 253 probability plot, 253–254 Weighted average, 127, 203, 323, 582, 895 Weighted least squares, 763 Weighted least squares estimates, 763 Wilcoxon rank-sum interval, 876 Wilcoxon rank-sum test, 871–875 Wilcoxon signed-rank interval, 869 Wilcoxon signed-rank test, 861–867
Z z confidence interval for a correlation coefficient, 753 for a difference between means, 581 for a difference between proportions, 609
for a mean, 456 for a proportion, 475 z critical values, 217 z curve area under, maximizing of, 552 rejection region and, 513 t curve and, 384 z test chi-squared test and, 850 for a correlation coefficient, 753 for a difference between means, 565–581 for a difference between proportions, 603 for a mean, 514, 519 for a Poisson parameter, 462, 561 for a proportion, 527 P-value for, 534